Schedule - FOSDEM PGDay 2019

If the data do not come to R, then R must go to the data

Date: 2019-02-01
Time: 12:30–13:20
Room: Hotel

Modern molecular biology produces gigabytes to terabytes of experimental data that require statistical evaluation. Statistical environments such as R and SAS, and certain Python and MATLAB libraries, allow for effective data analysis but are typically suboptimal for data storage and management. A common analysis pipeline therefore involves loading the data into a database, selecting and manipulating them there, writing the resulting subsets to a text file, importing that file into the data analysis language of one’s choice, and performing the statistical analysis there. In this talk, I will show how large volumes of data can be analysed directly in the database using the PL/R language extension for PostgreSQL. In particular, I will show how to use custom aggregate functions to pass lists of values, pre-selected in the database according to given parameters, to R's built-in functions for statistical testing and real-time plotting. Since PL/R exposes the full functionality of R, it is also possible to use external libraries to produce print-ready plots. Passing several lists of values from the database to PL/R at once is still cumbersome, however, and I will show a workaround for this. Taken together, PL/R provides a powerful and extensible means of on-the-fly data analysis that avoids additional writes to disk and the use of multiple tools, and thus streamlines the analysis and potentially makes it more efficient. There is probably still a long way to go before this combination of tools gains broad acceptance, and I will discuss some ways in which we can make that happen.


Speaker: Olga Kalinina