Time: 14:40 - 15:25
Modern science is becoming increasingly data-driven; molecular biology, for example, produces experimental data on the scale of gigabytes and terabytes. All scientific data require statistical evaluation. Statistical tools such as R, SAS, and the Python and MATLAB libraries allow for effective data analysis but are typically not well suited to data storage and management. A common analysis pipeline therefore involves loading the data into a database, selecting and manipulating them there, writing the resulting subsets to a text file, and importing that file into the data analysis language of one's choice for the statistical analysis, which incurs significant overhead. One workaround is to connect to the database from the statistical computing environment or, conversely, to connect to the statistical environment from the database. In this talk, I will show how large amounts of data can be analysed directly in the database using the PL/R language extension for PostgreSQL. In particular, I will show how to use custom aggregate functions to pass lists of data values pre-selected in the database to R, and how to perform statistical tests and real-time plotting with built-in R functions. I will also show how the full R functionality available in PL/R makes it possible to use external libraries, for example to produce print-ready plots. Passing several lists of values from the database to PL/R is still cumbersome, however, and I will show a workaround for this. Taken together, PL/R provides a powerful and extensible means of performing on-the-fly data analysis that avoids additional writes to disk, which streamlines the analysis and potentially makes it more efficient. There is probably still a long way to go before this combination of tools gains broad acceptance, and I will discuss some ways in which we can make this happen.
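The custom-aggregate idea can be illustrated outside PL/R as well. The sketch below is not the PL/R approach from the talk; it is an analogous, minimal example that assumes Python's standard `sqlite3` and `statistics` modules in place of PostgreSQL and R. The table `measurements`, the aggregate `median_agg`, and the helper `group_medians` are hypothetical names chosen for illustration. The aggregate collects the values selected by the query and reduces them to a statistic inside the query itself, so no intermediate text file is written.

```python
import sqlite3
import statistics


class MedianAgg:
    """User-defined aggregate: collect one group's values, then reduce them."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per selected row; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group, after all of its rows have been seen.
        return statistics.median(self.values) if self.values else None


def group_medians(rows):
    """Load (group, value) rows into an in-memory table (hypothetical schema)
    and return the per-group median computed by the custom aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    # Register the aggregate under the SQL name median_agg, taking 1 argument.
    conn.create_aggregate("median_agg", 1, MedianAgg)
    result = conn.execute(
        "SELECT grp, median_agg(value) FROM measurements "
        "GROUP BY grp ORDER BY grp"
    ).fetchall()
    conn.close()
    return result
```

Calling `group_medians` with a list of `(group, value)` tuples returns one `(group, median)` pair per group. The selection and grouping stay in SQL, and only the final statistic leaves the database, which is the overhead saving the abstract describes.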