PivotalR is an R front end to PostgreSQL and Pivotal (Greenplum) databases and a wrapper for the open-source machine learning library MADlib. It also works with Pivotal HD/HAWQ for big data analytics. The package provides an interface to tables and views in the database that behaves much like a data.frame, so R users can work on database objects without having to learn SQL.
The package lets R users operate on data sets that are too large to fit in R's memory, and use R scripts to leverage an MPP database and its in-database analytics libraries. It also minimizes the amount of data transferred between R and the database. The data stays in the database: when the user enters R commands, the package translates them into SQL queries and sends them to the database for parallel execution, and only the computed result is returned to R. Users thereby combine the analytical power of the database with the graphics facilities of R for plotting the results.
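The translation from R commands to SQL can be seen in a short session. The following is a minimal sketch, not a definitive recipe: the connection settings (host, port, database name, user) and the table name `abalone_demo` are assumptions for illustration, and it needs a reachable PostgreSQL or Greenplum database to run. It uses the `abalone` sample data set shipped with PivotalR.

```r
library(PivotalR)

# Assumed connection settings; replace them with those of your own database.
cid <- db.connect(host = "localhost", port = 5432,
                  dbname = "demo", user = "gpadmin", password = "")

# Copy the sample data set into a database table (hypothetical table name),
# dropping any earlier copy first.
delete("abalone_demo", conn.id = cid)
x <- as.db.data.frame(abalone, "abalone_demo", conn.id = cid)

# Subsetting with data.frame-style syntax builds a query object;
# no rows are pulled into R at this point.
y <- x[x$rings > 10, c("sex", "length", "rings")]

# Show the SQL that PivotalR generated for the subset.
content(y)

# Only now is the query executed in the database,
# and just the first 5 rows are transferred to R.
lookat(y, 5)

db.disconnect(cid)
```

Because `y` only describes a query, the cost of the subset is paid in the database, and R receives just the rows requested through `lookat`.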
PivotalR provides the core R infrastructure and over 50 analytical functions in R that leverage in-database execution. These include:
- Data Connectivity – db.connect, db.disconnect, db.Rquery
- Data Exploration – db.data.frame, subsets
- R language features – dim, names, min, max, nrow, ncol, summary, etc.
- Reorganization Functions – merge, by (group-by), samples
- Transformations – as.factor, null replacement
- Algorithms – linear regression and logistic regression wrappers for MADlib
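Several of the function groups above can be combined into one workflow. The sketch below is illustrative only: the connection settings and the table name `abalone_demo` are assumptions, the formulas are arbitrary examples, and the two model calls require MADlib to be installed in the target database.

```r
library(PivotalR)

# Data connectivity: assumed settings, replace with your own.
cid <- db.connect(host = "localhost", port = 5432,
                  dbname = "demo", user = "gpadmin", password = "")

# Load the abalone sample data shipped with PivotalR into a table
# (hypothetical table name), dropping any earlier copy first.
delete("abalone_demo", conn.id = cid)
as.db.data.frame(abalone, "abalone_demo", conn.id = cid)

# Data exploration: wrap the existing table without copying it into R.
x <- db.data.frame("abalone_demo", conn.id = cid)

# Familiar R functions operate on the database object.
dim(x)
names(x)

# Algorithms: linear regression through the MADlib wrapper.
fit.lm <- madlib.lm(rings ~ length + diameter + height, data = x)
fit.lm

# Algorithms: logistic regression through the MADlib wrapper.
fit.glm <- madlib.glm(rings < 10 ~ length + diameter + height,
                      data = x, family = "binomial")
fit.glm

db.disconnect(cid)
```

The models are fitted inside the database; only the fitted coefficients and summary statistics come back to the R session.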