Distributed R

Developer(s): HP
Stable release: 1.2.0 / 22 October 2015 [1]
Written in: C++, R
Operating system: Linux
Type: Machine learning algorithms
License: GNU General Public License
Website: www.distributedr.org

Distributed R is an open source, high-performance platform for the R language. It splits tasks across multiple processing nodes to reduce execution time and to analyze large data sets. Distributed R enhances R by adding distributed data structures, parallelism primitives to run functions on distributed data, a task scheduler, and multiple data loaders. [2] It is mostly used to implement distributed versions of machine learning tasks. Distributed R is written in C++ and R, and retains the familiar look and feel of R. As of February 2015, Hewlett-Packard (HP) provides enterprise support for Distributed R with proprietary additions, such as a fast data loader from the Vertica database. [3]


History

Distributed R began in 2011 as a research project at HP Labs, started by Indrajit Roy, Shivaram Venkataraman, Alvin AuYoung, and Robert S. Schreiber. [4] It was open sourced in 2014 under the GPLv2 license and is available on GitHub.

In February 2015, Distributed R reached its first stable release, version 1.0, along with enterprise support from HP. [5]

Components

Distributed R is a platform to implement and execute distributed applications in R. The goal is to extend R for distributed computing while retaining the simplicity and look and feel of R. Distributed R consists of the following components, which the example below sketches in combination:

- Distributed data structures, such as distributed arrays and data frames, whose partitions are spread across the nodes of a cluster
- Parallelism primitives that execute a function in parallel on the partitions of distributed data
- A task scheduler that assigns tasks to worker nodes
- Data loaders that move data from files and databases into the distributed data structures
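The following is a minimal sketch of that programming model, based on the primitives described in the Presto paper [2]: darray declares a partitioned array, foreach runs a function on partitions in parallel, splits selects a partition, and update publishes changes back to the distributed object. Exact signatures may differ between releases.

    library(distributedR)
    distributedR_start()                     # start the master and workers

    # A 100x100 distributed array, split into ten 10x100 row partitions
    da <- darray(dim = c(100, 100), blocks = c(10, 100), data = 1)

    # Run one task per partition in parallel; update() publishes the result
    foreach(i, 1:npartitions(da), function(part = splits(da, i)) {
      part <- part * 2
      update(part)
    })

    head(getpartition(da))                   # gather the result on the master
    distributedR_shutdown()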

Integration with databases

HP Vertica provides tight integration between its database and the open source Distributed R platform. HP Vertica 7.1 includes features that enable fast, parallel loading from the Vertica database into Distributed R. This parallel Vertica loader can be more than five times (5x) faster than traditional ODBC-based connectors. The Vertica database also supports deployment of machine learning models inside the database: Distributed R users can call the distributed algorithms to create machine learning models, deploy them in the Vertica database, and use the models for in-database scoring and prediction. Architectural details of the integration between the Vertica database and Distributed R are described in a SIGMOD 2015 paper. [6]
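As a purely hypothetical illustration of that workflow, the sketch below loads a Vertica table into a distributed data frame and fits a model on it. The loader db2dframe and the algorithm hpdglm follow the naming of HP's add-on packages, but the package names, function names, and arguments here are assumptions to verify against the shipped documentation.

    library(distributedR)
    library(HPdata)        # assumed package providing the Vertica loader
    distributedR_start()

    # Parallel load: each worker pulls its partitions directly from Vertica,
    # avoiding a single ODBC channel (the source of the claimed 5x speedup)
    ads <- db2dframe(tableName = "public.ad_clicks", dsn = "VerticaDSN")

    # Fit a distributed logistic regression (hpdglm is assumed to be
    # provided by HP's distributed algorithm packages)
    model <- hpdglm(responses = ads$clicked,
                    predictors = ads[, c("impressions", "spend")],
                    family = binomial)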

Related Research Articles

Data mining is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Load balancing (computing)

In computing, load balancing refers to the process of distributing a set of tasks over a set of resources with the aim of making their overall processing more efficient. Load balancing techniques can optimize the response time for each task and avoid overloading some compute nodes while others sit idle.
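As a concrete example (not from the article), the base-R sketch below implements one classic static technique, greedy longest-processing-time scheduling: tasks are sorted by decreasing cost and each is assigned to the currently least-loaded node.

    # Greedy longest-processing-time (LPT) load balancing
    balance <- function(task_times, n_nodes) {
      load <- numeric(n_nodes)                  # running load per node
      assignment <- integer(length(task_times)) # chosen node per task
      for (t in order(task_times, decreasing = TRUE)) {
        node <- which.min(load)                 # least-loaded node so far
        assignment[t] <- node
        load[node] <- load[node] + task_times[t]
      }
      list(assignment = assignment, load = load)
    }

    balance(c(7, 3, 5, 2, 8, 4), n_nodes = 3)   # loads end up as 10, 10, 9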

IBM Db2 Family

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object-relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

NonStop SQL is a commercial relational database management system that is designed for fault tolerance and scalability, currently offered by Hewlett Packard Enterprise. The latest version is SQL/MX 3.4.

Theoretical computer science

Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on the more mathematical topics of computing, including the theory of computation.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
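The model is easiest to see on a single machine. The base-R sketch below counts words with the three MapReduce phases; a real framework such as Hadoop runs each phase distributed across a cluster.

    docs <- c("to be or not to be", "to err is human")

    # Map phase: emit one (word, 1) pair per word in every document
    words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

    # Shuffle phase: group the emitted values by key (the word)
    groups <- split(rep(1, length(words)), words)

    # Reduce phase: fold each group's values into a single count
    counts <- vapply(groups, function(v) Reduce(`+`, v), numeric(1))
    counts   # named vector of word frequencies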

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Bigtable is a compressed, high performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable and a few other Google technologies. On May 6, 2015, a public version of Bigtable was made available as a service. Bigtable also underlies Google Cloud Datastore, which is available as a part of the Google Cloud Platform.

Vertica

Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker and Andrew Palmer. Palmer was the founding CEO. Ralph Breslauer and Christopher P. Lynch served as later CEOs.

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface and use of JDBC allows assembly of nodes blending different data sources, including preprocessing, for modeling, data analysis and visualization without, or with only minimal, programming.

ELKI

ELKI is a data mining software framework developed for use in research and teaching. It was originally developed at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and is now continued at the Technical University of Dortmund, Germany. It aims to allow the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications that require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

Apache Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Eclipse Deeplearning4j is a deep learning programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

Apache SINGA

Apache SINGA is an Apache top-level project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned.
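A minimal base-R sketch of the simplest approach, grid search: try every candidate value of a hyperparameter (here, the degree of a polynomial regression) and keep the one with the lowest held-out error.

    set.seed(1)
    x <- runif(120)
    y <- sin(2 * pi * x) + rnorm(120, sd = 0.2)
    train <- data.frame(x = x[1:80],   y = y[1:80])
    test  <- data.frame(x = x[81:120], y = y[81:120])

    degrees <- 1:8                                   # the hyperparameter grid
    heldout_mse <- sapply(degrees, function(d) {
      fit <- lm(y ~ poly(x, d), data = train)        # train with degree d
      mean((predict(fit, newdata = test) - test$y)^2)
    })
    best_degree <- degrees[which.min(heldout_mse)]   # selected hyperparameter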

References

  1. "Release 1.2.0". 22 October 2015. Retrieved 20 July 2018.
  2. Venkataraman, Shivaram; Bodzsar, Erik; Roy, Indrajit; AuYoung, Alvin; Schreiber, Robert S. (2013). "Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices" (PDF). European Conference on Computer Systems (EuroSys). Archived from the original (PDF) on 2015-03-01.
  3. Gagliordi, Natalie. "HP adds scale to open-source R in latest big data platform". ZDNet. Retrieved 17 February 2015.
  4. Venkataraman, Shivaram; Roy, Indrajit; AuYoung, Alvin; Schreiber, Robert S. (2012). "Using R for Iterative and Incremental Processing". Workshop on Hot Topics in Cloud Computing (HotCloud).
  5. "HP Delivers Predictive Analytics at Big Data Scale". hp.com. 17 February 2015. Retrieved 17 February 2015.
  6. Prasad, Shreya; Fard, Arash; Gupta, Vishrut; Martinez, Jorge; LeFevre, Jeff; Xu, Vincent; Hsu, Meichun; Roy, Indrajit (2015). "Enabling predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction". ACM SIGMOD International Conference on Management of Data.