Distributed R

Developer(s): HP
Stable release: 1.2.0 / 22 October 2015 [1]
Written in: C++, R
Operating system: Linux
Type: Machine learning algorithms
License: GNU General Public License
Website: www.distributedr.org

Distributed R is an open source, high-performance platform for the R language. It splits tasks across multiple processing nodes to reduce execution time and to analyze large data sets. Distributed R enhances R by adding distributed data structures, parallelism primitives to run functions on distributed data, a task scheduler, and multiple data loaders. [2] It is mostly used to implement distributed versions of machine learning tasks. Distributed R is written in C++ and R, and retains the familiar look and feel of R. As of February 2015, Hewlett-Packard (HP) provides enterprise support for Distributed R with proprietary additions, such as a fast data loader from the Vertica database. [3]


History

Distributed R began in 2011 as a research project at HP Labs, started by Indrajit Roy, Shivaram Venkataraman, Alvin AuYoung, and Robert S. Schreiber. [4] It was open sourced in 2014 under the GPLv2 license and is available on GitHub.

In February 2015, Distributed R reached its first stable release, version 1.0, along with enterprise support from HP. [5]

Components

Distributed R is a platform to implement and execute distributed applications in R. The goal is to extend R for distributed computing while retaining the simplicity and look and feel of R. Distributed R consists of the following components, which the example below sketches in combination:

- Distributed data structures, such as distributed arrays and data frames, whose partitions are spread across the nodes of a cluster
- Parallelism primitives that execute a function in parallel on the partitions of distributed data
- A task scheduler that assigns tasks to worker nodes
- Data loaders that move data from files and databases into the distributed data structures
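The following is a minimal sketch of that programming model, based on the primitives described in the Presto paper [2]: darray declares a partitioned array, foreach runs a function on partitions in parallel, splits selects a partition, and update publishes changes back to the distributed object. Exact signatures may differ between releases.

    library(distributedR)
    distributedR_start()                     # start the master and workers

    # A 100x100 distributed array, split into ten 10x100 row partitions
    da <- darray(dim = c(100, 100), blocks = c(10, 100), data = 1)

    # Run one task per partition in parallel; update() publishes the result
    foreach(i, 1:npartitions(da), function(part = splits(da, i)) {
      part <- part * 2
      update(part)
    })

    head(getpartition(da))                   # gather the result on the master
    distributedR_shutdown()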

Integration with databases

HP Vertica provides tight integration between its database and the open source Distributed R platform. HP Vertica 7.1 includes features that enable fast, parallel loading from the Vertica database into Distributed R. This parallel Vertica loader can be more than five times (5x) faster than traditional ODBC-based connectors. The Vertica database also supports deployment of machine learning models inside the database: Distributed R users can call the distributed algorithms to create machine learning models, deploy them in the Vertica database, and use the models for in-database scoring and prediction. Architectural details of the integration between the Vertica database and Distributed R are described in a SIGMOD 2015 paper. [6]
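As a purely hypothetical illustration of that workflow, the sketch below loads a Vertica table into a distributed data frame and fits a model on it. The loader db2dframe and the algorithm hpdglm follow the naming of HP's add-on packages, but the package names, function names, and arguments here are assumptions to verify against the shipped documentation.

    library(distributedR)
    library(HPdata)        # assumed package providing the Vertica loader
    distributedR_start()

    # Parallel load: each worker pulls its partitions directly from Vertica,
    # avoiding a single ODBC channel (the source of the claimed 5x speedup)
    ads <- db2dframe(tableName = "public.ad_clicks", dsn = "VerticaDSN")

    # Fit a distributed logistic regression (hpdglm is assumed to be
    # provided by HP's distributed algorithm packages)
    model <- hpdglm(responses = ads$clicked,
                    predictors = ads[, c("impressions", "spend")],
                    family = binomial)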

Related Research Articles

Data mining is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Load balancing (computing)

In computing, load balancing refers to the process of distributing a set of tasks over a set of resources with the aim of making their overall processing more efficient. Load balancing techniques can optimize the response time for each task and avoid overloading some compute nodes while others sit idle.
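As a concrete example (not from the article), the base-R sketch below implements one classic static technique, greedy longest-processing-time scheduling: tasks are sorted by decreasing cost and each is assigned to the currently least-loaded node.

    # Greedy longest-processing-time (LPT) load balancing
    balance <- function(task_times, n_nodes) {
      load <- numeric(n_nodes)                  # running load per node
      assignment <- integer(length(task_times)) # chosen node per task
      for (t in order(task_times, decreasing = TRUE)) {
        node <- which.min(load)                 # least-loaded node so far
        assignment[t] <- node
        load[node] <- load[node] + task_times[t]
      }
      list(assignment = assignment, load = load)
    }

    balance(c(7, 3, 5, 2, 8, 4), n_nodes = 3)   # loads end up as 10, 10, 9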

IBM Db2 Family

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object-relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

NonStop SQL is a commercial relational database management system that is designed for fault tolerance and scalability, currently offered by Hewlett Packard Enterprise. The latest version is SQL/MX 3.4.

Theoretical computer science

Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on the more mathematical topics of computing, including the theory of computation.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
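The model is easiest to see on a single machine. The base-R sketch below counts words with the three MapReduce phases; a real framework such as Hadoop runs each phase distributed across a cluster.

    docs <- c("to be or not to be", "to err is human")

    # Map phase: emit one (word, 1) pair per word in every document
    words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

    # Shuffle phase: group the emitted values by key (the word)
    groups <- split(rep(1, length(words)), words)

    # Reduce phase: fold each group's values into a single count
    counts <- vapply(groups, function(v) Reduce(`+`, v), numeric(1))
    counts   # named vector of word frequencies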

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Bigtable is a compressed, high performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable and a few other Google technologies. On May 6, 2015, a public version of Bigtable was made available as a service. Bigtable also underlies Google Cloud Datastore, which is available as a part of the Google Cloud Platform.

Vertica

Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker and Andrew Palmer. Palmer was the founding CEO. Ralph Breslauer and Christopher P. Lynch served as later CEOs.

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface and use of JDBC allows assembly of nodes blending different data sources, including preprocessing, for modeling, data analysis and visualization without, or with only minimal, programming.

ELKI

ELKI is a data mining software framework developed for use in research and teaching. It was originally developed at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and is now continued at the Technical University of Dortmund, Germany. It aims to allow the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications that require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

Apache Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Eclipse Deeplearning4j is a deep learning programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

Apache SINGA

Apache SINGA is an Apache top-level project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned.
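A minimal base-R sketch of the simplest approach, grid search: try every candidate value of a hyperparameter (here, the degree of a polynomial regression) and keep the one with the lowest held-out error.

    set.seed(1)
    x <- runif(120)
    y <- sin(2 * pi * x) + rnorm(120, sd = 0.2)
    train <- data.frame(x = x[1:80],   y = y[1:80])
    test  <- data.frame(x = x[81:120], y = y[81:120])

    degrees <- 1:8                                   # the hyperparameter grid
    heldout_mse <- sapply(degrees, function(d) {
      fit <- lm(y ~ poly(x, d), data = train)        # train with degree d
      mean((predict(fit, newdata = test) - test$y)^2)
    })
    best_degree <- degrees[which.min(heldout_mse)]   # selected hyperparameter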

References

  1. "Release 1.2.0". 22 October 2015. Retrieved 20 July 2018.
  2. Venkataraman, Shivaram; Bodzsar, Erik; Roy, Indrajit; AuYoung, Alvin; Schreiber, Robert S. (2013). "Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices" (PDF). European Conference on Computer Systems (EuroSys). Archived from the original (PDF) on 2015-03-01.
  3. Gagliordi, Natalie. "HP adds scale to open-source R in latest big data platform". ZDNet. Retrieved 17 February 2015.
  4. Venkataraman, Shivaram; Roy, Indrajit; AuYoung, Alvin; Schreiber, Robert S. (2012). "Using R for Iterative and Incremental Processing". Workshop on Hot Topics in Cloud Computing (HotCloud).
  5. "HP Delivers Predictive Analytics at Big Data Scale". hp.com. 17 February 2015. Retrieved 17 February 2015.
  6. Prasad, Shreya; Fard, Arash; Gupta, Vishrut; Martinez, Jorge; LeFevre, Jeff; Xu, Vincent; Hsu, Meichun; Roy, Indrajit (2015). "Enabling predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction". ACM SIGMOD International Conference on Management of Data.