Programming with Big Data in R

pbdR
Paradigm	SPMD and MPMD
Designed by	Wei-Chen Chen, George Ostrouchov, Pragneshkumar Patel, and Drew Schmidt
Developer	pbdR Core Team
First appeared	September 2012
Preview release	Through GitHub at RBigData
Typing discipline	Dynamic
OS	Cross-platform
License	General Public License and Mozilla Public License
Website	www.r-pbd.org
Influenced by	R, C, Fortran, MPI, and ØMQ

Last updated February 29, 2024

Programming with Big Data in R (pbdR)^[1] is a series of R packages and an environment for statistical computing with big data using high-performance statistical computation.^[2]^[3] pbdR uses the same programming language as R, with S3/S4 classes and methods, which is widely used among statisticians and data miners for developing statistical software. The significant difference between pbdR and R code is that pbdR mainly focuses on distributed memory systems, where data are distributed across several processors and analyzed in batch mode, while communication between processors is based on MPI, which is readily used on large high-performance computing (HPC) systems. The R system mainly focuses^[citation needed] on single multi-core machines for data analysis via an interactive mode such as a GUI.

pbdR, built on pbdMPI, uses SPMD parallelism, in which every processor is considered a worker and owns part of the data. SPMD parallelism, introduced in the mid-1980s, is particularly efficient in homogeneous computing environments for large data, for example, performing singular value decomposition on a large matrix or clustering analysis on high-dimensional large data. On the other hand, nothing prevents the use of manager/workers parallelism within an SPMD environment.
Rmpi^[4] uses manager/workers parallelism, in which one main processor (the manager) controls all other processors (the workers). Manager/workers parallelism, introduced around the early 2000s, is particularly efficient for large tasks on small clusters, for example, the bootstrap method and Monte Carlo simulation in applied statistics, since the i.i.d. assumption is commonly used in most statistical analyses. In particular, task-pull parallelism performs better for Rmpi in heterogeneous computing environments.

The idea of SPMD parallelism is to let every processor do the same amount of work, but on different parts of a large data set. For example, a modern GPU is a large collection of slower co-processors that simply apply the same computation to different parts of relatively smaller data, yet SPMD parallelism still provides an efficient way to obtain final solutions (i.e., time to solution is shorter).^[5]
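
The SPMD idea can be sketched with a few pbdMPI calls. The following is a minimal illustration rather than code from the pbdR documentation: the data (the integers 1 to 1000) and the names n_total, bounds, and x_local are hypothetical, and the contiguous block split is only one possible way to divide the work. Every rank runs the same script, sums its own block, and a single allreduce combines the partial sums.

```r
### SPMD sketch: every rank runs this same script on its own slice of the data.
suppressMessages(library(pbdMPI, quietly = TRUE))
init()

rank <- comm.rank()   # 0, 1, ..., size - 1
size <- comm.size()

### Hypothetical data: the integers 1..n_total, split into contiguous blocks.
n_total <- 1000
bounds  <- (n_total * (0:size)) %/% size          # exact integer block boundaries
x_local <- seq_len(n_total)[(bounds[rank + 1] + 1):bounds[rank + 2]]

### Local work, then one collective to combine the partial results.
local_sum   <- sum(as.double(x_local))
global_sum  <- allreduce(local_sum, op = "sum")   # every rank gets the total
global_mean <- global_sum / n_total

comm.print(global_mean)   # the mean of 1..1000 is 500.5, whatever the rank count
finalize()
```

Saved as a script, this would be launched in the same way as the examples later in the article, for instance with mpiexec -np 4 Rscript followed by the script name. The result does not depend on the number of ranks, because each rank contributes only the sum of its own block.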

Package design

Programming with pbdR requires the use of various packages developed by the pbdR core team. The packages are the following.

General	I/O	Computation	Application	Profiling	Client/Server
pbdDEMO	pbdNCDF4	pbdDMAT	pmclust	pbdPROF	pbdZMQ
pbdMPI	pbdADIOS	pbdBASE	pbdML	pbdPAPI	remoter
		pbdSLAP		hpcvis	pbdCS
		kazaam			pbdRPC

Among these packages, pbdMPI provides wrapper functions to the MPI library, and it also produces a shared library and a configuration file for MPI environments. All other packages rely on this configuration for installation and library loading, which avoids the difficulty of library linking and compiling, and they can therefore use MPI functions directly.

pbdMPI --- an efficient interface to MPI, either OpenMPI or MPICH2, with a focus on the Single Program/Multiple Data (SPMD) parallel programming style
pbdSLAP --- bundles scalable dense linear algebra libraries in double precision for R, based on ScaLAPACK version 2.0.2, which includes several scalable linear algebra packages (namely BLACS, PBLAS, and ScaLAPACK)
pbdNCDF4 --- interface to Parallel Unidata NetCDF4 format data files
pbdBASE --- low-level ScaLAPACK codes and wrappers
pbdDMAT --- distributed matrix classes and computational methods, with a focus on linear algebra and statistics
pbdDEMO --- set of package demonstrations and examples, and a unifying vignette
pmclust --- parallel model-based clustering using pbdR
pbdPROF --- profiling package for MPI codes and visualization of parsed stats
pbdZMQ --- interface to ØMQ
remoter --- R client with remote R servers
pbdCS --- pbdR client with remote pbdR servers
pbdRPC --- remote procedure call
kazaam --- very tall and skinny distributed matrices
pbdML --- machine learning toolbox

Among those packages, the pbdDEMO package is a collection of 20+ package demos that show example uses of the various pbdR packages. It also contains a vignette that explains the demos in detail and provides some mathematical or statistical insight.
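
As a rough illustration of the MPI wrapper functions that pbdMPI provides, the sketch below calls three collectives: bcast, allgather, and reduce. It is an assumption-laden example rather than an excerpt from the package demos: the variable names and the value 42 are made up, and the script is meant to be launched through mpiexec like the examples in the next section.

```r
### Sketch of a few pbdMPI collective wrappers (launch with mpiexec).
suppressMessages(library(pbdMPI, quietly = TRUE))
init()

rank <- comm.rank()

### bcast: the value held by rank 0 is copied to every rank.
seed <- bcast(if (rank == 0) 42L else NA_integer_, rank.source = 0)

### allgather: every rank receives a list with one element per rank.
ranks <- unlist(allgather(rank))

### reduce: the sum is delivered to the destination rank only (rank 0 by default).
total <- reduce(rank, op = "sum")

comm.print(seed, all.rank = TRUE)   # every rank reports the broadcast value
comm.print(ranks)                   # rank 0 prints the gathered rank ids
comm.print(total)                   # rank 0 prints the sum of all rank ids

finalize()
```

The difference between reduce and allreduce is where the result ends up: reduce delivers it to a single rank, while allreduce delivers it to all of them.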

Examples

Example 1

Hello World! Save the following code in a file called "demo.r"

    ### Initial MPI
    library(pbdMPI, quiet = TRUE)
    init()

    comm.cat("Hello World!\n")

    ### Finish
    finalize()

and use the command

    mpiexec -np 2 Rscript demo.r

to execute the code, where Rscript is a command-line executable program.

Example 2

The following example, modified from pbdMPI, illustrates the basic syntax of pbdR. Since pbdR is designed around SPMD, all R scripts are stored in files and executed from the command line via mpiexec, mpirun, etc. Save the following code in a file called "demo.r"

    ### Initial MPI
    library(pbdMPI, quiet = TRUE)
    init()
    .comm.size <- comm.size()
    .comm.rank <- comm.rank()

    ### Set a vector x on all processors with different values
    N <- 5
    x <- (1:N) + N * .comm.rank

    ### All reduce x using summation operation
    y <- allreduce(as.integer(x), op = "sum")
    comm.print(y)
    y <- allreduce(as.double(x), op = "sum")
    comm.print(y)

    ### Finish
    finalize()

and use the command

    mpiexec -np 4 Rscript demo.r

to execute the code, where Rscript is a command-line executable program.
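
The values that Example 2 should print with four ranks can be worked out without MPI. The sketch below is a serial emulation in base R, not part of pbdMPI: it builds the vector each rank would hold and adds the vectors element by element, which is what a sum allreduce computes. The names n_ranks and x_by_rank are introduced here for illustration only.

```r
# Serial emulation of Example 2 for 4 ranks, in base R (no MPI required).
n_ranks <- 4
N <- 5

# Each row is the vector x that one rank would hold: (1:N) + N * rank.
x_by_rank <- t(sapply(0:(n_ranks - 1), function(rank) (1:N) + N * rank))

# allreduce(op = "sum") adds the vectors element by element across ranks.
y <- colSums(x_by_rank)
print(y)   # 34 38 42 46 50
```

Rank 0 holds 1 to 5, rank 1 holds 6 to 10, and so on, so the first element of the result is 1 + 6 + 11 + 16 = 34. Both allreduce calls in Example 2 should yield these values, once as integers and once as doubles.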

Example 3

The following example, modified from pbdDEMO, illustrates basic ddmatrix computation in pbdR by performing singular value decomposition on a given matrix. Save the following code in a file called "demo.r"

    # Initialize process grid
    library(pbdDMAT, quiet=T)
    if(comm.size() != 2)
      comm.stop("Exactly 2 processors are required for this demo.")
    init.grid()

    # Setup for the remainder
    comm.set.seed(diff=TRUE)
    M <- N <- 16
    BL <- 2 # blocking --- passing single value BL assumes BLxBL blocking
    dA <- ddmatrix("rnorm", nrow=M, ncol=N, mean=100, sd=10)

    # LA SVD
    svd1 <- La.svd(dA)
    comm.print(svd1$d)

    # Finish
    finalize()

and use the command

    mpiexec -np 2 Rscript demo.r

to execute the code, where Rscript is a command-line executable program.
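
For comparison, a serial analogue of the same computation can be sketched in base R without pbdDMAT or MPI. This is an illustration and not part of pbdDEMO: it fills an ordinary 16 x 16 matrix using the same distribution parameters and calls base R's La.svd. The seed is arbitrary, so the singular values will differ from those of the distributed run.

```r
# Serial analogue of Example 3 using base R only (no pbdDMAT, no MPI).
set.seed(1234)                      # arbitrary seed, chosen for reproducibility
M <- N <- 16
A <- matrix(rnorm(M * N, mean = 100, sd = 10), nrow = M, ncol = N)

# La.svd returns a list with d (singular values), u, and vt (the transpose of v).
svd_serial <- La.svd(A)
print(svd_serial$d)                 # 16 singular values, largest first

# Sanity check: u %*% diag(d) %*% vt rebuilds A up to rounding error.
A_rebuilt <- svd_serial$u %*% diag(svd_serial$d) %*% svd_serial$vt
print(max(abs(A - A_rebuilt)))
```

In the distributed version the matrix is a ddmatrix spread over the process grid, but the call and the shape of the result are intended to mirror this serial form.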

References

[1] Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P. (2012). "Programming with Big Data in R".
[2] Chen, W.-C. & Ostrouchov, G. (2011). "HPSC -- High Performance Statistical Computing for Data Intensive Research". Archived from the original on 2013-07-19. Retrieved 2013-06-25.
[3] "Basic Tutorials for R to Start Analyzing Data". 3 November 2022.
[4] Yu, H. (2002). "Rmpi: Parallel Statistical Computing in R". R News.
[5] Mike Houston. "Folding@Home - GPGPU". Retrieved 2007-10-04.
[6] "100 most read R posts in 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages".

External links

Official website

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
