Data parallelism

Last updated June 05, 2024

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

A data parallel job on an array of n elements can be divided equally among all the processors. Let us assume we want to sum all the elements of the given array and the time for a single addition operation is Ta time units. In the case of sequential execution, the time taken by the process will be n×Ta time units as it sums up all the elements of an array. On the other hand, if we execute this job as a data parallel job on 4 processors the time taken would reduce to (n/4)×Ta + merging overhead time units. Parallel execution results in a speedup of 4 over sequential execution. One important thing to note is that the locality of data references plays an important part in evaluating the performance of a data parallel programming model. Locality of data depends on the memory accesses performed by the program as well as the size of the cache.

History

Exploitation of the concept of data parallelism started in 1960s with the development of the Solomon machine.^[1] The Solomon machine, also called a vector processor, was developed to expedite the performance of mathematical operations by working on a large data array (operating on multiple data in consecutive time steps). Concurrency of data operations was also exploited by operating on multiple data at the same time using a single instruction. These processors were called 'array processors'.^[2] In the 1980s, the term was introduced ^[3] to describe this programming style, which was widely used to program Connection Machines in data parallel languages like C*. Today, data parallelism is best exemplified in graphics processing units (GPUs), which use both the techniques of operating on multiple data in space and time using a single instruction.

Most data parallel hardware supports only a fixed number of parallel levels, often only one. This means that within a parallel operation it is not possible to launch more parallel operations recursively, and means that programmers cannot make use of nested hardware parallelism. The programming language NESL was an early effort at implementing a nested data-parallel programming model on flat parallel machines, and in particular introduced the flattening transformation that transforms nested data parallelism to flat data parallelism. This work was continued by other languages such as Data Parallel Haskell and Futhark, although arbitrary nested data parallelism is not widely available in current data-parallel programming languages.

Description

In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different distributed data. In some situations, a single execution thread controls operations on all the data. In others, different threads control the operation, but they execute the same code.

For instance, consider matrix multiplication and addition in a sequential manner as discussed in the example.

Example

Below is the sequential pseudo-code for multiplication and addition of two matrices where the result is stored in the matrix C. The pseudo-code for multiplication calculates the dot product of two matrices A, B and stores the result into the output matrix C.

If the following programs were executed sequentially, the time taken to calculate the result would be of the $O(n^{3})$ (assuming row lengths and column lengths of both matrices are n) and $O(n)$ for multiplication and addition respectively.

// Matrix multiplicationfor(i=0;i<row_length_A;i++){for(k=0;k<column_length_B;k++){sum=0;for(j=0;j<column_length_A;j++){sum+=A[i][j]*B[j][k];}C[i][k]=sum;}}

// Array additionfor(i=0;i<n;i++){c[i]=a[i]+b[i];}

We can exploit data parallelism in the preceding code to execute it faster as the arithmetic is loop independent. Parallelization of the matrix multiplication code is achieved by using OpenMP. An OpenMP directive, "omp parallel for" instructs the compiler to execute the code in the for loop in parallel. For multiplication, we can divide matrix A and B into blocks along rows and columns respectively. This allows us to calculate every element in matrix C individually thereby making the task parallel. For example: A[m x n] dot B [n x k] can be finished in $O(n)$ instead of $O(m*n*k)$ when executed in parallel using m*k processors.

Data parallelism in matrix multiplication Data Parallelism in matrix multiplication.png — Data parallelism in matrix multiplication

// Matrix multiplication in parallel#pragma omp parallel for schedule(dynamic,1) collapse(2)for(i=0;i<row_length_A;i++){for(k=0;k<column_length_B;k++){sum=0;for(j=0;j<column_length_A;j++){sum+=A[i][j]*B[j][k];}C[i][k]=sum;}}

It can be observed from the example that a lot of processors will be required as the matrix sizes keep on increasing. Keeping the execution time low is the priority but as the matrix size increases, we are faced with other constraints like complexity of such a system and its associated costs. Therefore, constraining the number of processors in the system, we can still apply the same principle and divide the data into bigger chunks to calculate the product of two matrices.^[4]

For addition of arrays in a data parallel implementation, let's assume a more modest system with two central processing units (CPU) A and B, CPU A could add all elements from the top half of the arrays, while CPU B could add all elements from the bottom half of the arrays. Since the two processors work in parallel, the job of performing array addition would take one half the time of performing the same operation in serial using one CPU alone.

The program expressed in pseudocode below—which applies some arbitrary operation, foo, on every element in the array d—illustrates data parallelism:^{[nb 1]}

if CPU = "a" then     lower_limit := 1     upper_limit := round(d.length / 2) else if CPU = "b" then     lower_limit := round(d.length / 2) + 1     upper_limit := d.length  for i from lower_limit to upper_limit by 1 do     foo(d[i])

In an SPMD system executed on 2 processor system, both CPUs will execute the code.

Data parallelism emphasizes the distributed (parallel) nature of the data, as opposed to the processing (task parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.

Steps to parallelization

The process of parallelizing a sequential program can be broken down into four discrete steps.^[5]

Type	Description
Decomposition	The program is broken down into tasks, the smallest exploitable unit of concurrence.
Assignment	Tasks are assigned to processes.
Orchestration	Data access, communication, and synchronization of processes.
Mapping	Processes are bound to processors.

Data parallelism vs. task parallelism

Data parallelism	Task parallelism
Same operations are performed on different subsets of same data.	Different operations are performed on the same or different data.
Synchronous computation	Asynchronous computation
Speedup is more as there is only one execution thread operating on all sets of data.	Speedup is less as each processor will execute a different thread or process on the same or different set of data.
Amount of parallelization is proportional to the input data size.	Amount of parallelization is proportional to the number of independent tasks to be performed.
Designed for optimum load balance on multi processor system.	Load balancing depends on the availability of the hardware and scheduling algorithms like static and dynamic scheduling.

Data parallelism vs. model parallelism

Data parallelism	Model parallelism
Same model is used for every thread but the data given to each of them is divided and shared.	Same data is used for every thread, and model is split among threads.
It is fast for small networks but very slow for large networks since large amounts of data needs to be transferred between processors all at once.	It is slow for small networks and fast for large networks.
Data parallelism is ideally used in array and matrix computations and convolutional neural networks	Model parallelism finds its applications in deep learning

^[6]

Mixed data and task parallelism

Data and task parallelism, can be simultaneously implemented by combining them together for the same application. This is called Mixed data and task parallelism. Mixed parallelism requires sophisticated scheduling algorithms and software support. It is the best kind of parallelism when communication is slow and number of processors is large.^[7]

Mixed data and task parallelism has many applications. It is particularly used in the following applications:

Mixed data and task parallelism finds applications in the global climate modeling. Large data parallel computations are performed by creating grids of data representing Earth's atmosphere and oceans and task parallelism is employed for simulating the function and model of the physical processes.
In timing based circuit simulation. The data is divided among different sub-circuits and parallelism is achieved with orchestration from the tasks.

Data parallel programming environments

A variety of data parallel programming environments are available today, most widely used of which are:

Message Passing Interface: It is a cross-platform message passing programming interface for parallel computers. It defines the semantics of library functions to allow users to write portable message passing programs in C, C++ and Fortran.
Open Multi Processing^[8] (Open MP): It's an Application Programming Interface (API) which supports shared memory programming models on multiple platforms of multiprocessor systems. Since version 4.5, OpenMP is also able to target devices other than typical CPUs. It can program FPGAs, DSPs, GPUs and more. It is not confined to GPUs like OpenACC.
CUDA and OpenACC: CUDA and OpenACC (respectively) are parallel computing API platforms designed to allow a software engineer to utilize GPU's computational units for general purpose processing.
Threading Building Blocks and RaftLib: Both open source programming environments that enable mixed data/task parallelism in C/C++ environments across heterogeneous resources.

Applications

Data parallelism finds its applications in a variety of fields ranging from physics, chemistry, biology, material sciences to signal processing. Sciences imply data parallelism for simulating models like molecular dynamics,^[9] sequence analysis of genome data ^[10] and other physical phenomenon. Driving forces in signal processing for data parallelism are video encoding, image and graphics processing, wireless communications ^[11] to name a few.

Data-intensive computing

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.^[12]

Notes

↑ Some input data (e.g. when d.length evaluates to 1 and round rounds towards zero [this is just an example, there are no requirements on what type of rounding is used]) will lead to lower_limit being greater than upper_limit, it's assumed that the loop will exit immediately (i.e. zero iterations will occur) when this happens.

Related Research Articles

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbours, stores the result within itself and passes it downstream. Systolic arrays were first used in Colossus, which was an early computer used to break German Lorenz ciphers during World War II. Due to the classified nature of Colossus, they were independently invented or rediscovered by H. T. Kung and Charles Leiserson who described arrays for many dense linear algebra computations for banded matrices. Early applications include computing greatest common divisors of integers and polynomials. They are sometimes classified as multiple-instruction single-data (MISD) architectures under Flynn's taxonomy, but this classification is questionable because a strong argument can be made to distinguish systolic arrays from any of Flynn's four categories: SISD, SIMD, MISD, MIMD, as discussed later in this article.

In computer science, array programming refers to solutions that allow the application of operations to an entire set of values at once. Such solutions are commonly used in scientific and engineering settings.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

Matrix chain multiplication is an optimization problem concerning the most efficient way to multiply a given sequence of matrices. The problem is not actually to perform the multiplications, but merely to decide the sequence of the matrix multiplications involved. The problem may be solved using dynamic programming.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers $x 0, x 1, x 2, ...$ is a second sequence of numbers $y 0, y 1, y 2, ...$ , the sums of prefixes of the input sequence:

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.

In computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing.

Ateji PX is an object-oriented programming language extension for Java. It is intended to facilliate parallel computing on multi-core processors, GPU, Grid and Cloud.

SequenceL is a general purpose functional programming language and auto-parallelizing compiler and tool set, whose primary design objectives are performance on multi-core processor hardware, ease of programming, platform portability/optimization, and code clarity and readability. Its main advantage is that it can be used to write straightforward code that automatically takes full advantage of all the processing power available, without programmers needing to be concerned with identifying parallelisms, specifying vectorization, avoiding race conditions, and other challenges of manual directive-based programming approaches such as OpenMP.

Programming with Big Data in R (pbdR) is a series of R packages and an environment for statistical computing with big data by using high-performance statistical computation. The pbdR uses the same programming language as R with S3/S4 classes and methods which is used among statisticians and data miners for developing statistical software. The significant difference between pbdR and R code is that pbdR mainly focuses on distributed memory systems, where data are distributed across several processors and analyzed in a batch mode, while communications between processors are based on MPI that is easily used in large high-performance computing (HPC) systems. R system mainly focuses on single multi-core machines for data analysis via an interactive mode such as GUI interface.

Multidimensional Digital Signal Processing (MDSP) refers to the extension of Digital signal processing (DSP) techniques to signals that vary in more than one dimension. While conventional DSP typically deals with one-dimensional data, such as time-varying audio signals, MDSP involves processing signals in two or more dimensions. Many of the principles from one-dimensional DSP, such as Fourier transforms and filter design, have analogous counterparts in multidimensional signal processing.

A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.

Futhark is a multi-paradigm, high-level, functional, data parallel, array programming language. It is a dialect of the language ML, originally developed at UCPH Department of Computer Science (DIKU) as part of the HIPERFIT project. It focuses on enabling data parallel programs written in a functional style to be executed with high performance on massively parallel hardware, especially graphics processing units (GPUs). Futhark is strongly inspired by NESL, and its implementation uses a variant of the flattening transformation, but imposes constraints on how parallelism can be expressed in order to enable more aggressive compiler optimisations. In particular, irregular nested data parallelism is not supported. It is free and open-source software released under an ISC license.

References

↑ "The Solomon Computer".
↑ "SIMD/Vector/GPU" (PDF). Retrieved 2016-09-07.
↑ Hillis, W. Daniel and Steele, Guy L., Data Parallel Algorithms Communications of the ACMDecember 1986
↑ Barney, Blaise. "Introduction to Parallel Computing". computing.llnl.gov. Archived from the original on 2013-06-10. Retrieved 2016-09-07.
↑ Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.
↑ "How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism". Tim Dettmers. 2014-11-09. Retrieved 2016-09-13.
↑ "The Netlib" (PDF).
↑ "OpenMP.org". openmp.org. Archived from the original on 2016-09-05. Retrieved 2016-09-07.
↑ Boyer, L. L; Pawley, G. S (1988-10-01). "Molecular dynamics of clusters of particles interacting with pairwise forces using a massively parallel computer". Journal of Computational Physics. 78 (2): 405–423. Bibcode:1988JCoPh..78..405B. doi:10.1016/0021-9991(88)90057-5.
↑ Yap, T.K.; Frieder, O.; Martino, R.L. (1998). "Parallel computation in biological sequence analysis". IEEE Transactions on Parallel and Distributed Systems. 9 (3): 283–294. CiteSeerX 10.1.1.30.2819 . doi:10.1109/71.674320.
↑ Singh, H.; Lee, Ming-Hau; Lu, Guangming; Kurdahi, F.J.; Bagherzadeh, N.; Filho, E.M. Chaves (2000-06-01). "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications". IEEE Transactions on Computers. 49 (5): 465–481. doi:10.1109/12.859540. ISSN 0018-9340.
↑ Handbook of Cloud Computing, "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.

Hillis, W. Daniel and Steele, Guy L., Data Parallel Algorithms Communications of the ACM December 1986
Blelloch, Guy E, Vector Models for Data-Parallel Computing MIT Press 1990. ISBN 0-262-02313-X

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[5] Some input data (e.g. when d.length evaluates to 1 and round rounds towards zero [this is just an example, there are no requirements on what type of rounding is used]) will lead to lower_limit being greater than upper_limit, it's assumed that the loop will exit immediately (i.e. zero iterations will occur) when this happens.

[1] "The Solomon Computer".

[2] "SIMD/Vector/GPU" (PDF). Retrieved 2016-09-07.

[3] Hillis, W. Daniel and Steele, Guy L., Data Parallel Algorithms Communications of the ACMDecember 1986

[4] Barney, Blaise. "Introduction to Parallel Computing". computing.llnl.gov. Archived from the original on 2013-06-10. Retrieved 2016-09-07.

[Solihin-6] Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.

[7] "How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism". Tim Dettmers. 2014-11-09. Retrieved 2016-09-13.

[8] "The Netlib" (PDF).

[9] "OpenMP.org". openmp.org. Archived from the original on 2016-09-05. Retrieved 2016-09-07.

[10] Boyer, L. L; Pawley, G. S (1988-10-01). "Molecular dynamics of clusters of particles interacting with pairwise forces using a massively parallel computer". Journal of Computational Physics. 78 (2): 405–423. Bibcode:1988JCoPh..78..405B. doi:10.1016/0021-9991(88)90057-5.

[11] Yap, T.K.; Frieder, O.; Martino, R.L. (1998). "Parallel computation in biological sequence analysis". IEEE Transactions on Parallel and Distributed Systems. 9 (3): 283–294. CiteSeerX 10.1.1.30.2819 . doi:10.1109/71.674320.

[12] Singh, H.; Lee, Ming-Hau; Lu, Guangming; Kurdahi, F.J.; Bagherzadeh, N.; Filho, E.M. Chaves (2000-06-01). "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications". IEEE Transactions on Computers. 49 (5): 465–481. doi:10.1109/12.859540. ISSN 0018-9340.

[13] Handbook of Cloud Computing, "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.

[1]

[2]

[3]

[4]

[nb 1]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing