Bulk synchronous parallel


The bulk synchronous parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It is similar to the parallel random access machine (PRAM) model, but unlike PRAM, BSP does not take communication and synchronization for granted. In fact, quantifying the requisite synchronization and communication is an important part of analyzing a BSP algorithm.


History

The BSP model was developed by Leslie Valiant of Harvard University during the 1980s. The definitive article was published in 1990. [1]

Between 1990 and 1992, Leslie Valiant and Bill McColl of Oxford University worked on ideas for a distributed memory BSP programming model, in Princeton and at Harvard. Between 1992 and 1997, McColl led a large research team at Oxford that developed various BSP programming libraries, languages and tools, and also numerous massively parallel BSP algorithms, including many early examples of high-performance communication-avoiding parallel algorithms [2] and recursive "immortal" parallel algorithms that achieve the best possible performance and optimal parametric tradeoffs. [3]

With interest and momentum growing, McColl then led a group from Oxford, Harvard, Florida, Princeton, Bell Labs, Columbia and Utrecht that developed and published the BSPlib Standard for BSP programming in 1996. [4]

Valiant developed an extension to the BSP model in the 2000s, leading to the publication of the Multi-BSP model in 2011. [5]

In 2017, McColl developed a major new extension of the BSP model that provides fault tolerance and tail tolerance for large-scale parallel computations in AI, analytics and high-performance computing (HPC). [6] [7]

The BSP model

Overview

A BSP computer consists of the following:

- components capable of processing and/or local memory transactions (i.e., processors),
- a network that routes messages between pairs of such components, and
- a hardware facility that allows for the synchronization of all or a subset of components.

This is commonly interpreted as a set of processors that may follow different threads of computation, with each processor equipped with fast local memory and interconnected by a communication network.

BSP algorithms rely heavily on the third feature; a computation proceeds in a series of global supersteps, each of which consists of three components:

- Concurrent computation: every participating processor may perform local computations, i.e., each process can only make use of values stored in the fast local memory of its processor. The computations occur asynchronously of each other but may overlap with communication.
- Communication: the processes exchange data between themselves to facilitate remote data storage.
- Barrier synchronization: when a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.

The computation and communication actions do not have to be ordered in time. Communication typically takes the form of the one-sided PUT and GET remote direct memory access (RDMA) calls rather than paired two-sided send and receive message-passing calls.

Figure: A BSP superstep. Processes lack a linear order and may be mapped to processors in any way.

The barrier synchronization concludes the superstep: it ensures that all one-sided communications are properly concluded. Systems based on two-sided communication include this synchronization cost implicitly for every message sent. The barrier synchronization method relies on the BSP computer's hardware facility. In Valiant's original paper, this facility periodically checks whether the end of the current superstep has been reached globally. The period of this check is denoted by L. [1]
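As a concrete illustration, the following minimal sketch shows one superstep written against the BSPlib interface [4]; it assumes a BSPlib implementation such as BSPonMPI or MulticoreBSP for C is installed and linked, and the variable names and neighbour-to-neighbour communication pattern are chosen purely for illustration. Each process performs a local computation, issues a one-sided PUT into a registered buffer on the next process, and then meets the others at the barrier, after which the received value may safely be read.

```c
#include <bsp.h>

void spmd_part( void ) {
    bsp_begin( bsp_nprocs() );           /* start the SPMD section on all available processors */
    const int p   = bsp_nprocs();
    const int pid = bsp_pid();

    /* A destination buffer must be registered before remote processes may PUT into it;
     * the registration takes effect after the next barrier. */
    double incoming = 0.0;
    bsp_push_reg( &incoming, sizeof( double ) );
    bsp_sync();

    /* Superstep: local computation ... */
    double local = 2.0 * pid;            /* stand-in for the real local work */

    /* ... one-sided communication: write our value into the next process ... */
    bsp_put( ( pid + 1 ) % p, &local, &incoming, 0, sizeof( double ) );

    /* ... barrier synchronization: all PUTs of this superstep are now delivered. */
    bsp_sync();

    /* In the next superstep, 'incoming' holds the value put by the previous process. */
    bsp_pop_reg( &incoming );
    bsp_end();
}

int main( int argc, char **argv ) {
    bsp_init( &spmd_part, argc, argv );  /* ensures spawned processes start in spmd_part */
    spmd_part();
    return 0;
}
```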

The BSP model is also well-suited for automatic memory management for distributed-memory computing through over-decomposition of the problem and oversubscription of the processors. The computation is divided into more logical processes than there are physical processors, and processes are randomly assigned to processors. This strategy can be shown statistically to lead to almost perfect load balancing, both of work and communication.
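A small self-contained sketch of this strategy (plain C; the processor count, the number of logical processes, and the per-process work are invented for illustration) randomly assigns many logical processes to a few processors and prints the resulting per-processor load, which tends to even out as the degree of oversubscription grows.

```c
#include <stdio.h>
#include <stdlib.h>

/* Over-decomposition: many logical processes, few physical processors,
 * random placement. With enough logical processes per processor the
 * per-processor totals of work become nearly equal. */
int main( void ) {
    enum { P = 8, LOGICAL = 512 };       /* arbitrary example sizes */
    double load[ P ] = { 0.0 };

    srand( 42 );                         /* fixed seed for reproducibility */
    for( int lp = 0; lp < LOGICAL; ++lp ) {
        const double work = 1.0 + ( rand() % 100 ) / 100.0; /* uneven work per logical process */
        load[ rand() % P ] += work;      /* random assignment to a processor */
    }

    for( int i = 0; i < P; ++i )
        printf( "processor %d: total work %.1f\n", i, load[ i ] );
    return 0;
}
```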

Communication

In many parallel programming systems, communications are considered at the level of individual actions, such as sending and receiving a message or memory-to-memory transfer. This is difficult to work with since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex. In particular, it is difficult to say much about the time any single communication action will take to complete.

The BSP model considers communication actions en masse. This has the effect that an upper bound on the time taken to communicate a set of data can be given. BSP considers all communication actions of a superstep as one unit and assumes all individual messages sent as part of this unit have a fixed size.

The maximum number of incoming or outgoing messages for a superstep is denoted by h. The ability of a communication network to deliver data is captured by a parameter g, defined such that it takes time hg for a processor to deliver h messages of size 1.

A message of length m obviously takes longer to send than a message of size 1. However, the BSP model does not make a distinction between a message of length m and m messages of length 1. In either case, the cost is said to be mg.

The parameter g depends on the following:

- the protocols used to interact within the communication network,
- buffer management by both the processors and the communication network,
- the routing strategy used in the network, and
- the BSP runtime system.

In practice, g is determined empirically for each parallel computer. Note that g is not the normalized single-word delivery time but the single-word delivery time under continuous traffic conditions.
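As a small worked example of this cost definition (plain C; the per-process word counts and the value of g are made up), the h of a superstep is the largest number of words that any single process sends or receives, and the communication cost charged to the whole superstep is hg:

```c
#include <stdio.h>

/* BSP communication cost of one superstep: h is the maximum fan-in or
 * fan-out (in words) over all processes; the cost is h * g. */
int main( void ) {
    const int p = 4;
    const int sent[]     = { 10, 3, 7, 12 };  /* words sent by each process (invented) */
    const int received[] = {  8, 9, 6, 10 };  /* words received by each process (invented) */
    const double g = 2.5;                     /* per-word delivery time, measured empirically */

    int h = 0;
    for( int i = 0; i < p; ++i ) {
        if( sent[ i ]     > h ) h = sent[ i ];
        if( received[ i ] > h ) h = received[ i ];
    }
    printf( "h = %d, communication cost = h * g = %.1f\n", h, h * g );
    return 0;
}
```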

Barriers

The one-sided communication of the BSP model requires barrier synchronization. Barriers are potentially costly but avoid the possibility of deadlock or livelock, since barriers cannot create circular data dependencies; tools to detect and deal with such dependencies are therefore unnecessary. Barriers also permit novel forms of fault tolerance [citation needed].

The cost of barrier synchronization is influenced by a couple of issues:

- The variation in the completion times of the participating local computations: in the worst case, all processes wait for the slowest one, so the best an implementation can do is ensure that each process works on a roughly equal share of the problem.
- The cost of reaching a globally consistent state in all of the processors: this depends on the communication network, on whether special-purpose hardware is available for synchronization, and on the way interrupts are handled by the processors.

The cost of a barrier synchronization is denoted by l. Its value depends on the synchronization mechanism of the BSP computer, for example on the periodic check suggested by Valiant. [1] In practice, a value of l is determined empirically.
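Since l is a machine parameter that is measured rather than derived, a crude measurement sketch is easy to write. The following plain C program (POSIX threads; the thread count and iteration count are arbitrary) times repeated barriers on a shared-memory machine and reports the average cost per barrier:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Crude empirical estimate of the barrier cost l: T threads repeatedly
 * meet at a barrier and the average time per barrier is reported. */
enum { T = 4, ITERS = 10000 };
static pthread_barrier_t barrier;

static void *worker( void *arg ) {
    (void) arg;
    for( int i = 0; i < ITERS; ++i )
        pthread_barrier_wait( &barrier );
    return NULL;
}

int main( void ) {
    pthread_t threads[ T ];
    struct timespec t0, t1;

    pthread_barrier_init( &barrier, NULL, T );
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    for( int i = 0; i < T; ++i )
        pthread_create( &threads[ i ], NULL, worker, NULL );
    for( int i = 0; i < T; ++i )
        pthread_join( threads[ i ], NULL );
    clock_gettime( CLOCK_MONOTONIC, &t1 );

    const double seconds = ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) * 1e-9;
    printf( "average barrier cost: %.2f microseconds\n", 1e6 * seconds / ITERS );

    pthread_barrier_destroy( &barrier );
    return 0;
}
```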

On large computers, barriers are expensive, and increasingly so at large scale. There is a large body of literature on removing synchronization points from existing algorithms, in the context of BSP computing and beyond. For example, many algorithms allow for the local detection of the global end of a superstep simply by comparing local information to the number of messages already received. This drives the cost of global synchronization, compared to the minimally required latency of communication, to zero. [8] Yet this minimal latency, too, is expected to increase for future supercomputer architectures and network interconnects; the BSP model, along with other models for parallel computation, will require adaptation to cope with this trend. Multi-BSP is one BSP-based solution. [5]

Algorithmic cost

The cost of a superstep is determined as the sum of three terms:

- the cost of the longest-running local computation,
- the cost of the global communication between the processors, and
- the cost of the barrier synchronization at the end of the superstep.

Thus, the cost of one superstep for $p$ processors is

$\max_{i=1}^{p}(w_i) + \max_{i=1}^{p}(h_i g) + l$

where $w_i$ is the cost for the local computation in process $i$, and $h_i$ is the number of messages sent or received by process $i$. Note that homogeneous processors are assumed here. It is more common for the expression to be written as $w + hg + l$, where $w$ and $h$ are maxima. The cost of an entire BSP algorithm is the sum of the cost of each superstep:

$W + Hg + Sl = \sum_{s=1}^{S} w_s + g \sum_{s=1}^{S} h_s + Sl$

where $S$ is the number of supersteps.

$W$, $H$, and $S$ are usually modeled as functions that vary with problem size. These three characteristics of a BSP algorithm are usually described in terms of asymptotic notation, e.g., $H \in O(n/p)$.
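For illustration, the cost formula can be evaluated directly. The sketch below (plain C; the values of $w_i$, $h_i$, $g$ and $l$ are invented) computes the cost of a single superstep as the maximum local work plus $g$ times the maximum number of messages plus $l$:

```c
#include <stdio.h>

/* Cost of one BSP superstep: max local work + g * max message count + l. */
static double superstep_cost( const int p, const double w[], const int h[],
                              const double g, const double l ) {
    double w_max = 0.0;
    int    h_max = 0;
    for( int i = 0; i < p; ++i ) {
        if( w[ i ] > w_max ) w_max = w[ i ];
        if( h[ i ] > h_max ) h_max = h[ i ];
    }
    return w_max + g * h_max + l;
}

int main( void ) {
    const double w[] = { 100.0, 120.0, 90.0, 110.0 }; /* local computation per process (invented) */
    const int    h[] = { 12, 8, 15, 10 };             /* words sent or received per process (invented) */
    const double g = 2.5, l = 50.0;                   /* machine parameters, determined empirically */

    /* The cost of a full BSP algorithm is the sum of its superstep costs. */
    printf( "superstep cost = %.1f\n", superstep_cost( 4, w, h, g, l ) );
    return 0;
}
```

With these numbers the superstep cost is 120 + 2.5 * 15 + 50 = 207.5.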

Extensions and uses

Interest in BSP has soared, with Google adopting it as a major technology for graph analytics at massive scale via Pregel and MapReduce. Also, with the next generation of Hadoop decoupling the MapReduce model from the rest of the Hadoop infrastructure, there are now active open-source projects to add explicit BSP programming, as well as other high-performance parallel programming models, on top of Hadoop. Examples are Apache Hama and Apache Giraph. [9]

BSP has been extended by many authors to address concerns about BSP's unsuitability for modelling specific architectures or computational paradigms. One example of this is the decomposable BSP model. The model has also been used in the creation of a number of new programming languages and interfaces, such as Bulk Synchronous Parallel ML (BSML), BSPlib, Apache Hama, [9] and Pregel. [10]

Notable implementations of the BSPlib standard are the Paderborn University BSP library [11] and the Oxford BSP Toolset by Jonathan Hill. [12] Modern implementations include BSPonMPI [13] (which simulates BSP on top of the Message Passing Interface), and MulticoreBSP [14] [15] (a novel implementation targeting modern shared-memory architectures). MulticoreBSP for C is especially notable for its capability of starting nested BSP runs, thus allowing for explicit Multi-BSP programming.



References

  1. Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
  2. W F McColl. Scalable Computing. Computer Science Today: Recent Trends and Developments. J van Leeuwen (editor). LNCS Volume 1000, Springer-Verlag pp.46-61 (1995)
  3. W F McColl and A Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3) pp.287-297 (1999)
  4. J M D Hill, W F McColl, D C Stefanescu, M W Goudreau, K Lang, S B Rao, T Suel, T Tsantilas and R H Bisseling. BSPlib: The BSP Programming Library. Parallel Computing 24 (14) pp. 1947-1980 (1998)
  5. Valiant, L. G. (2011). A bridging model for multi-core computing. Journal of Computer and System Sciences, 77(1), 154-166
  6. A Bridging Model for High Performance Cloud Computing by Bill McColl in 18th SIAM Conference on Parallel Processing for Scientific Computing (2018), http://meetings.siam.org/sess/dsp_talk.cfm?p=88973 Archived 2019-12-11 at the Wayback Machine .
  7. Bill McColl. Mathematics, Models and Architectures. Chapter 1, pp. 6-53. Mathematics for Future Computing and Communications, edited by Liao Heng and Bill McColl. Cambridge University Press (2022).
  8. Alpert, R., & Philbin, J. (1997). cBSP: Zero-cost synchronization in a modified BSP model. NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.
  9. Apache Hama
  10. Pregel
  11. The Paderborn University BSP (PUB) Library - Design, Implementation and Performance Heinz Nixdorf Institute, Department of Computer Science, University of Paderborn, Germany, technical report Archived 2001-06-05 at the Wayback Machine .
  12. Jonathan Hill: The Oxford BSP Toolset, 1998.
  13. Wijnand J. Suijlen: BSPonMPI, 2006.
  14. MulticoreBSP for C: a high-performance library for shared-memory parallel programming by A. N. Yzelman, R. H. Bisseling, D. Roose, and K. Meerbergen in International Journal of Parallel Programming, in press (2013), doi:10.1109/TPDS.2013.31.
  15. An Object-Oriented Bulk Synchronous Parallel Library for Multicore Programming by A. N. Yzelman & Rob H. Bisseling in Concurrency and Computation: Practice and Experience 24(5), pp. 533-553 (2012), doi:10.1002/cpe.1843.