Collective operation

Collective operations are building blocks for interaction patterns that are often used in SPMD algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations.

A realization of the collective operations is provided by the Message Passing Interface [1] (MPI).

Definitions

In all asymptotic runtime functions, we denote the latency (or startup time per message, independent of message size) by $\alpha$, the communication cost per word by $\beta$, the number of processing units by $p$ and the input size per node by $n$. In cases where we have initial messages on more than one node, we assume that all local messages are of the same size. To address individual processing units we use $p_i \in \{p_0, p_1, \dots, p_{p-1}\}$.

If we do not have an equal distribution, i.e. node $p_i$ has a message of size $n_i$, we get an upper bound for the runtime by setting $n = \max(n_0, n_1, \dots, n_{p-1})$.
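
Concretely, in this model a single point-to-point message of size $n$ is usually charged $\alpha + \beta n$ time; the summary below makes this convention and the upper-bound trick explicit (a standard reading of the model, not spelled out above):

```latex
% Cost of one point-to-point message of size n in the (alpha, beta) model:
T_{\mathrm{p2p}}(n) = \alpha + \beta n
% Upper bound for unequal local message sizes n_0, ..., n_{p-1}:
n = \max(n_0, n_1, \dots, n_{p-1})
```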

A distributed memory model is assumed. The concepts are similar for the shared memory model. However, shared memory systems can provide hardware support for some operations, for example broadcast (§ Broadcast), which allows convenient concurrent read. [2] Thus, new algorithmic possibilities can become available.

Broadcast

Figure: Information flow of the Broadcast operation performed on three nodes.

The broadcast pattern [3] is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values. Broadcast can be interpreted as an inverse version of the reduce pattern (§ Reduce). Initially only the root $r$ with id $0$ stores message $m$. During broadcast, $m$ is sent to the remaining processing units, so that eventually $m$ is available to all processing units.

Since an implementation by means of a sequential for-loop with $p-1$ iterations becomes a bottleneck, divide-and-conquer approaches are common. One possibility is to utilize a binomial tree structure with the requirement that $p$ has to be a power of two. When a processing unit is responsible for sending $m$ to processing units $i..j$, it sends $m$ to processing unit $\lceil (i+j)/2 \rceil$ and delegates responsibility for the processing units $\lceil (i+j)/2 \rceil .. j$ to it, while its own responsibility is cut down to $i..\lceil (i+j)/2 \rceil - 1$.
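
The following is a minimal sketch of this recursive responsibility-splitting scheme, written with MPI point-to-point calls; the function name and the choice of rank 0 as root are illustrative. MPI itself exposes broadcast directly as MPI_Bcast, which implementations typically realize with tree schemes of this kind.

```c
/* Sketch of the binomial-tree broadcast described above, built from MPI
 * point-to-point calls. Root is rank 0; the range-splitting scheme works
 * for any p. Function and variable names are illustrative. */
#include <mpi.h>

void binomial_bcast(void *msg, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int lo = 0, hi = p - 1;          /* range this unit's subtree covers   */
    int owner = 0;                   /* unit currently holding the message */

    while (lo < hi) {
        int mid = (lo + hi + 1) / 2; /* = ceil((lo + hi) / 2) */
        if (rank < mid) {            /* stay in the lower half lo..mid-1   */
            if (rank == owner)
                MPI_Send(msg, count, type, mid, 0, comm);
            hi = mid - 1;
        } else {                     /* move to the upper half mid..hi     */
            if (rank == mid)
                MPI_Recv(msg, count, type, owner, 0, comm, MPI_STATUS_IGNORE);
            lo = mid;
            owner = mid;             /* mid is now responsible for mid..hi */
        }
    }
}
```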

Binomial trees have a problem with long messages $m$: the receiving unit can only propagate the message to other units after it has received the whole message. In the meantime, the communication network is not utilized. Therefore pipelining on binary trees is used, where $m$ is split into an array of $k$ packets of size $\lceil n/k \rceil$. The packets are then broadcast one after another, so that data is distributed fast in the communication network. A sketch of this packetized scheme follows below.
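
The sketch below runs the packets down a static binary tree, assuming rank 0 as root and children $2i+1$ and $2i+2$; this layout and the blocking sends are simplifications for illustration, not the only possible realization:

```c
/* Sketch of pipelined broadcast on a binary tree: the message is split
 * into packets that travel down the tree one after another, so a node can
 * forward packet j while packet j+1 is still on its way. */
#include <mpi.h>

void pipelined_tree_bcast(char *msg, int n, int packet_size, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int parent = (rank - 1) / 2;                 /* unused on the root */
    int left = 2 * rank + 1, right = 2 * rank + 2;

    for (int off = 0; off < n; off += packet_size) {
        int len = (n - off < packet_size) ? (n - off) : packet_size;
        if (rank != 0)   /* receive this packet from the parent */
            MPI_Recv(msg + off, len, MPI_CHAR, parent, 0, comm,
                     MPI_STATUS_IGNORE);
        if (left < p)    /* forward it while later packets follow */
            MPI_Send(msg + off, len, MPI_CHAR, left, 0, comm);
        if (right < p)
            MPI_Send(msg + off, len, MPI_CHAR, right, 0, comm);
    }
}
```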

Pipelined broadcast on a balanced binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$, whereas the non-pipelined case costs $\mathcal{O}((\alpha + \beta n) \log p)$.

Reduce

Figure: Information flow of the Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.

The reduce pattern [4] is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator. Given $p$ processing units, message $m_i$ is on processing unit $p_i$ initially. All $m_i$ are aggregated by $\otimes$ and the result is eventually stored on $p_0$. The reduction operator $\otimes$ must be at least associative. Some algorithms require a commutative operator with a neutral element. Operators like $\mathrm{sum}$, $\mathrm{min}$, $\mathrm{max}$ are common.
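
In MPI this pattern is available as MPI_Reduce; a minimal example with the sum operator:

```c
/* Minimal MPI example of the reduce pattern: each unit contributes a
 * local value, and the sum ends up on root 0 only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank + 1;             /* this unit's message m_i */
    int global = 0;
    /* MPI_SUM is associative and commutative, as required. */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", global); /* only the root holds the result */
    MPI_Finalize();
    return 0;
}
```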

Implementation considerations are similar to broadcast (§ Broadcast). For pipelining on binary trees, the message must be representable as a vector of smaller objects for component-wise reduction.

Pipelined reduce on a balanced binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$.

All-Reduce

Figure: Information flow of the All-Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.

The all-reduce pattern [5] (also called allreduce) is used if the result of a reduce operation (§ Reduce) must be distributed to all processing units. Given $p$ processing units, message $m_i$ is on processing unit $p_i$ initially. All $m_i$ are aggregated by an operator $\otimes$ and the result is eventually stored on all $p_i$. Analogous to the reduce operation, the operator $\otimes$ must be at least associative.

All-reduce can be interpreted as a reduce operation with a subsequent broadcast (§ Broadcast). For long messages a corresponding implementation is suitable, whereas for short messages, the latency can be reduced by using a hypercube (Hypercube (communication pattern) § All-Gather/All-Reduce) topology, if $p$ is a power of two. All-reduce can also be implemented with a butterfly algorithm and achieve optimal latency and bandwidth. [6]

All-reduce is possible in $\mathcal{O}(\alpha \log p + \beta n)$, since reduce and broadcast are each possible in $\mathcal{O}(\alpha \log p + \beta n)$ with pipelining on balanced binary trees. All-reduce implemented with a butterfly algorithm achieves the same asymptotic runtime.
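
MPI exposes this operation directly as MPI_Allreduce. The following is a minimal sketch of the butterfly (recursive-doubling) scheme itself for integer sums, assuming $p$ is a power of two; in each of the $\log p$ rounds, every unit exchanges its partial result with its neighbor along one hypercube dimension:

```c
/* Sketch of a butterfly-style all-reduce on integer sums, assuming p is
 * a power of two. After log2(p) exchange rounds every unit holds the
 * full sum. Names are illustrative. */
#include <mpi.h>

int butterfly_allreduce_sum(int local, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int acc = local;
    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask; /* neighbor along one hypercube dimension */
        int recv;
        MPI_Sendrecv(&acc, 1, MPI_INT, partner, 0,
                     &recv, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        acc += recv;               /* combine partial results */
    }
    return acc;                    /* full sum on every unit */
}
```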

Prefix-Sum/Scan

Figure: Information flow of the Prefix-Sum/Scan operation performed on three nodes. The operator + can be any associative operator.

The prefix-sum or scan operation [7] is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units. It can be seen as a generalization of the reduce operation (§ Reduce). Given $p$ processing units, message $m_i$ is on processing unit $p_i$. The operator $\otimes$ must be at least associative, whereas some algorithms also require a commutative operator and a neutral element. Common operators are $\mathrm{sum}$, $\mathrm{min}$ and $\mathrm{max}$. Eventually processing unit $p_i$ stores the prefix sum $\bigotimes_{i'=0}^{i} m_{i'}$. In the case of the so-called exclusive prefix sum, processing unit $p_i$ stores the prefix sum $\bigotimes_{i'=0}^{i-1} m_{i'}$. Some algorithms require storing the overall sum at each processing unit in addition to the prefix sums.
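
MPI provides both variants directly, as MPI_Scan (inclusive) and MPI_Exscan (exclusive); a minimal example:

```c
/* Minimal MPI example of inclusive and exclusive prefix sums. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank + 1, incl = 0, excl = 0;
    /* Inclusive: unit i receives m_0 + ... + m_i. */
    MPI_Scan(&local, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* Exclusive: unit i receives m_0 + ... + m_{i-1};
     * the result is undefined on unit 0, so print 0 there. */
    MPI_Exscan(&local, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: inclusive %d, exclusive %d\n",
           rank, incl, rank ? excl : 0);
    MPI_Finalize();
    return 0;
}
```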

For short messages, this can be achieved with a hypercube topology if $p$ is a power of two. For long messages, the hypercube (Hypercube (communication pattern) § Prefix sum, Prefix sum § Distributed memory: Hypercube algorithm) topology is not suitable, since all processing units are active in every step and therefore pipelining cannot be used. A binary tree topology is better suited for arbitrary $p$ and long messages (Prefix sum § Large message sizes: Pipelined binary tree).

Prefix-sum on a binary tree can be implemented with an upward and downward phase. In the upward phase reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children. With this approach pipelining is possible, because the operations are equal to reduction (§ Reduce) and broadcast (§ Broadcast).

Pipelined prefix sum on a binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$.

Barrier

The barrier [8] as a collective operation is a generalization of the concept of a barrier that can be used in distributed computing. When a processing unit calls barrier, it waits until all other processing units have called barrier as well. Barrier is thus used to achieve global synchronization in distributed computing.

One way to implement barrier is to call all-reduce (§ All-Reduce) with an empty/dummy operand. We know the runtime of all-reduce is $\mathcal{O}(\alpha \log p + \beta n)$. Using a dummy operand reduces the size $n$ to a constant factor, which leads to a runtime of $\mathcal{O}(\alpha \log p)$.
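
A sketch of this construction in MPI terms, reducing a single dummy word (MPI also offers MPI_Barrier directly):

```c
/* Barrier built from an all-reduce over one dummy word, as described
 * above: no unit can leave the all-reduce before all have entered it,
 * because the result depends on every unit's contribution. */
#include <mpi.h>

void barrier_via_allreduce(MPI_Comm comm)
{
    int dummy = 0, result;
    MPI_Allreduce(&dummy, &result, 1, MPI_INT, MPI_MAX, comm);
}
```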

Gather

Figure: Information flow of the Gather operation performed on three nodes.

The gather communication pattern [9] is used to store data from all processing units on a single processing unit. Given $p$ processing units, message $m_i$ is on processing unit $p_i$. For a fixed processing unit $p_j$, we want to store the message $m_0 \cdot m_1 \cdot \ldots \cdot m_{p-1}$ on $p_j$. Gather can be thought of as a reduce operation (§ Reduce) that uses the concatenation operator. This works due to the fact that concatenation is associative. By using the same binomial tree reduction algorithm we get a runtime of $\mathcal{O}(\alpha \log p + \beta p n)$. We see that the asymptotic runtime is similar to the asymptotic runtime of reduce $\mathcal{O}(\alpha \log p + \beta n)$, but with the addition of a factor $p$ to the term $\beta n$. This additional factor is due to the message size increasing in each step as messages get concatenated. Compare this to reduce, where the message size remains constant for operators like $\mathrm{sum}$.
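
In MPI the pattern is available as MPI_Gather; a minimal example in which every unit contributes one integer:

```c
/* Minimal MPI example of gather: each unit contributes one integer and
 * root 0 receives the concatenation m_0, m_1, ..., m_{p-1}. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local = rank * 10;             /* this unit's message m_i */
    int *all = NULL;
    if (rank == 0)                     /* receive buffer on the root only */
        all = malloc(p * sizeof(int));

    MPI_Gather(&local, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < p; i++)
            printf("m_%d = %d\n", i, all[i]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```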

All-Gather

Figure: Information flow of the All-Gather operation performed on three nodes.

The all-gather communication pattern [9] is used to collect data from all processing units and to store the collected data on all processing units. Given $p$ processing units $p_i$, message $m_i$ initially stored on $p_i$, we want to store the message $m_0 \cdot m_1 \cdot \ldots \cdot m_{p-1}$ on each $p_i$.

It can be thought of in multiple ways. The first is as an all-reduce operation (§ All-Reduce) with concatenation as the operator, in the same way that gather can be represented by reduce. The second is as a gather operation followed by a broadcast of the new message of size $pn$. With this we see that all-gather is possible in $\mathcal{O}(\alpha \log p + \beta p n)$.
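
In MPI the pattern is available as MPI_Allgather; a minimal example:

```c
/* Minimal MPI example of all-gather: afterwards every unit holds the
 * concatenation of all contributions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local = rank * rank;            /* this unit's message m_i */
    int *all = malloc(p * sizeof(int)); /* every unit needs the full buffer */

    MPI_Allgather(&local, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d sees m_0..m_%d, e.g. m_%d = %d\n",
           rank, p - 1, p - 1, all[p - 1]);
    free(all);
    MPI_Finalize();
    return 0;
}
```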

Scatter

Figure: Information flow of the Scatter operation performed on three nodes.

The scatter communication pattern [10] is used to distribute data from one processing unit to all the processing units. It differs from broadcast in that it does not send the same message to all processing units. Instead it splits the message and delivers one part of it to each processing unit.

Given $p$ processing units $p_i$, and a fixed processing unit $p_j$ that holds the message $m = m_0 \cdot m_1 \cdot \ldots \cdot m_{p-1}$, we want to transport the message $m_i$ onto $p_i$. The same implementation concerns as for gather (§ Gather) apply. This leads to an optimal runtime in $\mathcal{O}(\alpha \log p + \beta p n)$.
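
In MPI the pattern is available as MPI_Scatter; a minimal example in which the root delivers one integer to each unit:

```c
/* Minimal MPI example of scatter: root 0 holds p integers and delivers
 * the i-th one to unit i. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *parts = NULL;
    if (rank == 0) {                /* only the root holds m = m_0 ... m_{p-1} */
        parts = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++)
            parts[i] = 100 + i;
    }

    int mine;
    MPI_Scatter(parts, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received m_%d = %d\n", rank, rank, mine);
    if (rank == 0) free(parts);
    MPI_Finalize();
    return 0;
}
```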

All-to-all

All-to-all [11] is the most general communication pattern. For $p_i, p_j \in \{p_0, p_1, \dots, p_{p-1}\}$, message $m_{ij}$ is the message that is initially stored on node $i$ and has to be delivered to node $j$. We can express all communication primitives that do not use operators through all-to-all. For example, broadcast of message $m$ from node $p_k$ is emulated by setting $m_{ij} = m$ for $i = k$ and setting $m_{ij}$ empty for $i \neq k$.

Assuming we have a fully connected network, the best possible runtime for all-to-all is in $\mathcal{O}(p(\alpha + \beta n))$. This is achieved through $p - 1$ rounds of direct message exchange. For $p$ a power of $2$, in communication round $k$, node $p_i$ exchanges messages with node $p_{i \oplus k}$.
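
A minimal sketch of this direct exchange, assuming $p$ is a power of two and one integer per destination; the pairing $i \oplus k$ is symmetric, so both partners of a round exchange with each other (MPI offers the pattern directly as MPI_Alltoall):

```c
/* Sketch of the direct-exchange all-to-all described above, assuming p
 * is a power of two. send[j] is the message for unit j; recv[j] receives
 * the message from unit j. Names are illustrative. */
#include <mpi.h>

void direct_alltoall(const int *send, int *recv, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    recv[rank] = send[rank];          /* the message to itself stays local */
    for (int k = 1; k < p; k++) {     /* p - 1 exchange rounds */
        int partner = rank ^ k;       /* symmetric: partner ^ k == rank */
        MPI_Sendrecv(&send[partner], 1, MPI_INT, partner, k,
                     &recv[partner], 1, MPI_INT, partner, k,
                     comm, MPI_STATUS_IGNORE);
    }
}
```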

If the message size is small and latency dominates the communication, a hypercube algorithm can be used to distribute the messages in time $\mathcal{O}(\log p \, (\alpha + \beta p n))$.

Figure: Information flow of the All-to-All operation performed on three nodes. Letters indicate nodes and numbers indicate information items.

Runtime Overview

This table [12] gives an overview of the best known asymptotic runtimes, assuming we have free choice of network topology.

Example topologies for an optimal runtime are binary tree, binomial tree, and hypercube. In practice, we have to adjust to the available physical topologies, e.g. dragonfly, fat tree, or grid network; more information can be found under Network topology.

For each operation, the optimal algorithm can depend on the input size $n$. For example, broadcast for short messages is best implemented using a binomial tree, whereas for long messages a pipelined communication on a balanced binary tree is optimal.

The complexities stated in the table depend on the latency $\alpha$ and the communication cost per word $\beta$ in addition to the number of processing units $p$ and the input message size per node $n$. The # senders and # receivers columns represent the number of senders and receivers that are involved in the operation respectively. The # messages column lists the number of input messages and the Computations? column indicates if any computations are done on the messages or if the messages are just delivered without processing. Complexity gives the asymptotic runtime complexity of an optimal implementation under free choice of topology.

Name       | # senders | # receivers | # messages | Computations? | Complexity
Broadcast  | $1$       | $p$         | $1$        | no            | $\mathcal{O}(\alpha \log p + \beta n)$
Reduce     | $p$       | $1$         | $p$        | yes           | $\mathcal{O}(\alpha \log p + \beta n)$
All-reduce | $p$       | $p$         | $p$        | yes           | $\mathcal{O}(\alpha \log p + \beta n)$
Prefix sum | $p$       | $p$         | $p$        | yes           | $\mathcal{O}(\alpha \log p + \beta n)$
Barrier    | $p$       | $p$         | $0$        | no            | $\mathcal{O}(\alpha \log p)$
Gather     | $p$       | $1$         | $p$        | no            | $\mathcal{O}(\alpha \log p + \beta p n)$
All-Gather | $p$       | $p$         | $p$        | no            | $\mathcal{O}(\alpha \log p + \beta p n)$
Scatter    | $1$       | $p$         | $p$        | no            | $\mathcal{O}(\alpha \log p + \beta p n)$
All-To-All | $p$       | $p$         | $p^2$      | no            | $\mathcal{O}(p(\alpha + \beta n))$ or $\mathcal{O}(\log p \, (\alpha + \beta p n))$

Notes

  1. Intercommunicator Collective Operations. The Message Passing Interface (MPI) standard, chapter 7.3.1. Mathematics and Computer Science Division, Argonne National Laboratory.
  2. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 395
  3. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 396-401
  4. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 402-403
  5. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 403-404
  6. Yuan, Xin (February 2009). "Bandwidth optimal all-reduce algorithms for clusters of workstations" (PDF). Journal of Parallel and Distributed Computing. 69 (2).
  7. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 404-406
  8. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 408
  9. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 412-413
  10. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 413
  11. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 413-418
  12. Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 394

References

Sanders, Peter; Mehlhorn, Kurt; Dietzfelbinger, Martin; Dementiev, Roman (2019). Sequential and Parallel Algorithms and Data Structures: The Basic Toolbox. Springer Nature Switzerland AG. ISBN 978-3-030-25208-3.