Reduction operator

In computer science, the reduction operator [1] is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result. Reduction operators are associative and often (but not necessarily) commutative. [2] [3] [4] The reduction of sets of elements is an integral part of programming models such as MapReduce, where a reduction operator is applied (mapped) to all elements before they are reduced. Other parallel algorithms use reduction operators as primary operations to solve more complex problems. Many reduction operators can be used for broadcasting to distribute data to all processors.

Theory

A reduction operator can help break down a task into various partial tasks by calculating partial results which can be used to obtain a final result. It allows certain serial operations to be performed in parallel and the number of steps required for those operations to be reduced. A reduction operator stores the result of the partial tasks into a private copy of the variable. These private copies are then merged into a shared copy at the end.

An operator ⊕ is a reduction operator if:

  1. It can reduce an array to a single scalar value. [2]
  2. The final result should be obtainable from the results of the partial tasks that were created. [2]

These two requirements are satisfied for commutative and associative operators that are applied to all array elements.

Some operators which satisfy these requirements are addition, multiplication, and some logical operators (and, or, etc.).

A reduction operator ⊕ can be applied in constant time on an input set V = {v_0, v_1, …, v_{p−1}} of p vectors with m elements each. The result r = v_0 ⊕ v_1 ⊕ ⋯ ⊕ v_{p−1} of the operation is the element-wise combination of the vectors and has to be stored at a specified root processor at the end of the execution. If the result r has to be available at every processor after the computation has finished, it is often called Allreduce. An optimal sequential linear-time algorithm for reduction can apply the operator successively from front to back, always replacing two vectors with the result of the operation applied to all their elements, thus creating an instance that has one vector fewer. It needs (p − 1) ⋅ m steps until only r is left. Sequential algorithms cannot perform better than linear time, but parallel algorithms leave some room to optimize.
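
As a concrete illustration of the element-wise combination described above, the following short Python sketch applies the sequential front-to-back reduction to p vectors of m elements each; the function name and the choice of addition as the operator are illustrative only.

    from functools import reduce
    import operator

    def sequential_reduce(vectors, op=operator.add):
        """Combine p vectors of m elements each, element-wise, from front to back.

        Each step replaces two vectors by their element-wise combination, so
        (p - 1) * m applications of the operator are performed in total.
        """
        combine = lambda a, b: [op(x, y) for x, y in zip(a, b)]
        return reduce(combine, vectors)

    # Example: p = 4 vectors with m = 3 elements each, reduced with addition.
    print(sequential_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]))
    # -> [22, 26, 30]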

Example

Suppose we have an array [2, 3, 5, 1, 7, 6, 8, 4]. The sum of this array can be computed serially by sequentially reducing the array into a single sum using the '+' operator. Starting the summation from the beginning of the array yields:

(((((((2 + 3) + 5) + 1) + 7) + 6) + 8) + 4) = 36

Since '+' is both commutative and associative, it is a reduction operator. Therefore this reduction can be performed in parallel using several cores, where each core computes the sum of a subset of the array, and the reduction operator merges the results. Using a binary tree reduction would allow 4 cores to compute (2 + 3), (5 + 1), (7 + 6), and (8 + 4). Then two cores can compute (5 + 6) and (13 + 12), and lastly a single core computes (11 + 25) = 36. So a total of 4 cores can be used to compute the sum in log₂(8) = 3 steps instead of the 7 steps required for the serial version. This parallel binary tree technique computes ((2 + 3) + (5 + 1)) + ((7 + 6) + (8 + 4)). Of course the result is the same, but only because of the associativity of the reduction operator. The commutativity of the reduction operator would be important if there were a master core distributing work to several processors, since then the results could arrive back to the master processor in any order. The property of commutativity guarantees that the result will be the same.
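
The two evaluation orders can be written out directly in code. The following Python sketch (function names are illustrative) compares the serial left-to-right summation with the pairwise binary tree reduction of the same array; each round of the tree version could be executed in parallel.

    def serial_sum(values):
        """Left-to-right reduction: (((2 + 3) + 5) + ...) needs 7 steps for 8 elements."""
        total = values[0]
        for v in values[1:]:
            total = total + v
        return total

    def tree_sum(values):
        """Pairwise binary tree reduction: every round halves the number of partial sums,
        so 8 elements are reduced in log2(8) = 3 rounds. Assumes a power-of-two length."""
        while len(values) > 1:
            values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        return values[0]

    data = [2, 3, 5, 1, 7, 6, 8, 4]
    print(serial_sum(data), tree_sum(data))  # both print 36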

Nonexample

Matrix multiplication is not a reduction operator since the operation is not commutative. If processes were allowed to return their matrix multiplication results in any order to the master process, the final result that the master computes would likely be incorrect if the results arrived out of order. However, note that matrix multiplication is associative, and therefore the result would be correct as long as the proper ordering were enforced, as in the binary tree reduction technique.
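
A quick way to see the issue is to multiply two small matrices in both orders; the Python snippet below only demonstrates that the two products differ.

    def matmul(A, B):
        """Multiply two 2x2 matrices given as nested lists."""
        return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    print(matmul(A, B))  # [[2, 1], [4, 3]]
    print(matmul(B, A))  # [[3, 4], [1, 2]] -- a different result, so matrix multiplication is not commutative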

Algorithms

Binomial tree algorithms

Regarding parallel algorithms, there are two main models of parallel computation: the parallel random access machine (PRAM) as an extension of the RAM with shared memory between processing units, and the bulk synchronous parallel (BSP) computer which takes communication and synchronization into account. Both models have different implications for the time-complexity, therefore two algorithms will be shown.

PRAM-algorithm

This algorithm represents a widely spread method to handle inputs where the number of processing units p is a power of two. The reverse procedure is often used for broadcasting elements. [5] [6] [7]

(Figure: Visualization of the algorithm with p = 8, m = 1, and addition as the reduction operator.)

for k ← 0 to ⌈log₂(p)⌉ − 1 do
    for i ← 0 to p − 1 do in parallel
        if p_i is active then
            if bit k of i is set then
                set p_i to inactive
            else if i + 2^k < p then
                x_i ← x_i ⊕ x_{i + 2^k}

The binary operator for vectors is defined element-wise such that (a_0, …, a_{m−1}) ⊕ (b_0, …, b_{m−1}) = (a_0 ⊕ b_0, …, a_{m−1} ⊕ b_{m−1}). The algorithm further assumes that in the beginning x_i = v_i for all i, that p is a power of two, and that the processing units are p_0, …, p_{p−1}. In every iteration, half of the processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line takes place. The eight input elements are located at the bottom and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor p_i evaluates the given operator on the element x_i it is currently holding and on x_j, where j is the minimal index fulfilling j > i, so that p_j becomes an inactive processor in the current step. x_i and x_j are not necessarily elements of the input set, as the fields are overwritten and reused for previously evaluated expressions. To coordinate the roles of the processing units in each step without causing additional communication between them, the fact that the processing units are indexed with numbers from 0 to p − 1 is used. Each processor looks at its k-th least significant bit and decides whether to become inactive or to compute the operator on its own element and on the element whose index differs only in the k-th bit. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.

Only p_0 holds the result in the end, therefore it is the root processor. For an Allreduce-operation the result has to be distributed, which can be done by appending a broadcast from p_0. Furthermore, the number p of processors is restricted to be a power of two. This can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored for this use-case. [8]
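
The role assignment by bit testing can be simulated in a few lines. The following Python sketch models the parallel steps with an ordinary sequential loop; it assumes p is a power of two, uses addition as the operator, and the function name is an illustrative choice.

    import math

    def binomial_tree_reduce(x, op=lambda a, b: a + b):
        """Simulate the binomial tree reduction; x[i] is the value held by processor p_i.

        After ceil(log2(p)) rounds the result is held by p_0. Assumes len(x) is a power of two.
        """
        x = list(x)
        p = len(x)
        active = [True] * p
        for k in range(int(math.log2(p))):
            for i in range(p):              # this inner loop is parallel in the PRAM model
                if active[i]:
                    if i & (1 << k):        # bit k of i is set: p_i becomes inactive
                        active[i] = False
                    elif i + (1 << k) < p:  # otherwise combine with the partner's element
                        x[i] = op(x[i], x[i + (1 << k)])
        return x[0]

    print(binomial_tree_reduce([2, 3, 5, 1, 7, 6, 8, 4]))  # 36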

Runtime analysis

The main loop is executed ⌈log₂(p)⌉ times; the time needed for the part done in parallel is in O(m), as a processing unit either combines two vectors or becomes inactive. Thus the parallel time for the PRAM is T(p) = O(log(p) ⋅ m). The strategy for handling read and write conflicts can be chosen to be as restrictive as exclusive read and exclusive write (EREW). The speedup of the algorithm is S(p) = O(T_seq / T(p)) = O(p / log(p)) and therefore the efficiency is E(p) = O(S(p) / p) = O(1 / log(p)). The efficiency suffers because half of the active processing units become inactive after each step, so p / 2^i units are active in step i.

Distributed memory algorithm

In contrast to the PRAM-algorithm, in the distributed memory model memory is not shared between processing units, so data has to be exchanged explicitly between them, as can be seen in the following algorithm.

for k ← 0 to ⌈log₂(p)⌉ − 1 do
    for i ← 0 to p − 1 do in parallel
        if p_i is active then
            if bit k of i is set then
                send x_i to p_{i − 2^k}
                set p_i to inactive
            else if i + 2^k < p then
                receive x_{i + 2^k}
                x_i ← x_i ⊕ x_{i + 2^k}

The only difference between the distributed algorithm and the PRAM version is the inclusion of explicit communication primitives; the operating principle stays the same.
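
Translated into actual message passing, the same pattern might look like the following sketch, which assumes the mpi4py package, one scalar value per process, and a power-of-two number of processes; it is a minimal illustration, not the MPI library's own reduction.

    # Minimal sketch of the distributed binomial tree reduction, assuming mpi4py is installed.
    # Run with e.g.: mpiexec -n 8 python binomial_reduce.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    p = comm.Get_size()

    x = float(rank)      # placeholder local value; a real application supplies its own data
    active = True
    k = 0
    while (1 << k) < p:
        if active:
            if rank & (1 << k):                    # bit k set: send own value and become inactive
                comm.send(x, dest=rank - (1 << k))
                active = False
            elif rank + (1 << k) < p:              # bit k not set: receive the partner's value and combine
                x = x + comm.recv(source=rank + (1 << k))
        k += 1

    if rank == 0:
        print("reduction result:", x)              # only the root p_0 holds the result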

Runtime analysis

The communication between units leads to some overhead. A simple analysis for the algorithm uses the BSP-model and incorporates the time T_start needed to initiate communication and the time T_byte needed to send a byte. Then the resulting runtime is Θ((T_start + n ⋅ T_byte) ⋅ log(p)), as the m elements of a vector are sent in each iteration and have size n in total.

Pipeline-algorithm

(Figure: Visualization of the pipeline-algorithm with p = 5, m = 4 and addition as the reduction operator.)

For distributed memory models, it can make sense to use pipelined communication. This is especially the case when p is small in comparison to m. Usually, linear pipelines split data or tasks into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors are not inseparable: the operator can be evaluated for single elements: [9]

for k ← 0 to p + m − 3 do
    for i ← 0 to p − 1 do in parallel
        if i ≤ k < i + m and i < p − 1 then
            send x_i[k − i] to p_{i+1}
        if i − 1 ≤ k < i − 1 + m and i > 0 then
            receive x_{i−1}[k − i + 1] from p_{i−1}
            x_i[k − i + 1] ← x_i[k − i + 1] ⊕ x_{i−1}[k − i + 1]

It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at the last processing unit p_{p−1} at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step.
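
The pipelined schedule can also be simulated step by step. The Python sketch below models the concurrent sends and receives of one step by collecting all messages before applying them; the vector values are illustrative, with p = 5 and m = 4 as in the animation.

    def pipeline_reduce(vectors, op=lambda a, b: a + b):
        """Simulate the pipelined reduction; vectors[i] is the vector held by processor p_i.

        In step k, p_i forwards element k - i to p_{i+1}, which combines it with its own copy.
        After p + m - 2 steps the fully reduced vector is held by the last processor.
        """
        x = [list(v) for v in vectors]
        p, m = len(x), len(x[0])
        for k in range(p + m - 2):
            # collect the messages of this step first, modelling concurrent send/receive
            msgs = [(i + 1, k - i, x[i][k - i]) for i in range(p - 1) if i <= k < i + m]
            for dest, j, val in msgs:
                x[dest][j] = op(x[dest][j], val)
        return x[p - 1]

    # p = 5 vectors of m = 4 elements each, reduced element-wise with addition.
    vecs = [[1, 2, 3, 4], [1, 1, 1, 1], [0, 1, 0, 1], [2, 2, 2, 2], [3, 0, 3, 0]]
    print(pipeline_reduce(vecs))  # [7, 6, 9, 8]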

Runtime analysis

The number of steps in the parallel execution is p + m − 2; it takes p − 1 steps until the last processing unit receives its first element and an additional m − 1 steps until all elements are received. Therefore, the runtime in the BSP-model is T(n, p, m) = (T_start + (n / m) ⋅ T_byte) ⋅ (p + m − 2), assuming that n is the total byte-size of a vector.

Although m has a fixed value, it is possible to logically group elements of a vector together and reduce m. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and the last two elements, which are always transmitted and computed together. In this case, double the volume is sent each step, but the number of steps has roughly halved: the parameter m is halved, while the total byte-size n stays the same. The runtime of this approach depends on the value of m, which can be optimized if T_start and T_byte are known. It is optimal for m = √(n ⋅ (p − 2) ⋅ T_byte / T_start), assuming that this results in a smaller m that divides the original one.
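
Since the optimum need not divide the original vector length, a simple way to pick the block count in practice is to evaluate the runtime model stated above for all divisors of m; the following Python sketch does exactly that, with all parameter values purely illustrative.

    def bsp_pipeline_time(p, blocks, n, t_start, t_byte):
        """Modelled runtime (T_start + n / blocks * T_byte) * (p + blocks - 2) of the pipelined
        reduction when a vector of total byte-size n is split into the given number of blocks."""
        return (t_start + n / blocks * t_byte) * (p + blocks - 2)

    def best_block_count(p, m, n, t_start, t_byte):
        """Pick the number of pipeline blocks (a divisor of the element count m) with the
        smallest modelled runtime."""
        divisors = [d for d in range(1, m + 1) if m % d == 0]
        return min(divisors, key=lambda d: bsp_pipeline_time(p, d, n, t_start, t_byte))

    # Illustrative parameters: 16 processors, vectors of 1024 elements and 8192 bytes in total.
    print(best_block_count(p=16, m=1024, n=8192, t_start=1e-5, t_byte=1e-9))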

Applications

Reduction is one of the main collective operations implemented in the Message Passing Interface, where performance of the used algorithm is important and evaluated constantly for different use cases. [10] Operators can be used as parameters for MPI_Reduce and MPI_Allreduce, with the difference that the result is available at one (root) processing unit or at all of them. MapReduce relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters. [11] [12]
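
As an illustration of how the two MPI operations are used, the following mpi4py sketch (assuming mpi4py and NumPy are installed) performs the same sum reduction once with Reduce, where only the root holds the result, and once with Allreduce, where every rank does.

    # Minimal mpi4py sketch; run with e.g.: mpiexec -n 4 python mpi_reduce_example.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local = np.array([rank, 2 * rank], dtype='i')   # each rank contributes a small vector
    total = np.zeros(2, dtype='i')

    # MPI_Reduce: the result is available only at the root process (rank 0 here).
    comm.Reduce(local, total, op=MPI.SUM, root=0)
    if rank == 0:
        print("Reduce result at root:", total)

    # MPI_Allreduce: every process receives the combined result.
    everywhere = np.zeros(2, dtype='i')
    comm.Allreduce(local, everywhere, op=MPI.SUM)
    print("rank", rank, "sees", everywhere)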

Some parallel sorting algorithms use reductions to be able to handle very big data sets. [13]


References

  1. "Reduction Clause". www.dartmouth.edu. Dartmouth College. 23 March 2009. Retrieved 26 September 2016.
  2. 1 2 3 Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. CRC Press. p. 75. ISBN   978-1-4822-1118-4.
  3. Chandra, Rohit (2001). Parallel Programming in OpenMP . Morgan Kaufmann. pp.  59–77. ISBN   1558606718.
  4. Cole, Murray (2004). "Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming" (PDF). Parallel Computing. 30 (3): 393. doi:10.1016/j.parco.2003.12.002. hdl: 20.500.11820/8eb79d42-de83-4cfb-9faa-30d9ac3b3839 .
  5. Bar-Noy, Amotz; Kipnis, Shlomo (1994). "Broadcasting multiple messages in simultaneous send/receive systems". Discrete Applied Mathematics. 55 (2): 95–105. doi:10.1016/0166-218x(94)90001-9.
  6. Santos, Eunice E. (2002). "Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines". Journal of Parallel and Distributed Computing. 62 (4): 517–543. doi:10.1006/jpdc.2000.1698.
  7. Slater, P.; Cockayne, E.; Hedetniemi, S. (1981-11-01). "Information Dissemination in Trees". SIAM Journal on Computing. 10 (4): 692–701. doi:10.1137/0210052. ISSN   0097-5397.
  8. Rabenseifner, Rolf; Träff, Jesper Larsson (2004-09-19). "More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems". Recent Advances in Parallel Virtual Machine and Message Passing Interface. Lecture Notes in Computer Science. Vol. 3241. Springer, Berlin, Heidelberg. pp. 36–46. doi:10.1007/978-3-540-30218-6_13. ISBN   9783540231639.
  9. Bar-Noy, A.; Kipnis, S. (1994-09-01). "Designing broadcasting algorithms in the postal model for message-passing systems". Mathematical Systems Theory. 27 (5): 431–452. CiteSeerX   10.1.1.54.2543 . doi:10.1007/BF01184933. ISSN   0025-5661. S2CID   42798826.
  10. Pješivac-Grbović, Jelena; Angskun, Thara; Bosilca, George; Fagg, Graham E.; Gabriel, Edgar; Dongarra, Jack J. (2007-06-01). "Performance analysis of MPI collective operations". Cluster Computing. 10 (2): 127–143. CiteSeerX   10.1.1.80.3867 . doi:10.1007/s10586-007-0012-0. ISSN   1386-7857. S2CID   2142998.
  11. Lämmel, Ralf (2008). "Google's MapReduce programming model — Revisited". Science of Computer Programming. 70 (1): 1–30. doi:10.1016/j.scico.2007.07.001.
  12. Senger, Hermes; Gil-Costa, Veronica; Arantes, Luciana; Marcondes, Cesar A. C.; Marín, Mauricio; Sato, Liria M.; da Silva, Fabrício A.B. (2016-06-10). "BSP cost and scalability analysis for MapReduce operations". Concurrency and Computation: Practice and Experience. 28 (8): 2503–2527. doi:10.1002/cpe.3628. hdl: 10533/147670 . ISSN   1532-0634. S2CID   33645927.
  13. Axtmann, Michael; Bingmann, Timo; Sanders, Peter; Schulz, Christian (2014-10-24). "Practical Massively Parallel Sorting". arXiv: 1410.6754 [cs.DS].