Broadcast (parallel pattern)

Broadcast is a collective communication primitive in parallel programming used to distribute programming instructions or data to the nodes in a cluster. It is the reverse operation of reduction. [1] The broadcast operation is widely used in parallel algorithms, such as matrix-vector multiplication, [1] Gaussian elimination and shortest paths. [2]

The Message Passing Interface implements broadcast in MPI_Bcast. [3]
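As a usage illustration, the following minimal C program broadcasts a small array from rank 0 to all other ranks with MPI_Bcast; the buffer size and the choice of rank 0 as root are arbitrary example values, not prescribed by the standard.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Example payload: rank 0 fills the buffer, all other ranks receive it. */
    int data[4] = {0, 0, 0, 0};
    if (rank == 0)
        for (int i = 0; i < 4; i++)
            data[i] = i + 1;

    /* Collective call: every rank passes the same count, datatype and root. */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);

    MPI_Finalize();
    return 0;
}

Which of the algorithms described below is used internally is left to the MPI implementation; typical libraries switch between tree-based and pipelined schemes depending on the message size and the number of processors.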

Definition

A message M of length n should be distributed from one node to all other nodes.

t_byte is the time it takes to send one byte.

t_start is the time it takes for a message to travel to another node, independent of its length.

Therefore, the time to send a message of length n from one node to another is t_start + n · t_byte. [1]

p is the number of nodes and the number of processors.
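For example, assuming the illustrative values t_start = 10 µs and t_byte = 1 ns, sending a message of n = 10^6 bytes takes 10 µs + 10^6 · 1 ns ≈ 1.01 ms: for short messages the startup time t_start dominates, for long messages the bandwidth term n · t_byte does.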

Binomial Tree Broadcast

With binomial tree broadcast the whole message is sent at once, without splitting it into packages. Each node that has already received the message sends it on to another node, so the number of nodes holding the message doubles with every time step. The algorithm is ideal for short messages but falls short with longer ones, since during the first transfer only one node is busy.

Sending the message to all nodes takes ⌈log_2(p)⌉ time steps, which results in a runtime of ⌈log_2(p)⌉ · (t_start + n · t_byte).

Message M
id := node number
p := number of nodes

if (id > 0)
    blocking_receive M

for (i := ceil(log_2(p)) - 1; i >= 0; i--)
    if (id % 2^(i+1) == 0 && id + 2^i < p)
        send M to node id + 2^i

[4]
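The schedule above maps directly onto MPI point-to-point operations. The following C sketch is not how MPI_Bcast is implemented in any particular library, but a minimal illustration of the binomial tree schedule for a byte buffer, assuming ranks 0 to p − 1 with rank 0 as root.

#include <mpi.h>

/* Binomial tree broadcast of len bytes from rank 0 (sketch).
 * Mirrors the pseudocode: a node first receives the message from its
 * parent, then forwards it to the nodes id + 2^i for decreasing i. */
void binomial_bcast(char *buf, int len, MPI_Comm comm) {
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    if (id > 0) {
        /* The parent is obtained by clearing the lowest set bit of id. */
        int parent = id & (id - 1);
        MPI_Recv(buf, len, MPI_CHAR, parent, 0, comm, MPI_STATUS_IGNORE);
    }

    /* Number of time steps: ceil(log_2(p)). */
    int steps = 0;
    while ((1 << steps) < p)
        steps++;

    for (int i = steps - 1; i >= 0; i--)
        if (id % (1 << (i + 1)) == 0 && id + (1 << i) < p)
            MPI_Send(buf, len, MPI_CHAR, id + (1 << i), 0, comm);
}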

Linear Pipeline Broadcast

The message is split up into k packages and sent piecewise from node i to node i + 1. The time needed to distribute the first message piece is (p - 1) · (t_start + (n/k) · t_byte), whereby t_start + (n/k) · t_byte is the time needed to send one package from one processor to another.

Sending the whole message takes (k + p - 2) · (t_start + (n/k) · t_byte).

Optimal is to choose k = sqrt((p - 2) · n · t_byte / t_start), resulting in a runtime of approximately n · t_byte + (p - 2) · t_start + 2 · sqrt((p - 2) · n · t_start · t_byte).

The run time depends not only on the message length but also on the number of processors. This approach shines when the length of the message is much larger than the number of processors.

Message M := [m_1, m_2, ..., m_k]
id := node number

for (i := 1; i <= k; i++) in parallel
    if (id != 0)
        blocking_receive m_i
    if (id != p - 1)
        send m_i to node id + 1

[4]
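A minimal C/MPI sketch of the chain, assuming the message is already split into k packets of equal size and the nodes form the chain 0, 1, ..., p − 1; the packet size and count are illustrative parameters, not part of the original pseudocode.

#include <mpi.h>

/* Linear pipeline broadcast (sketch): rank 0 feeds k packets into the
 * chain, every other rank receives each packet from its left neighbour
 * and forwards it to its right neighbour. */
void pipeline_bcast(char *msg, int packet_size, int k, MPI_Comm comm) {
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    for (int i = 0; i < k; i++) {
        char *packet = msg + (long)i * packet_size;
        if (id != 0)
            MPI_Recv(packet, packet_size, MPI_CHAR, id - 1, i, comm,
                     MPI_STATUS_IGNORE);
        if (id != p - 1)
            MPI_Send(packet, packet_size, MPI_CHAR, id + 1, i, comm);
    }
}

A production version would use non-blocking operations so that receiving packet i + 1 overlaps with sending packet i; the blocking calls above keep the sketch close to the pseudocode.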

Pipelined Binary Tree Broadcast

This algorithm combines binomial tree broadcast and linear pipeline broadcast, which makes it work well for both short and long messages. The aim is to keep as many nodes as possible busy while still being able to deliver short messages quickly. The message is split into k packages that are pipelined down a tree. Fibonacci trees are a good choice for shaping this tree, because an inner node cannot send a package to both children at the same time: the child that is served one step later can head a subtree that is one level shallower. This results in a binary tree structure.

We will assume in the following that communication is full-duplex. The Fibonacci tree structure has a depth of about d = log_φ(p) ≈ 1.44 · log_2(p), whereby φ ≈ 1.618 is the golden ratio.

The resulting runtime is approximately (2k + d) · (t_start + (n/k) · t_byte). Optimal is k = sqrt(d · n · t_byte / (2 · t_start)).

This results in a runtime of approximately 2 · n · t_byte + d · t_start + 2 · sqrt(2 · d · n · t_start · t_byte).

Message M := [m_1, m_2, ..., m_k]

for (i := 1; i <= k; i++)
    if (hasParent())
        blocking_receive m_i
    if (hasChild(left_child))
        send m_i to left_child
    if (hasChild(right_child))
        send m_i to right_child

[2] [4]
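The forwarding loop again maps directly onto MPI point-to-point calls. In the following sketch the parent and child ranks are assumed to have been computed beforehand from the Fibonacci tree structure and are passed in as parameters, with MPI_PROC_NULL marking a missing parent or child.

#include <mpi.h>

/* Pipelined tree broadcast (sketch): receive each packet from the
 * parent, then forward it to the left and the right child. Missing
 * neighbours are passed as MPI_PROC_NULL, which turns the corresponding
 * send or receive into a no-op (the root therefore skips the receive). */
void tree_pipeline_bcast(char *msg, int packet_size, int k,
                         int parent, int left, int right, MPI_Comm comm) {
    for (int i = 0; i < k; i++) {
        char *packet = msg + (long)i * packet_size;
        MPI_Recv(packet, packet_size, MPI_CHAR, parent, i, comm,
                 MPI_STATUS_IGNORE);
        MPI_Send(packet, packet_size, MPI_CHAR, left, i, comm);
        MPI_Send(packet, packet_size, MPI_CHAR, right, i, comm);
    }
}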

Two Tree Broadcast (23-Broadcast)

Definition

This algorithm aims to improve on a disadvantage of the pipelined tree models above: in those models (see the previous methods) the leaves only receive their data and never contribute to sending and spreading it.

The algorithm communicates concurrently over two binary trees, called tree A and tree B. Structurally, a binary tree has relatively more leaf nodes than inner nodes. The basic idea of the algorithm is to make every leaf node of tree A an inner node of tree B, and likewise every leaf node of tree B an inner node of tree A. Each tree transports half of the packets, so every processor both receives and forwards data, with the two trees being served in different steps. [5] [6] [7]

Tree construction

(Figure: examples of tree structures depending on the number of processors.)

The construction of the two parallel-working binary trees depends on the number of processors. As with the other structures, one processor acts as the root node and sends messages into both trees; it is not necessary to mark the root specially, because messages in a binary tree always travel from the top towards the leaves. There is no limitation on the number of processors for which two such binary trees can be built. Let the height of the combined structure be h = ⌈log_2(p + 2)⌉. Tree A and tree B then each have a height of h − 1. In particular, if the number of processors corresponds to p = 2^h − 2, both trees can be built completely, together with a root node.

To construct this model efficiently and easily, starting from one fully built tree, we can use two methods called "Shifting" and "Mirroring" to obtain the second tree. Assume tree A is already modeled and tree B is to be constructed based on tree A. We further assume that the processors are ordered from 0 to p − 1.

Shifting

Tree construction using "Shifting" Two Tree Broadcast Tree Construction using Shifting.gif
Tree construction using "Shifting"

The "Shifting" method, first copies tree A and moves every node one position to the left to get tree B. The node, which will be located on -1, becomes a child of processor .

Mirroring

"Mirroring" is ideal for an even number of processors. With this method tree B can be more easily constructed by tree A, because there are no structural transformations in order to create the new tree. In addition, a symmetric process makes this approach simple. This method can also handle an odd number of processors, in this case, we can set processor as root node for both trees. For the remaining processors "Mirroring" can be used.

Coloring

We need to find a schedule that makes sure no processor has to send or receive two messages of the two trees in the same step. An edge is a communication connection between two nodes; it can be labelled with either 0 or 1, so that every processor can alternate between its 0-labelled and its 1-labelled edges. The edges of A and B can be colored with the two colors 0 and 1 such that no processor has to use two edges of the same color at the same time.

In every even step the edges with color 0 are activated, and the edges with color 1 are activated in every odd step.
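For example, a processor whose parent edge in tree A is colored 0 and whose parent edge in tree B is colored 1 receives a packet of A's half of the message in the even steps and a packet of B's half in the odd steps, so it never has to receive two packets at once.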

Time complexity

In this case the number of packets k is divided in half for each tree: both trees work together on the total number of packets (k/2 in the upper tree plus k/2 in the bottom tree).

In each binary tree a packet advances by one level every two steps, because every processor alternates between its 0- and 1-colored edges. Hence it takes about 2h steps until every processor has received at least one packet; after that, each tree delivers one packet every two steps, so both trees together deliver one packet per step. Therefore, we can calculate the total number of steps as approximately k + 2 · log_2(p).

The resulting run time is (k + 2 · log_2(p)) · (t_start + (n/k) · t_byte). (Optimal k = sqrt(2 · n · log_2(p) · t_byte / t_start).)

This results in a run time of approximately n · t_byte + 2 · log_2(p) · t_start + 2 · sqrt(2 · n · log_2(p) · t_start · t_byte), so the bandwidth term n · t_byte is only half as large as that of the pipelined binary tree.
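For example, with p = 1024 processors and the approximation above, the start-up phase takes about 2 · log_2(1024) = 20 steps; after that the two trees together deliver one packet per step, so for long messages the bandwidth term n · t_byte dominates the run time.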

ESBT-Broadcasting (Edge-disjoint Spanning Binomial Trees)

In this section, another broadcasting algorithm with an underlying telephone communication model will be introduced. A d-dimensional hypercube creates a network with p = 2^d nodes, each of which is represented by a binary string of length d, the number of dimensions. Fundamentally, ESBT broadcasting (Edge-disjoint Spanning Binomial Trees) is based on hypercube graphs, pipelining (the message is divided into k packets) and binomial trees. The processor P_0 cyclically spreads the packets to the roots of the d ESBTs, and the roots of the ESBTs broadcast the data within their binomial trees. For all k packets to leave P_0, k steps are required, because all packets are distributed by P_0 one after another. It takes another d steps until the last leaf node receives its packet. In total, k + d steps are necessary to broadcast the message through the ESBTs.

The resulting run time is (k + d) · (t_start + (n/k) · t_byte). Optimal is k = sqrt(n · d · t_byte / t_start).

This results in a run time of approximately n · t_byte + d · t_start + 2 · sqrt(n · d · t_start · t_byte).

[8] [9]
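A small sketch of the distribution step at P_0, under the assumption that the d edge-disjoint spanning binomial trees are rooted at the d neighbors of processor 0 (ranks 2^j for dimension j); the forwarding inside each binomial tree is omitted.

/* ESBT broadcast, distribution step at processor 0 (sketch).
 * Packet i is handed to the root of ESBT number i mod d, assumed here
 * to be the neighbor of processor 0 across dimension i mod d, i.e.
 * rank 2^(i mod d). Each root then forwards its packets along its own
 * binomial tree (not shown). */
int esbt_root_for_packet(int i, int d) {
    return 1 << (i % d);
}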

See also

Collective operation
Message Passing Interface
Reduction operator
Two-tree broadcast
Hypercube (communication pattern)

References

  1. Kumar, Vipin (2002). Introduction to Parallel Computing (2nd ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. pp. 185, 151, 66. ISBN 9780201648652.
  2. Bruck, J.; Cypher, R.; Ho, C-T. (1992). "Multiple message broadcasting with generalized Fibonacci trees". Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing. pp. 424–431. doi:10.1109/SPDP.1992.242714. ISBN 0-8186-3200-3. S2CID 2846661.
  3. MPI: A Message-Passing Interface Standard, Version 3.0, Message Passing Interface Forum, p. 148, 2012.
  4. Michael Ikkert, T. Kieritz, P. Sanders: Parallele Algorithmen - Script (German), Karlsruhe Institute of Technology, pp. 29–32, 2009.
  5. Michael Ikkert, T. Kieritz, P. Sanders: Parallele Algorithmen - Script (German), Karlsruhe Institute of Technology, pp. 33–37, 2009.
  6. P. Sanders (German), Karlsruhe Institute of Technology, pp. 82–96, 2018.
  7. Sanders, Peter; Speck, Jochen; Träff, Jesper Larsson (2009). "Two-tree algorithms for full bandwidth broadcast, reduction and scan". Parallel Computing. 35 (12): 581–594. doi:10.1016/j.parco.2009.09.001. ISSN 0167-8191.
  8. Michael Ikkert, T. Kieritz, P. Sanders: Parallele Algorithmen - Script (German), Karlsruhe Institute of Technology, pp. 40–42, 2009.
  9. P. Sanders (German), Karlsruhe Institute of Technology, pp. 101–104, 2018.