Hypercube (communication pattern)

The d-dimensional hypercube is a network topology for parallel computers with 2^d processing elements. The topology allows for an efficient implementation of some basic communication primitives such as Broadcast, All-Reduce, and Prefix sum. [1] The processing elements are numbered 0 through 2^d − 1. Each processing element is adjacent to the d processing elements whose numbers differ from its own in exactly one bit. The algorithms described on this page exploit this structure efficiently.
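
As a small illustration of this numbering, the following Python sketch (the helper name is hypothetical, not part of any standard API) lists the neighbors of a node by flipping each of its d bits:

def neighbors(i, d):
    """Return the d neighbors of node i in a d-dimensional hypercube."""
    return [i ^ (1 << k) for k in range(d)]

# Example: in a 3-dimensional hypercube, node 5 (binary 101) is adjacent
# to nodes 4 (100), 7 (111) and 1 (001).
print(neighbors(5, 3))  # [4, 7, 1]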

Algorithm outline

Most of the communication primitives presented in this article share a common template. [2] Initially, each processing element possesses one message that must reach every other processing element during the course of the algorithm. The following pseudo code sketches the necessary communication steps. Here, Initialization, Operation, and Output are placeholders that depend on the given communication primitive (see next section).

Input: message m.
Output: depends on Initialization, Operation and Output.

Initialization
for k := 0 to d − 1 do
    y := i XOR 2^k
    Send m to y
    Receive m' from y
    Operation(m, m')
endfor
Output

Each processing element iterates over its d neighbors (the expression i XOR 2^k negates the k-th bit in i's binary representation, thereby obtaining the numbers of its neighbors). In each iteration, each processing element exchanges a message with this neighbor and processes the received message afterwards. The processing operation depends on the communication primitive.
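
The template can be simulated sequentially in a few lines. The Python sketch below is illustrative only; the function name and the round-based simulation are assumptions, not part of the original pseudo code. It exchanges messages along one hypercube dimension per round and applies a caller-supplied operation:

def hypercube_template(d, messages, operation):
    """Simulate the generic hypercube exchange: d rounds, one dimension per round.

    messages[i] is the local state of processing element i;
    operation(own, received) returns the new local state.
    """
    state = list(messages)
    for k in range(d):
        new_state = list(state)
        for i in range(1 << d):
            y = i ^ (1 << k)            # neighbor across dimension k
            new_state[i] = operation(state[i], state[y])
        state = new_state
    return state

# Example: with "keep the smaller value" as the operation, every node
# ends up with the global minimum after d rounds.
print(hypercube_template(3, [5, 3, 8, 1, 9, 2, 7, 6], min))  # [1, 1, 1, 1, 1, 1, 1, 1]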

Figure: Algorithm outline applied to the 3-dimensional hypercube. In the first step (before any communication), each processing element possesses one message (blue). Communication is marked red. After each step, the processing elements store the received message, but other operations are also possible.

Communication primitives

Prefix sum

At the beginning of a prefix sum operation, each processing element i owns a message m_i. The goal is to compute m_0 ⊕ m_1 ⊕ … ⊕ m_i on processing element i, where ⊕ is an associative operation. The following pseudo code describes the algorithm.

Input: message m_i of processor i.
Output: prefix sum m_0 ⊕ … ⊕ m_i of processor i.

x := m_i
σ := m_i
for k := 0 to d − 1 do
    y := i XOR 2^k
    Send σ to y
    Receive m' from y
    σ := σ ⊕ m'
    if bit k in i is set then
        x := x ⊕ m'
endfor

The algorithm works as follows. Observe that hypercubes of dimension d can be split into two hypercubes of dimension d − 1. Refer to the sub cube containing nodes with a leading 0 as the 0-sub cube and the sub cube consisting of nodes with a leading 1 as the 1-sub cube. Once both sub cubes have calculated the prefix sum, the sum over all elements in the 0-sub cube has to be added to every element in the 1-sub cube, since every processing element in the 0-sub cube has a lower rank than the processing elements in the 1-sub cube. The pseudo code stores the prefix sum in variable x and the sum over all nodes in a sub cube in variable σ. This makes it possible for all nodes in the 1-sub cube to receive the sum over the 0-sub cube in every step.
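
A sequential simulation of this scheme, sketched in Python (the simulation structure is an assumption; the per-node logic follows the pseudo code above, with the received value placed first when the neighbor has the lower rank so the sketch also works for non-commutative operators):

def hypercube_prefix_sum(d, m, op=lambda a, b: a + b):
    """Simulate the hypercube prefix sum; m[i] is the message of node i."""
    n = 1 << d
    x = list(m)        # tentative prefix sum of node i (variable x in the text)
    sigma = list(m)    # sum over node i's current sub cube (variable sigma)
    for k in range(d):
        sent = list(sigma)                  # everyone sends sigma simultaneously
        for i in range(n):
            received = sent[i ^ (1 << k)]   # value from neighbor across bit k
            if i & (1 << k):                # neighbor lies in the lower-ranked sub cube
                sigma[i] = op(received, sigma[i])
                x[i] = op(received, x[i])
            else:                           # neighbor lies in the higher-ranked sub cube
                sigma[i] = op(sigma[i], received)
    return x

# Example: inclusive prefix sums of [1, 2, 3, 4, 5, 6, 7, 8].
print(hypercube_prefix_sum(3, [1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]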

This results in a factor of log p for T_start and a factor of n · log p for T_byte, giving a total runtime of T(n, p) = (T_start + n · T_byte) · log p, where T_start denotes the startup time of a communication and T_byte the transmission time per byte of a message of length n.

Figure: Example of a prefix sum calculation. Upper number: tentative prefix sum (variable x). Lower number: sum over all elements in the sub cube (variable σ).

All-gather / all-reduce

All-gather operations start with each processing element i having a message m_i. The goal of the operation is for each processing element to know the messages of all other processing elements, i.e. to compute x = m_0 ∘ m_1 ∘ … ∘ m_{p−1}, where ∘ denotes concatenation. The operation can be implemented following the algorithm template.

Input: message x := m_i at processing unit i.
Output: all messages m_0 ∘ m_1 ∘ … ∘ m_{p−1}.

for k := 0 to d − 1 do
    y := i XOR 2^k
    Send x to y
    Receive x' from y
    x := x ∘ x'
endfor

With each iteration, the transferred message doubles in length. This leads to a runtime of ∑_{j=0}^{d−1} (T_start + 2^j · n · T_byte) ≈ log(p) · T_start + (p − 1) · n · T_byte, where n is the length of a single message.

The same principle can be applied to the All-Reduce operation, but instead of concatenating the messages, it performs a reduction operation on the two messages. It is therefore a Reduce operation where all processing units know the result. Compared to a normal reduce operation followed by a broadcast, All-Reduce in hypercubes reduces the number of communication steps.
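
A minimal sequential sketch of the all-gather (illustrative only; the function name is an assumption). The lower-ranked half of each sub cube is placed first, so every node ends up with the messages in rank order:

def hypercube_all_gather(d, m):
    """Simulate the hypercube all-gather; m[i] is the message of node i."""
    n = 1 << d
    x = [[msg] for msg in m]                # each node starts with its own message
    for k in range(d):
        sent = [list(block) for block in x]
        for i in range(n):
            received = sent[i ^ (1 << k)]
            if i & (1 << k):                # received block covers lower ranks
                x[i] = received + x[i]
            else:
                x[i] = x[i] + received
    return x

# After d rounds every node holds m_0 ... m_{p-1} in rank order.
print(hypercube_all_gather(2, ["a", "b", "c", "d"])[0])  # ['a', 'b', 'c', 'd']

Replacing the list concatenation with a reduction operator turns the same sketch into an All-Reduce.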

All-to-all

Here every processing element has a unique message for each of the other processing elements.

Input: message m_{i,j} at processing element i destined for processing element j.

for k := d − 1 down to 0 do
    y := i XOR 2^k
    Receive from processing element y:
        all messages destined for my k-dimensional sub cube
    Send to processing element y:
        all messages destined for its k-dimensional sub cube
endfor

With each iteration, a message comes one dimension closer to its destination if it has not arrived yet. Hence, all messages have reached their targets after at most d = log p steps. In every step, each processing element sends p/2 messages: in the first iteration, half of its messages are not meant for its own sub cube. In every following step, the sub cube is only half the size of the previous one, but in the previous step exactly the same number of messages arrived from another processing element.

This results in a run-time of log(p) · (T_start + (p/2) · n · T_byte).
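
The routing can be simulated sequentially; the Python sketch below (function name and buffer representation are assumptions) forwards each message across one dimension per round, highest dimension first, exactly as in the pseudo code:

def hypercube_all_to_all(d, msg):
    """Simulate the hypercube all-to-all.

    msg[i][j] is the message of node i destined for node j.
    Returns recv, where recv[j][i] is the message node j received from node i.
    """
    n = 1 << d
    # buffer[i] holds (source, destination, payload) triples currently at node i
    buffer = [[(i, j, msg[i][j]) for j in range(n)] for i in range(n)]
    for k in reversed(range(d)):
        outgoing = [[] for _ in range(n)]
        for i in range(n):
            keep = []
            for (src, dst, payload) in buffer[i]:
                # messages whose destination lies on the other side of bit k
                # are handed to the neighbor across that dimension
                if (dst ^ i) & (1 << k):
                    outgoing[i].append((src, dst, payload))
                else:
                    keep.append((src, dst, payload))
            buffer[i] = keep
        for i in range(n):
            buffer[i].extend(outgoing[i ^ (1 << k)])
    return [{src: payload for (src, _, payload) in buffer[j]} for j in range(n)]

# Example: node i sends the string "i->j" to every node j.
msgs = [[f"{i}->{j}" for j in range(4)] for i in range(4)]
print(sorted(hypercube_all_to_all(2, msgs)[3].items()))
# [(0, '0->3'), (1, '1->3'), (2, '2->3'), (3, '3->3')]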

ESBT-broadcast

The ESBT-broadcast (Edge-disjoint Spanning Binomial Tree) algorithm [3] is a pipelined broadcast algorithm with optimal runtime for clusters with a hypercube network topology. The algorithm embeds d edge-disjoint binomial trees in the hypercube, such that each neighbor of processing element 0 is the root of a spanning binomial tree on 2^d − 1 nodes. To broadcast a message, the source node splits its message into k chunks of equal size and cyclically sends them to the roots of the binomial trees. Upon receiving a chunk, the binomial trees broadcast it.

Runtime

In each step, the source node sends one of its k chunks to a binomial tree. Broadcasting the chunk within the binomial tree takes d steps. Thus, it takes k steps to distribute all chunks and additionally d steps until the last binomial tree broadcast has finished, resulting in k + d steps overall. Therefore, the runtime for a message of length n is T(n, p) = (n/k · T_byte + T_start) · (k + d). With the optimal chunk count k* = sqrt(n · d · T_byte / T_start), the optimal runtime of the algorithm is T*(n, p) = n · T_byte + log(p) · T_start + 2 · sqrt(n · log(p) · T_byte · T_start).
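
A short worked minimization, as a sketch only (the algebraic expansion below is not taken from the source; it merely rearranges the runtime formula above):

T(n, p) = (n/k · T_byte + T_start) · (k + d) = n · T_byte + d · T_start + (n · d / k) · T_byte + k · T_start

The last two terms are balanced, and hence minimized, at k* = sqrt(n · d · T_byte / T_start), which yields the optimal runtime T*(n, p) stated above with d = log p.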

Construction of the binomial trees

Figure: A 3-dimensional hypercube with three ESBTs embedded.

This section describes how to construct the d binomial trees systematically. First, construct a single binomial spanning tree on 2^d nodes as follows. Number the nodes from 0 to 2^d − 1 and consider their binary representation. The children of each node are then obtained by negating single leading zeroes. This results in a single binomial spanning tree. To obtain d edge-disjoint copies of the tree, translate and rotate the nodes: for the k-th copy of the tree, apply an XOR operation with 2^k to each node. Subsequently, right-rotate all nodes by k digits. The resulting binomial trees are edge-disjoint and therefore fulfill the requirements for the ESBT-broadcasting algorithm.
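
The base tree is easy to write down in code. The Python sketch below (hypothetical helper name) computes the children of a node by negating its leading zeroes one at a time, as described above; the XOR and rotation steps that produce the d edge-disjoint copies are applied to these node labels afterwards:

def binomial_tree_children(v, d):
    """Children of node v in the binomial spanning tree on 2**d nodes,
    obtained by negating single leading zeroes of v's d-bit representation."""
    children = []
    for k in reversed(range(d)):       # scan bits from most to least significant
        if v & (1 << k):               # the first one-bit ends the run of leading zeroes
            break
        children.append(v | (1 << k))  # negate one leading zero
    return children

# Base tree for d = 3, rooted at node 000:
for v in range(8):
    print(f"{v:03b} -> {[format(c, '03b') for c in binomial_tree_children(v, 3)]}")
# 000 -> ['100', '010', '001'], 001 -> ['101', '011'], 010 -> ['110'], 011 -> ['111'], ...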

References

  1. Grama, A. (2003). Introduction to Parallel Computing (2nd ed.). Addison-Wesley. ISBN 978-0201648652.
  2. Foster, I. (1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley. ISBN 0201575949.
  3. Johnsson, S.L.; Ho, C.-T. (1989). "Optimum broadcasting and personalized communication in hypercubes". IEEE Transactions on Computers. 38 (9): 1249–1268. doi:10.1109/12.29465. ISSN 0018-9340.