All-to-all (parallel pattern)

Last updated September 26, 2021

In parallel computing, all-to-all (also known as index operation or total exchange) is a collective operation, where each processor sends an individual message to every other processor.

The number of communication rounds and the overall communication volume are measures to evaluate the quality of an all-to-all algorithm. We consider a single-ported full-duplex machine throughout this article. On such a machine, an all-to-all algorithm requires at least $\lceil \log _{2}n\rceil$ communication rounds. Further a minimum of $\left\lceil m(n-1)\right\rceil$ units of data is transferred. Optimum for both these measures can not be achieved simultaneously.^[1]

Depending on the network topology (fully connected, hypercube, ring), different all-to-all algorithms are required.

All-to-all algorithms based on topology

We consider a single-ported machine. The way the data is routed through the network depends on its underlying topology. We take a look at all-to-all algorithms for common network topologies.

Hypercube

A hypercube is a network topology, where two processors share a link, if the hamming distance of their indices is one. The idea of an all-to-all algorithm is to combine messages belonging to the same subcube, and then distribute them.

Ring

An all-to-all algorithm in a ring topology is very intuitive. Initially a processor sends a message of size m(p-1) to one of its neighbors. Communication is performed in the same direction on all processors. When a processor receives a message, it extracts the part that belongs to it and forwards the remainder of the message to the next neighbor. After (p-1) communication rounds, every message is distributed to its destination.

The time taken by this algorithm is $(t_{s}+{\frac {1}{2}}t_{w}mp)(p-1)$ .^[2] Here $t_{s}$ is the startup cost for a communication, and $t_{w}$ is the cost of transmitting a unit of data. This term can further be improved when half of the messages are sent in one and the other half in the other direction. This way, messages arrive earlier at their destination.

Mesh

For a mesh we look at a ${\sqrt {p}}\times {\sqrt {p}}$ mesh. This algorithm is easily adaptable for any mesh. An all-to-all algorithm in a mesh consists of two communication phases. First, each processors groups the messages into ${\sqrt {p}}$ groups, each containing ${\sqrt {p}}$ messages. Messages are in the same group, if their destined processors share the same row. Next, an all-to-all operation among rows is performed. Each processor now holds all relevant information for processors in his column. Again, the messages need to be rearranged. After another all-to-all operation, this time in respect to columns, each processor ends up with its messages.

The overall time of communication for this algorithm is $(2t_{s}+t_{w}mp)({\sqrt {p}}-1)$ . Additionally, time for the local rearrangement of messages adds to the overall runtime of the algorithm.

1-factor algorithm

Again, we consider a single-ported machine. A trivial algorithm, is to send (p-1) asynchronous messages into the network for each processor. The performance of this algorithm is poor, which is due to congestion arising because of the bisection width of the network.^[3] More sophisticated algorithms combine messages to reduce the number of send operations and try to control congestion.

For large messages, the cost of a startup is small compared to the cost of transmitting the payload. It is faster to send messages directly to their destination. In the following algorithm an all-to-all algorithm is performed using (p-1) one-to-one routings.

// p odd: // pe index  $j\in \{0,\dots ,p-1\}$  for i := 0 to p-1 do     Exchange data with PE  $(i-j)\,\mod \,p$

// p even: // pe index  $j\in \{0,\dots ,p-1\}$  for i := 0 to p-2 do     idle :=  ${\frac {p}{2}}i\,\mod \,(p-1)$      if j = p-1 then         exchange data with PE idle     else         if j = idle then             exchange data with pe p-1         else             exchange data with PE  $(i-j)mod(p-1)$

The algorithm has a different behavior, whether p is odd or even. In case p is odd, one processor is idle in each iteration. For an even p, this idle processor communicates with the processor with index p-1. The total time taken is $(p-1)(t_{s}+t_{w}m)$ for an even p, and $p(t_{s}+t_{w}m)$ for an odd p respectively.

Instead of pairing processor j with processor $(i-j){\bmod {p}}$ in iteration i, we can also use the exclusive-or of j and i to determine a mapping. This approach requires p to be a power of two. Depending on the underlying topology of the network, one approach might be superior to the other. The exclusive or approach is superior, when performing pairwise one-to-one routings in a hypercube or fat-tree.^[4]

Related Research Articles

Network topology is the arrangement of the elements of a communication network. Network topology can be used to define or describe the arrangement of various types of telecommunication networks, including command and control radio networks, industrial fieldbusses and computer networks.

In computing, MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data.

In computer science, a graph is an abstract data type that is meant to implement the undirected graph and directed graph concepts from the field of graph theory within mathematics.

The fat tree network is a universal network for provably efficient communication. It was invented by Charles E. Leiserson of the Massachusetts Institute of Technology in 1985.

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications especially in ranking problems such as feedback arc set. Topological sorting is possible even when the DAG has disconnected components.

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers $x 0, x 1, x 2, ...$ is a second sequence of numbers $y 0, y 1, y 2, ...$ , the sums of prefixes of the input sequence:

The bulk synchronous parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It is similar to the parallel random access machine (PRAM) model, but unlike PRAM, BSP does not take communication and synchronization for granted. In fact, quantifying the requisite synchronization and communication is an important part of analyzing a BSP algorithm.

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task has begun, all network nodes are either unaware which node will serve as the "leader" of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.

In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication for two-dimensional meshes first described in 1969 by Lynn Elliot Cannon.

Multistage interconnection networks (MINs) are a class of high-speed computer networks usually composed of processing elements (PEs) on one end of the network and memory elements (MEs) on the other end, connected by switching elements (SEs). The switching elements themselves are usually connected to each other in stages, hence the name.

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.

Collective operations are building blocks for interaction patterns, that are often used in SPMD algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations.

In computer science, the reduction operator is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result. Reduction operators are associative and often commutative. The reduction of sets of elements is an integral part of programming models such as Map Reduce, where a reduction operator is applied (mapped) to all elements before they are reduced. Other parallel algorithms use reduction operators as primary operations to solve more complex problems. Many reduction operators can be used for broadcasting to distribute data to all processors.

In computer networking, hypercube networks are a type of network topology used to connect multiple processors with memory modules and accurately route data. Hypercube networks consist of $2 m$ nodes, which form the vertices of squares to create an internetwork connection. A hypercube is basically a multidimensional mesh network with two nodes in each dimension. Due to similarity, such topologies are usually grouped into a $k$ -ary $d$ -dimensional mesh topology family, where $d$ represents the number of dimensions and $k$ represents the number of nodes in each dimension.

A butterfly network is a technique to link multiple computers into a high-speed network. This form of multistage interconnection network topology can be used to connect different nodes in a multiprocessor system. The interconnect network for a shared memory multiprocessor system must have low latency and high bandwidth unlike other network systems, like local area networks (LANs) or internet for three reasons:

The two-tree broadcast is an algorithm that implements a broadcast communication pattern on a distributed system using message passing. A broadcast is a commonly used collective operation that sends data from one processor to all other processors. The two-tree broadcast communicates concurrently over two binary trees that span all processors. This achieves full usage of the bandwidth in the full-duplex communication model while having a startup latency logarithmic in the number of partaking processors. The algorithm can also be adapted to perform a reduction or prefix sum.

A central problem in algorithmic graph theory is the shortest path problem. Hereby, the problem of finding the shortest path between every pair of nodes is known as all-pair-shortest-paths (APSP) problem. As sequential algorithms for this problem often yield long runtimes, parallelization has shown to be beneficial in this field. In this article two efficient algorithms solving this problem are introduced.

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

$-dimensional hypercube is a network topology for parallel computers with processing elements. The topology allows for an efficient implementation of some basic communication primitives such as Broadcast, All-Reduce, and Prefix sum. The processing elements are numbered through . Each processing element is adjacent to processing elements whose numbers differ in one and only one bit. The algorithms described in this page utilize this structure efficiently.$

Broadcast is a collective communication primitive in parallel programming to distribute programming instructions or data to nodes in a cluster it is the reverse operation of reduce. The broadcast operation is widely used in parallel algorithms, such as matrix-vector multiplication, Gaussian elimination and shortest paths.

References

↑ Bruck, Jehoshua; Ho, Ching-Tien; Kipnis, Shlomo; Weathersby, Derrick (1997). "Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems" (PDF). IEEE Transactions on Parallel and Distributed Systems. 8 (11): 1143–1156. doi:10.1109/71.642949.
↑ Grama, Ananth (2003). Introduction to parallel computing.
↑ Hambrusch, Susanne E.; Hameed, Farooq; Khokhar, Ashfaq A. (May 1995). "Communication operations on coarse-grained mesh architectures". Parallel Computing. 21 (5): 731–751. doi:10.1016/0167-8191(94)00110-V.
↑ Thakur, Rajeev; Choudhary, Alok (26–29 April 1994). All-to-All Communication on Meshes with Wormhole Routing. Proceedings of 8th International Parallel Processing Symposium. Cancun, Mexico.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Bruck, Jehoshua; Ho, Ching-Tien; Kipnis, Shlomo; Weathersby, Derrick (1997). "Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems" (PDF). IEEE Transactions on Parallel and Distributed Systems. 8 (11): 1143–1156. doi:10.1109/71.642949.

[2] Grama, Ananth (2003). Introduction to parallel computing.

[3] Hambrusch, Susanne E.; Hameed, Farooq; Khokhar, Ashfaq A. (May 1995). "Communication operations on coarse-grained mesh architectures". Parallel Computing. 21 (5): 731–751. doi:10.1016/0167-8191(94)00110-V.

[4] Thakur, Rajeev; Choudhary, Alok (26–29 April 1994). All-to-All Communication on Meshes with Wormhole Routing. Proceedings of 8th International Parallel Processing Symposium. Cancun, Mexico.

[1]

[2]

[3]

[4]