Parallel all-pairs shortest path algorithm

A central problem in algorithmic graph theory is the shortest path problem. The problem of finding the shortest path between every pair of nodes is known as the all-pair-shortest-paths (APSP) problem. As sequential algorithms for this problem often yield long runtimes, parallelization has proven beneficial in this field. In this article two efficient algorithms solving this problem are introduced.

Another variation of the problem is the single-source-shortest-paths (SSSP) problem, which also has parallel approaches: see Parallel single-source shortest path algorithm.

Problem definition

Let $G = (V, E)$ be a directed graph with the set of nodes $V$ and the set of edges $E \subseteq V \times V$. Each edge $e \in E$ has a weight $w(e)$ assigned. The goal of the all-pair-shortest-paths problem is to find the shortest path between all pairs of nodes of the graph. For this path to be unique it is required that the graph does not contain cycles with a negative weight.

In the remainder of the article it is assumed that the graph $G$ is represented using an adjacency matrix. We expect the output of the algorithm to be a distance matrix $D$. In $D$, every entry $d_{i,j}$ is the weight of the shortest path in $G$ from node $i$ to node $j$.
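As a small illustration of this representation, the following Python sketch (the language and the four-node example graph are our own choices, not part of the original presentation) encodes a graph as an adjacency matrix, with infinity marking a missing edge and 0 on the diagonal:

    import math

    INF = math.inf  # weight of a missing edge

    # Hypothetical directed 4-node example graph; A[i][j] is the weight
    # of the edge from node i to node j, 0 on the diagonal.
    A = [
        [0,   3,   INF, 1  ],
        [3,   0,   5,   INF],
        [INF, 5,   0,   2  ],
        [1,   INF, 2,   0  ],
    ]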

The Floyd algorithm presented later can handle negative edge weights, whereas the Dijkstra algorithm requires all edges to have a non-negative weight.

Dijkstra algorithm

The Dijkstra algorithm was originally proposed as a solver for the single-source-shortest-paths problem. However, the algorithm can easily be used for solving the all-pair-shortest-paths problem by executing the single-source variant with each node in the role of the root node.

In pseudocode such an implementation could look as follows:

 1    func DijkstraSSSP(G, v) {
 2        ... // standard SSSP implementation here
 3        return d_v;
 4    }
 5
 6    func DijkstraAPSP(G) {
 7        D := |V| x |V| matrix
 8        for i from 1 to |V| {
 9            // D[i] denotes the i-th row of D
10            D[i] := DijkstraSSSP(G, i)
11        }
12    }

In this example we assume that DijkstraSSSP takes the graph $G$ and the root node $v$ as input. The result of the execution is the distance list $d_v$. In $d_v$, the $i$-th element stores the distance from the root node $v$ to the node $i$. Therefore the list corresponds exactly to the $v$-th row of the APSP distance matrix $D$. For this reason, DijkstraAPSP iterates over all nodes of the graph and executes DijkstraSSSP with each of them as root node, storing the results in $D$.

The runtime of DijkstraSSSP is $O(|V|^2)$, as we expect the graph to be represented using an adjacency matrix. Therefore DijkstraAPSP has a total sequential runtime of $O(|V|^3)$.
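The pseudocode can be turned into runnable code directly. The following Python sketch (our own rendering, written against an adjacency matrix like the one above) uses the array-based $O(|V|^2)$ variant of Dijkstra's algorithm, which matches the runtime assumed here:

    import math

    def dijkstra_sssp(A, root):
        """Array-based O(|V|^2) Dijkstra on an adjacency matrix A;
        returns the distance list d_root."""
        n = len(A)
        dist = [math.inf] * n
        dist[root] = 0
        finished = [False] * n
        for _ in range(n):
            # Pick the unfinished node x with the smallest tentative distance.
            x = min((v for v in range(n) if not finished[v]),
                    key=lambda v: dist[v], default=None)
            if x is None or dist[x] == math.inf:
                break  # remaining nodes are unreachable
            finished[x] = True
            # Adjust the distances of all neighbors of x.
            for v in range(n):
                if dist[x] + A[x][v] < dist[v]:
                    dist[v] = dist[x] + A[x][v]
        return dist

    def dijkstra_apsp(A):
        """Runs DijkstraSSSP once per root node; row i of D is d_i."""
        return [dijkstra_sssp(A, i) for i in range(len(A))]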

Parallelization for up to |V| processors

A trivial parallelization can be obtained by parallelizing the loop of DijkstraAPSP in line 8. However, when using the sequential DijkstraSSSP this limits the number of processors to be used by the number of iterations executed in the loop. Therefore, for this trivial parallelization $|V|$ is an upper bound for the number of processors.

For example, let the number of processors $p$ be equal to the number of nodes $|V|$. This results in each processor executing DijkstraSSSP exactly once in parallel. However, when there are only, for example, $p = \frac{|V|}{2}$ processors available, each processor has to execute DijkstraSSSP twice.

In total this yields a runtime of $O(|V|^2 \cdot \frac{|V|}{p})$, when $|V|$ is a multiple of $p$. Consequently, the efficiency of this parallelization is perfect: employing $p$ processors reduces the runtime by the factor $p$.

Another benefit of this parallelization is that no communication between the processors is required. However, it is required that every processor has enough local memory to store the entire adjacency matrix of the graph.
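In Python, for instance, this trivial parallelization could be sketched with a process pool (a choice of ours; the article does not prescribe a mechanism). Note how it mirrors the properties just described: the workers do not communicate, but each one receives its own full copy of the adjacency matrix:

    from concurrent.futures import ProcessPoolExecutor

    def dijkstra_apsp_parallel(A, num_workers):
        """Trivial parallelization: the loop over root nodes is farmed out
        to a pool of workers. With num_workers == |V| every worker runs
        DijkstraSSSP once; with fewer workers, e.g. |V|/2, each worker
        handles several root nodes."""
        n = len(A)
        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            # map() keeps the order of the roots, so row i of D is d_i.
            # Each task ships a full copy of A to the worker process.
            return list(pool.map(dijkstra_sssp, [A] * n, range(n)))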

Parallelization for more than |V| processors

[Figure: the example graph with four nodes (Apsp dijkstra graph.png)]
[Figure: the computation of the distance list $d_A$ across the iterations (Apsp dijkstra distancelist.png)]

If more than $|V|$ processors shall be used for the parallelization, it is required that multiple processors take part in the DijkstraSSSP computation. For this reason, the parallelization is split into two levels.

For the first level the processors are split into $|V|$ partitions. Each partition is responsible for the computation of a single row of the distance matrix $D$. This means each partition has to evaluate one DijkstraSSSP execution with a fixed root node. With this definition each partition has a size of $k = \frac{p}{|V|}$ processors. The partitions can perform their computations in parallel, as the results of each are independent of each other. Therefore, the parallelization presented in the previous section corresponds to a partition size of $k = 1$ with $p = |V|$ processors.

The main difficulty is the parallelization of multiple processors executing DijkstraSSSP for a single root node. The idea for this parallelization is to distribute the management of the distance list $d_v$ in DijkstraSSSP within the partition. Each processor in the partition therefore is exclusively responsible for $\frac{|V|}{k}$ elements of $d_v$. For example, consider $|V| = 4$ and $p = 8$: this yields a partition size of $k = 2$. In this case, the first processor of each partition is responsible for $d_{v,1}$ and $d_{v,2}$, and the second processor is responsible for $d_{v,3}$ and $d_{v,4}$. Hereby, the total distance list is $d_v = [d_{v,1}, d_{v,2}, d_{v,3}, d_{v,4}]$.
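This ownership scheme is a simple index calculation; the hypothetical helper below (assuming, as in the runtime analysis, that $|V|$ is a multiple of $k$) reproduces the example, using 0-based indices for the 1-based elements $d_{v,1}, \dots, d_{v,4}$:

    def owned_indices(rank, k, n):
        """Indices of the distance list d_v managed by the processor
        with the given 0-based rank inside a partition of size k,
        for n = |V| nodes; assumes n is a multiple of k."""
        chunk = n // k
        return range(rank * chunk, (rank + 1) * chunk)

    # |V| = 4 and p = 8 give partitions of size k = 2:
    print(list(owned_indices(0, 2, 4)))  # [0, 1], i.e. d_v,1 and d_v,2
    print(list(owned_indices(1, 2, 4)))  # [2, 3], i.e. d_v,3 and d_v,4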

The DijkstraSSSP algorithm mainly consists of the repetition of two steps: First, the nearest node $x$ in the distance list $d_v$ has to be found. For this node the shortest path has already been found. Afterwards the distance of all neighbors of $x$ has to be adjusted in $d_v$.

These steps have to be altered as follows, because for the parallelization $d_v$ has been distributed across the partition:

  1. Find the node $x$ with the shortest distance $\tilde{x}$ in $d_v$.
    • Each processor owns a part of $d_v$: each processor scans for the local minimum $\tilde{x}_i$ in its part, for example using linear search.
    • Compute the global minimum $\tilde{x}$ in $d_v$ by performing a reduce-operation across all $\tilde{x}_i$.
    • Broadcast the global minimum $\tilde{x}$ to all nodes in the partition.
  2. Adjust the distance of all neighbors of $x$ in $d_v$.
    • Every processor now knows the global nearest node $x$ and its distance. Based on this information, each processor adjusts those neighbors of $x$ in $d_v$ which it manages. (A sketch of one such iteration follows this list.)
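One such iteration could be sketched as follows with mpi4py (an assumption on our part: the article names no message-passing library; we also rely on mpi4py's pickle-based reductions, where MPI.MIN applied to (distance, node) tuples selects the lexicographic minimum, so the reduce- and broadcast-steps collapse into a single allreduce):

    from mpi4py import MPI
    import math

    def sssp_iteration(part_comm, A, dist, finished, my_indices):
        """One distributed DijkstraSSSP iteration within a partition.
        part_comm is the partition's communicator; each process manages
        the entries of dist listed in my_indices. Sketch only."""
        # Step 1a: local minimum over the entries managed by this process.
        local = min(((dist[v], v) for v in my_indices if not finished[v]),
                    default=(math.inf, -1))
        # Steps 1b/1c: reduce to the global minimum and hand it to every
        # process of the partition in one combined operation.
        x_dist, x = part_comm.allreduce(local, op=MPI.MIN)
        if x == -1 or x_dist == math.inf:
            return None  # no reachable unfinished node is left
        finished[x] = True
        # Step 2: adjust the locally managed neighbors of x.
        for v in my_indices:
            if x_dist + A[x][v] < dist[v]:
                dist[v] = x_dist + A[x][v]
        return x

A driver would call this routine $|V|$ times per root node. For simplicity the sketch passes the full adjacency matrix, whereas in the storage scheme described below each process only needs the columns of the nodes it manages.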

The total runtime of such an iteration of DijkstraSSSP performed by a partition of size $k$ can be derived based on the performed subtasks:

  • The linear search for the local minimum $\tilde{x}_i$: $O(\frac{|V|}{k})$
  • The reduce- and broadcast-operations: these can be implemented efficiently, for example using binomial trees, and take $O(\log k)$ time.

For $|V|$ iterations this results in a total runtime of $O(|V| \cdot (\frac{|V|}{k} + \log k))$. After substituting the definition of $k = \frac{p}{|V|}$, this yields the total runtime for DijkstraAPSP: $O(\frac{|V|^3}{p} + |V| \log \frac{p}{|V|})$.

The main benefit of this parallelization is that it is not required anymore that every processor stores the entire adjacency matrix. Instead, it is sufficient when each processor within a partition only stores the columns of the adjacency matrix of the nodes for which it is responsible. Given a partition size of $k$, each processor only has to store $\frac{|V|}{k}$ columns of the adjacency matrix. A downside, however, is that this parallelization comes with a communication overhead due to the reduce- and broadcast-operations.

Example

The graph used in this example is the one presented in the image with four nodes.

The goal is to compute the distance matrix $D$ with $p = 8$ processors. For this reason, the processors are divided into four partitions with two processors each. For the illustration we focus on the partition which is responsible for the computation of the shortest paths from node A to all other nodes. Let the processors of this partition be named p1 and p2.

The computation of the distance list $d_A$ across the different iterations is visualized in the second image.

The top row in the image corresponds to $d_A$ after the initialization, the bottom one to $d_A$ after the termination of the algorithm. The nodes are distributed in a way that p1 is responsible for the nodes A and B, while p2 is responsible for C and D. The distance list $d_A$ is distributed accordingly. For the second iteration the subtasks executed are shown explicitly in the image:

  1. Computation of the local minimum node in $d_A$
  2. Computation of the global minimum node in $d_A$ through a reduce-operation
  3. Broadcast of the global minimum node in $d_A$
  4. Marking of the global nearest node as "finished" and adjusting the distance of its neighbors

Floyd algorithm

The Floyd algorithm solves the all-pair-shortest-paths problem for directed graphs. With the adjacency matrix $A$ of a graph as input, it calculates shorter paths iteratively. After $|V|$ iterations the distance matrix contains all the shortest paths. The following describes a sequential version of the algorithm in pseudocode:

 1    func Floyd_All_Pairs_SP(A) {
 2        D^(0) := A;
 3        for k := 1 to n do
 4            for i := 1 to n do
 5                for j := 1 to n do
 6                    d_{i,j}^(k) := min(d_{i,j}^(k-1), d_{i,k}^(k-1) + d_{k,j}^(k-1))
 7    }
[Figure: partition of a matrix with 2-D block mapping (2d block-mapping.png)]

Where $A$ is the adjacency matrix, $n = |V|$ the number of nodes and $D$ the distance matrix. For a more detailed description of the sequential algorithm, see Floyd–Warshall algorithm.
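In runnable form the sequential algorithm is only a few lines. The Python sketch below (our own rendering; it updates a copy of $A$ in place, a standard and equivalent variant of keeping the matrices $D^{(k)}$) follows the pseudocode directly:

    def floyd_all_pairs_sp(A):
        """Sequential Floyd algorithm; A is the adjacency matrix with
        math.inf for missing edges. Returns the distance matrix D."""
        n = len(A)
        D = [row[:] for row in A]  # D^(0) := A
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    # Line 6 of the pseudocode: relax the path i -> k -> j.
                    if D[i][k] + D[k][j] < D[i][j]:
                        D[i][j] = D[i][k] + D[k][j]
        return D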

Parallelization

The basic idea for parallelizing the algorithm is to partition the matrix and split the computation between the processes. Each process is assigned to a specific part of the matrix. A common way to achieve this is 2-D block mapping. Here the matrix is partitioned into squares of the same size and each square gets assigned to a process. For an $n \times n$ matrix and $p$ processes each process calculates a $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ sized part of the distance matrix. For $p = n^2$ processes each would get assigned to exactly one element of the matrix. Because of that the parallelization only scales to a maximum of $n^2$ processes. In the following we denote by $p_{i,j}$ the process that is assigned to the square in the $i$-th row and the $j$-th column.
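Which square belongs to which process is again a short index calculation. The hypothetical helper below (assuming $p$ is a perfect square and $n$ a multiple of $\sqrt{p}$) returns the index ranges of process $p_{i,j}$, using 0-based indices:

    import math

    def block_of(i, j, n, p):
        """Row and column index ranges of the (n/sqrt(p)) x (n/sqrt(p))
        square assigned to process p_{i,j} under 2-D block mapping."""
        q = math.isqrt(p)   # processes per row/column of the process grid
        b = n // q          # side length of one square block
        return (range(i * b, (i + 1) * b),
                range(j * b, (j + 1) * b))

    rows, cols = block_of(1, 0, n=8, p=16)  # p_{1,0} on a 4x4 process grid
    print(list(rows), list(cols))           # [2, 3] [0, 1]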

As the calculation of the parts of the distance matrix depends on results from other parts, the processes have to communicate with each other and exchange data. In the following we denote by $d_{i,j}^{(k)}$ the element in the $i$-th row and $j$-th column of the distance matrix after the $k$-th iteration. To calculate $d_{i,j}^{(k)}$ we need the elements $d_{i,j}^{(k-1)}$, $d_{i,k}^{(k-1)}$ and $d_{k,j}^{(k-1)}$, as specified in line 6 of the algorithm. $d_{i,j}^{(k-1)}$ is available to each process, as it was calculated by the process itself in the previous iteration.

Additionally, each process needs a part of the $k$-th row and the $k$-th column of the $D^{(k-1)}$ matrix. The element $d_{i,k}^{(k-1)}$ is held by a process in the same row, and the element $d_{k,j}^{(k-1)}$ is held by a process in the same column as the process that wants to compute $d_{i,j}^{(k)}$. Each process that has calculated a part of the $k$-th row of the matrix has to send this part to all processes in its column. Each process that has calculated a part of the $k$-th column of the matrix has to send this part to all processes in its row. All these processes have to perform a one-to-all broadcast operation along the row or the column. The data dependencies are illustrated in the image below.

For the 2-D block mapping we have to modify the algorithm as follows:

 1    func Floyd_All_Pairs_Parallel(D^(0)) {
 2        for k := 1 to n do {
 3            Each process p_{i,j} that has a segment of the k-th row of D^(k-1) broadcasts it to the p_{*,j} processes;
 4            Each process p_{i,j} that has a segment of the k-th column of D^(k-1) broadcasts it to the p_{i,*} processes;
 5            Each process waits to receive the needed segments;
 6            Each process computes its part of the D^(k) matrix;
 7        }
 8    }
[Figure: data dependencies in the Floyd algorithm (Data-denpendencies-floyd.png)]

In line 5 of the algorithm we have a synchronisation step that ensures that all processes have the data necessary to compute the next iteration. To improve the runtime of the algorithm we can remove the synchronisation step without affecting the correctness of the algorithm. To achieve that, each process starts the computation as soon as it has the data necessary to compute its part of the matrix. This version of the algorithm is called pipelined 2-D block mapping.
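To make the communication pattern of the synchronous version concrete, here is an mpi4py sketch specialized to the extreme case $p = n^2$, where every process owns exactly one matrix element (both the library choice and this one-element-per-process simplification are ours). The row and column communicators created with Comm.Split carry the broadcasts of lines 3 and 4, and the blocking bcast also plays the role of the synchronisation in line 5:

    from mpi4py import MPI
    import math

    def floyd_parallel_one_element(A):
        """2-D block-mapped Floyd with p = n*n processes: process p_{i,j}
        owns the single element d_{i,j}. Launch with p = n*n MPI ranks."""
        comm = MPI.COMM_WORLD
        n = math.isqrt(comm.Get_size())
        i, j = divmod(comm.Get_rank(), n)   # this process is p_{i,j}
        # Communicators along process row i and process column j.
        row_comm = comm.Split(color=i, key=j)
        col_comm = comm.Split(color=j, key=i)
        # For simplicity every rank reads its element from the full input;
        # in a real distributed setting only d_{i,j} would be stored locally.
        d_ij = A[i][j]                      # d_{i,j}^(0)
        for k in range(n):
            # Lines 3/4: the owner of d_{i,k} broadcasts along row i,
            # the owner of d_{k,j} broadcasts along column j.
            d_ik = row_comm.bcast(d_ij if j == k else None, root=k)
            d_kj = col_comm.bcast(d_ij if i == k else None, root=k)
            # Line 6: update the locally owned element.
            d_ij = min(d_ij, d_ik + d_kj)
        return d_ij

With larger blocks the same pattern applies, except that each process broadcasts whole segments of rows and columns instead of single values.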

Runtime

The runtime of the sequential algorithm is determined by the triple nested for loop. The computation in line 6 can be done in constant time ($O(1)$). Therefore, the runtime of the sequential algorithm is $O(n^3)$.

2-D block mapping

The runtime of the parallelized algorithm consists of two parts: the time for the computation, and the time for communication and data transfer between the processes.

As there is no additional computation in the algorithm and the computation is split equally among the $p$ processes, we have a runtime of $O(\frac{n^3}{p})$ for the computational part.

In each iteration of the algorithm there is a one-to-all broadcast operation performed along the row and the column of the processes. There are $\frac{n}{\sqrt{p}}$ elements broadcast. Afterwards there is a synchronisation step performed. How much time these operations take is highly dependent on the architecture of the parallel system used. Therefore, the time needed for communication and data transfer in the algorithm is $T_{\text{comm}} = n(T_{\text{synch}} + T_{\text{broadcast}})$.

For the whole algorithm we have the following runtime:

$T = O\left(\frac{n^3}{p}\right) + n(T_{\text{synch}} + T_{\text{broadcast}})$

Pipelined 2-D block mapping

For the runtime of the data transfer between the processes in the pipelined version of the algorithm we assume that a process can transfer $k$ elements to a neighbouring process in $O(k)$ time. In every step there are $\frac{n}{\sqrt{p}}$ elements of a row or a column sent to a neighbouring process. Such a step takes $O(\frac{n}{\sqrt{p}})$ time. After $\sqrt{p}$ steps the relevant data of the first row and column arrive at process $p_{\sqrt{p},\sqrt{p}}$ (in $O(n)$ time).

The values of successive rows and columns follow after $O(\frac{n^2}{p})$ time in a pipelined mode. Process $p_{\sqrt{p},\sqrt{p}}$ finishes its last computation after $O(\frac{n^3}{p}) + O(n)$ time. Therefore, the additional time needed for communication in the pipelined version is $O(n)$.

The overall runtime for the pipelined version of the algorithm is:

$T = O\left(\frac{n^3}{p}\right) + O(n)$

