In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task has begun, all network nodes are either unaware which node will serve as the "leader" (or coordinator) of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.
The network nodes communicate among themselves in order to decide which of them will get into the "leader" state. For that, they need some method in order to break the symmetry among them. For example, if each node has unique and comparable identities, then the nodes can compare their identities, and decide that the node with the highest identity is the leader.
The definition of this problem is often attributed to LeLann, who formalized it as a method to create a new token in a token ring network in which the token has been lost.
Leader election algorithms are designed to be economical in terms of total bytes transmitted, and time. The algorithm suggested by Gallager, Humblet, and Spira [1] for general undirected graphs has had a strong impact on the design of distributed algorithms in general, and won the Dijkstra Prize for an influential paper in distributed computing.
Many other algorithms have been suggested for different kinds of network graphs, such as undirected rings, unidirectional rings, complete graphs, grids, directed Euler graphs, and others. A general method that decouples the issue of the graph family from the design of the leader election algorithm was suggested by Korach, Kutten, and Moran. [2]
The problem of leader election is for each processor eventually to decide whether it is a leader or not, subject to the constraint that exactly one processor decides that it is the leader. [3] An algorithm solves the leader election problem if:
A valid leader election algorithm must meet the following conditions: [4]
An algorithm for leader election may vary in the following aspects: [5]
A ring network is a connected-graph topology in which each node is exactly connected to two other nodes, i.e., for a graph with n nodes, there are exactly n edges connecting the nodes. A ring can be unidirectional, which means processors only communicate in one direction (a node could only send messages to the left or only send messages to the right), or bidirectional, meaning processors may transmit and receive messages in both directions (a node could send messages to the left and right).
A ring is said to be anonymous if every processor is identical. More formally, the system has the same state machine for every processor. [3] There is no deterministic algorithm to elect a leader in anonymous rings, even when the size of the network is known to the processes. [3] [6] This is due to the fact that there is no possibility of breaking symmetry in an anonymous ring if all processes run at the same speed. The state of processors after some steps only depends on the initial state of neighbouring nodes. So, because their states are identical and execute the same procedures, in every round the same messages are sent by each processor. Therefore, each processor state also changes identically and as a result if one processor is elected as a leader, so are all the others.
For simplicity, here is a proof in anonymous synchronous rings. It is a proof by contradiction. Consider an anonymous ring R with size n>1. Assume there exists an algorithm "A" to solve leader election in this anonymous ring R. [3]
Proof. Proof by induction on .
Base case:: all the processes are in the initial state, so all the processes are identical.
Induction hypothesis: assume the lemma is true for rounds.
Inductive step: in round , every process send the same message to the right and send the same message to the left. Since all the processes are in the same state after round , in round k, every process will receive the message from the left edge, and will receive the message from the right edge. Since all processes are receiving the same messages in round , they are in the same state after round .
The above lemma contradicts the fact that after some finite number of rounds in an execution of A, one process entered the elected state and other processes entered the non-elected state.
A common approach to solve the problem of leader election in anonymous rings is the use of probabilistic algorithms. In such approaches, generally processors assume some identities based on a probabilistic function and communicate it to the rest of the network. At the end, through the application of an algorithm, a leader is selected (with high probability).
Since there is no algorithm for anonymous rings (proved above), the asynchronous rings would be considered as asynchronous non-anonymous rings. In non-anonymous rings, each process has a unique , and they don't know the size of the ring. Leader election in asynchronous rings can be solved by some algorithm with using messages or messages.
In the algorithm, every process sends a message with its to the left edge. Then waits until a message from the right edge. If the in the message is greater than its own , then forwards the message to the left edge; else ignore the message, and does nothing. If the in the message is equal to its own , then sends a message to the left announcing myself is elected. Other processes forward the announcement to the left and turn themselves to non-elected. It is clear that the upper bound is for this algorithm.
In the algorithm, it is running in phases. On the th phase, a process will determine whether it is the winner among the left side and right side neighbors. If it is a winner, then the process can go to next phase. In phase , each process needs to determine itself is a winner or not by sending a message with its to the left and right neighbors (neighbor do not forward the message). The neighbor replies an only if the in the message is larger than the neighbor's , else replies an . If receives two s, one from the left, one from the right, then is the winner in phase . In phase , the winners in phase need to send a message with its to the left and right neighbors. If the neighbors in the path receive the in the message larger than their , then forward the message to the next neighbor, otherwise reply an . If the th neighbor receives the larger than its , then sends back an , otherwise replies an . If the process receives two s, then it is the winner in phase . In the last phase, the final winner will receive its own in the message, then terminates and send termination message to the other processes. In the worst case, each phase there are at most winners, where is the phase number. There are phases in total. Each winner sends in the order of messages in each phase. So, the messages complexity is .
In Attiya and Welch's Distributed Computing book, [3] they described a non-uniform algorithm using messages in synchronous ring with known ring size . The algorithm is operating in phases, each phase has rounds, each round is one time unit. In phase , if there is a process with , then process sends termination message to the other processes (sending termination messages cost rounds). Else, go to the next phase. The algorithm will check if there is a phase number equals to a process , then does the same steps as phase . At the end of the execution, the minimal will be elected as the leader. It used exactly messages and rounds.
Itai and Rodeh [7] introduced an algorithm for a unidirectional ring with synchronized processes. They assume the size of the ring (number of nodes) is known to the processes. For a ring of size n, a≤n processors are active. Each processor decides with probability of a^(-1) whether to become a candidate. At the end of each phase, each processor calculates the number of candidates c and if it is equal to 1, it becomes the leader. To determine the value of c, each candidate sends a token (pebble) at the start of the phase which is passed around the ring, returning after exactly n time units to its sender. Every processor determines c by counting the number of pebbles which passed through. This algorithm achieves leader election with expected message complexity of O(nlogn). A similar approach is also used in which a time-out mechanism is employed to detect deadlocks in the system. [8] There are also algorithms for rings of special sizes such as prime size [9] [10] and odd size. [11]
In typical approaches to leader election, the size of the ring is assumed to be known to the processes. In the case of anonymous rings, without using an external entity, it is not possible to elect a leader. Even assuming an algorithm exists, the leader could not estimate the size of the ring. i.e. in any anonymous ring, there is a positive probability that an algorithm computes a wrong ring size. [12] To overcome this problem, Fisher and Jiang used a so-called leader oracle Ω? that each processor can ask whether there is a unique leader. They show that from some point upward, it is guaranteed to return the same answer to all processes. [13]
In one of the early works, Chang and Roberts [14] proposed a uniform algorithm in which a processor with the highest ID is selected as the leader. Each processor sends its ID in a clockwise direction. A processor receives a message and compares the ID with its own. If the ID is bigger then the processor passes it through, otherwise it discards the message. The authors show that this algorithm uses messages in the worst case and in the average case.
Hirschberg and Sinclair [15] improved this algorithm with message complexity by introducing a bidirectional message-passing scheme.
The mesh is another popular form of network topology, especially in parallel systems, redundant memory systems and interconnection networks. [16]
In a mesh structure, nodes are either corner (only two neighbours), border (only three neighbours) or interior (with four neighbours). The number of edges in a mesh of size a x b is m=2ab-a-b.
A typical algorithm to solve the leader election in an unoriented mesh is to only elect one of the four corner nodes as the leader. Since the corner nodes might not be aware of the state of other processes, the algorithm should first wake up the corner nodes. A leader can be elected as follows. [17]
The message complexity is at most , and if the mesh is square-shaped, .
An oriented mesh is a special case where port numbers are compass labels, i.e. north, south, east and west. Leader election in an oriented mesh is trivial. We only need to nominate a corner, e.g. "north" and "east" and make sure that node knows it is a leader.
A special case of mesh architecture is a torus which is a mesh with "wrap-around". In this structure, every node has exactly 4 connecting edges. One approach to elect a leader in such a structure is known as electoral stages. Similar to procedures in ring structures, this method in each stage eliminates potential candidates until eventually one candidate node is left. This node becomes the leader and then notifies all other processes of termination. [16] This approach can be used to achieve a complexity of O(n). There also more practical approaches introduced for dealing with presence of faulty links in the network. [18] [19]
A Hypercube is a network consisting of nodes, each with degree of and edges. A similar electoral stages as before can be used to solve the problem of leader election. In each stage two nodes (called duelists) compete and the winner is promoted to the next stage. This means in each stage only half of the duelists enter the next stage. This procedure continues until only one duelist is left, and it becomes the leader. Once selected, it notifies all other processes. This algorithm requires messages. In the case of unoriented hypercubes, a similar approach can be used but with a higher message complexity of . [16]
Complete networks are structures in which all processes are connected to one another, i.e., the degree of each node is n-1, n being the size of the network. An optimal solution with O(n) message and space complexity is known. [20] In this algorithm, processes have the following states:
There is an assumption that although a node does not know the total set of nodes in the system, it is required that in this arrangement every node knows the identifier of its single successor, which is called neighbor, [20] and every node is known by another one. [21]
All processors initially start in a passive state until they are woken up. Once the nodes are awake, they are candidates to become the leader. Based on a priority scheme, candidate nodes collaborate in the virtual ring. At some point, candidates become aware of the identity of candidates that precede them in the ring. The higher priority candidates ask the lower ones about their predecessors. The candidates with lower priority become dummies after replying to the candidates with higher priority. Based on this scheme, the highest priority candidate eventually knows that all nodes in the system are dummies except itself, at which point it knows it is the leader.
The above algorithm is not correct — it needs further improvement. [21]
As the name implies, these algorithms are designed to be used in any process network without prior knowledge of the network's topology or properties (such as size). [16]
The Shout protocol builds a spanning tree on a generic graph and elects its root as leader. The algorithm has a total cost linear in the edges cardinality.
This technique is similar to finding a Minimum Spanning Tree (MST) in which the root of the tree becomes the leader. The idea is that individual nodes "merge" with each other to form bigger structures. The result of this algorithm is a tree (a graph with no cycles) whose root is the leader of the entire system. The cost of the mega-merger method is , where m is the number of edges and n is the number of nodes.
Yo-yo (algorithm) is a minimum finding algorithm consisting of two parts: a preprocessing phase and a series of iterations. [16] In the first phase or setup, each node exchanges its id with all its neighbours and based on the value it orients its incident edges. For instance, if node x has a smaller id than y, x orients towards y. If a node has a smaller id than all its neighbours it becomes a source. In contrast, a node with all inward edges (i.e., with id larger than all of its neighbours) is a sink. All other nodes are internal nodes.
Once all the edges are oriented, the iteration phase starts. Each iteration is an electoral stage in which some candidates will be removed. Each iteration has two phases: YO- and –YO. In this phase sources start the process to propagate to each sink the smallest values of the sources connected to that sink.
Yo-
-yo
After the final stage, any source who receives a NO is no longer a source and becomes a sink. An additional stage, pruning, also is introduced to remove the nodes that are useless, i.e. their existence has no impact on the next iterations.
This method has a total cost of O(mlogn) messages. Its real message complexity including pruning is an open research problem and is unknown.
This section needs expansionwith: examples and additional citations. You can help by adding to it. (October 2014) |
In radio network protocols, leader election is often used as a first step to approach more advanced communication primitives, such as message gathering or broadcasts. [22] The very nature of wireless networks induces collisions when adjacent nodes transmit at the same time; electing a leader allows to better coordinate this process. While the diameter D of a network is a natural lower bound for the time needed to elect a leader, upper and lower bounds for the leader election problem depend on the specific radio model studied.
In radio networks, the n nodes may in every round choose to either transmit or receive a message. If no collision detection is available, then a node cannot distinguish between silence or receiving more than one message at a time. Should collision detection be available, then a node may detect more than one incoming message at the same time, even though the messages itself cannot be decoded in that case. In the beeping model, nodes can only distinguish between silence or at least one message via carrier sensing.
Known runtimes for single-hop networks range from a constant (expected with collision detection) to O(n log n) rounds (deterministic and no collision detection). In multi-hop networks, known runtimes differ from roughly O((D+ log n)(log2 log n)) rounds (with high probability in the beeping model), O(D log n) (deterministic in the beeping model), O(n) (deterministic with collision detection) to O(n log3/2 n (log log n)0.5) rounds (deterministic and no collision detection).
A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.
Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a weighted graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.
A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.
In computing, Chord is a protocol and algorithm for a peer-to-peer distributed hash table. A distributed hash table stores key-value pairs by assigning keys to different computers ; a node will store the values for all the keys for which it is responsible. Chord specifies how keys are assigned to nodes, and how a node can discover the value for a given key by first locating the node responsible for that key.
In graph theory, graph coloring is a special case of graph labeling; it is an assignment of labels traditionally called "colors" to elements of a graph subject to certain constraints. In its simplest form, it is a way of coloring the vertices of a graph such that no two adjacent vertices are of the same color; this is called a vertex coloring. Similarly, an edge coloring assigns a color to each edge so that no two adjacent edges are of the same color, and a face coloring of a planar graph assigns a color to each face or region so that no two faces that share a boundary have the same color.
Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.
In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge (u,v) from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications, especially in ranking problems such as feedback arc set. Topological sorting is also possible when the DAG has disconnected components.
In graph theory, a maximal independent set (MIS) or maximal stable set is an independent set that is not a subset of any other independent set. In other words, there is no vertex outside the independent set that may join it because it is maximal with respect to the independent set property.
A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts. Real-world applications often requiring consensus include cloud computing, clock synchronization, PageRank, opinion formation, smart power grids, state estimation, control of UAVs, load balancing, blockchain, and others.
The distributed minimum spanning tree (MST) problem involves the construction of a minimum spanning tree by a distributed algorithm, in a network where nodes communicate by message passing. It is radically different from the classical sequential problem, although the most basic approach resembles Borůvka's algorithm. One important application of this problem is to find a tree that can be used for broadcasting. In particular, if the cost for a message to pass through an edge in a graph is significant, an MST can minimize the total cost for a source process to communicate with all the other processes in the network.
In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.
In peer-to-peer networks, Koorde is a distributed hash table (DHT) system based on the Chord DHT and the De Bruijn graph. Inheriting the simplicity of Chord, Koorde meets O(log n) hops per node, and hops per lookup request with O(log n) neighbors per node.
Collective operations are building blocks for interaction patterns, that are often used in SPMD algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations.
Skip graphs are a kind of distributed data structure based on skip lists. They were invented in 2003 by James Aspnes and Gauri Shah. A nearly identical data structure called SkipNet was independently invented by Nicholas Harvey, Michael Jones, Stefan Saroiu, Marvin Theimer and Alec Wolman, also in 2003.
In graph theory, Yen's algorithm computes single-source K-shortest loopless paths for a graph with non-negative edge cost. The algorithm was published by Jin Y. Yen in 1971 and employs any shortest path algorithm to find the best path, then proceeds to find K − 1 deviations of the best path.
Mega-merger is a distributed algorithm aimed at solving the election problem in generic connected undirected graph.
Yo-Yo is a distributed algorithm aimed at minimum finding and leader election in generic connected undirected graph. Unlike Mega-Merger it has a trivial termination and cost analysis.
A central problem in algorithmic graph theory is the shortest path problem. Hereby, the problem of finding the shortest path between every pair of nodes is known as all-pair-shortest-paths (APSP) problem. As sequential algorithms for this problem often yield long runtimes, parallelization has shown to be beneficial in this field. In this article two efficient algorithms solving this problem are introduced.
Broadcast is a collective communication primitive in parallel programming to distribute programming instructions or data to nodes in a cluster. It is the reverse operation of reduction. The broadcast operation is widely used in parallel algorithms, such as matrix-vector multiplication, Gaussian elimination and shortest paths.
In parallel computing, all-to-all is a collective operation, where each processor sends an individual message to every other processor.
{{cite journal}}
: CS1 maint: multiple names: authors list (link){{cite journal}}
: CS1 maint: multiple names: authors list (link){{cite book}}
: |journal=
ignored (help)