Atomic broadcast

In fault-tolerant distributed computing, an atomic broadcast or total order broadcast is a broadcast where all correct processes in a system of multiple processes receive the same set of messages in the same order; that is, the same sequence of messages. [1] [2] The broadcast is termed "atomic" because it either eventually completes correctly at all participants, or all participants abort without side effects. Atomic broadcasts are an important distributed computing primitive.

Properties

The following properties are usually required from an atomic broadcast protocol:

  1. Validity: if a correct participant broadcasts a message, then all correct participants will eventually receive it.
  2. Uniform Agreement: if one correct participant receives a message, then all correct participants will eventually receive that message.
  3. Uniform Integrity: a message is received by each participant at most once, and only if it was previously broadcast.
  4. Uniform Total Order: the messages are totally ordered in the mathematical sense; that is, if any correct participant receives message 1 first and message 2 second, then every other correct participant must receive message 1 before message 2.

Rodrigues and Raynal [3] and Schiper et al. [4] define the integrity and validity properties of atomic broadcast slightly differently.

Note that total order is not equivalent to FIFO order, which requires that if a process sent message 1 before it sent message 2, then all participants must receive message 1 before receiving message 2. It is also not equivalent to "causal order", where if message 2 "depends on" or "occurs after" message 1, then all participants must receive message 2 after receiving message 1. Although total order is a strong and useful condition, it requires only that all participants receive the messages in the same order; it places no other constraints on that order. [5]
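
To make the distinction concrete, the following Python sketch (the Msg type, the delivery logs, and all names in it are illustrative, not drawn from any particular system) checks whether a collection of per-process delivery logs is totally ordered and whether a single log is FIFO ordered. In the example, a sender's second message is delivered before its first at every process, so the order is total but not FIFO.

    # Illustrative sketch: total order vs. FIFO order over recorded
    # delivery logs. The Msg type and the logs below are hypothetical.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Msg:
        sender: str   # process that broadcast the message
        seq: int      # sender-local send order (1, 2, ...)

    def is_total_order(logs):
        """True if all processes delivered the same sequence of messages."""
        sequences = list(logs.values())
        return all(s == sequences[0] for s in sequences)

    def is_fifo(log):
        """True if each sender's messages appear in the order they were sent."""
        last = {}
        for m in log:
            if m.seq <= last.get(m.sender, 0):
                return False
            last[m.sender] = m.seq
        return True

    # Process "p" sent m1 before m2, yet every process delivers m2 first:
    # the order is total (identical everywhere) but violates FIFO order.
    m1, m2 = Msg("p", 1), Msg("p", 2)
    logs = {"p": [m2, m1], "q": [m2, m1], "r": [m2, m1]}
    assert is_total_order(logs)
    assert not all(is_fifo(log) for log in logs.values())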

Fault tolerance

Designing an algorithm for atomic broadcasts is relatively easy if it can be assumed that computers will not fail. For example, if there are no failures, atomic broadcast can be achieved simply by having all participants communicate with one "leader" which determines the order of the messages, with the other participants following the leader.
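
For illustration, the following sketch (hypothetical; it assumes no failures and invents Sequencer and Process classes) shows the follow-the-leader idea: every message is routed through one sequencer that assigns a global position, so all participants receive the same sequence.

    # Hypothetical, failure-free sketch of the follow-the-leader idea:
    # every process sends its message to a single sequencer ("leader"),
    # which assigns a global position and relays it to all participants.
    import queue

    class Sequencer:
        def __init__(self):
            self.followers = []
            self.next_slot = 0

        def broadcast(self, sender, payload):
            # Only the leader assigns positions, so every follower
            # receives the same sequence of (slot, sender, payload) tuples.
            slot, self.next_slot = self.next_slot, self.next_slot + 1
            for p in self.followers:
                p.inbox.put((slot, sender, payload))

    class Process:
        def __init__(self, name, sequencer):
            self.name, self.inbox = name, queue.Queue()
            self.sequencer = sequencer
            sequencer.followers.append(self)

        def atomic_broadcast(self, payload):
            self.sequencer.broadcast(self.name, payload)

        def delivered(self):
            out = []
            while not self.inbox.empty():
                out.append(self.inbox.get())
            return out

    leader = Sequencer()
    a, b = Process("a", leader), Process("b", leader)
    a.atomic_broadcast("x")
    b.atomic_broadcast("y")
    assert a.delivered() == b.delivered()   # same messages, same order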

However, real computers are faulty; they fail and recover from failure at unpredictable, possibly inopportune, times. For example, in the follow-the-leader algorithm, what if the leader fails at the wrong time? In such an environment achieving atomic broadcasts is difficult. [1] A number of protocols have been proposed for performing atomic broadcast, under various assumptions about the network, failure models, availability of hardware support for multicast, and so forth. [2]

Equivalent to consensus

In order for the conditions for atomic broadcast to be satisfied, the participants must effectively "agree" on the order of receipt of the messages. Participants that recover from failure after the other participants have "agreed" on an order and begun receiving messages must be able to learn the agreed order and comply with it. Such considerations indicate that in systems with crash failures, atomic broadcast and consensus are equivalent problems. [6]

A value can be proposed by a process for consensus by atomically broadcasting it, and a process can decide a value by selecting the value of the first message which it atomically receives. Thus, consensus can be reduced to atomic broadcast.
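
A minimal sketch of this direction of the reduction, in which the common delivery order produced by an assumed atomic broadcast primitive is represented simply as a list:

    # Consensus from atomic broadcast (sketch): each process proposes by
    # atomically broadcasting its value, then decides on the first value
    # it delivers. `delivery_order` stands in for the output of an
    # assumed atomic broadcast primitive.
    def decide(delivery_order):
        # Atomic broadcast gives every correct process the same delivery
        # order, so every process decides on the same (first) value.
        return delivery_order[0]

    # The same sequence is delivered at every process, hence agreement.
    delivery_order = ["value from p2", "value from p1", "value from p3"]
    assert decide(delivery_order) == "value from p2"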

Conversely, a group of participants can atomically broadcast messages by achieving consensus regarding the first message to be received, followed by achieving consensus on the next message, and so forth until all the messages have been received. Thus, atomic broadcast reduces to consensus. This was demonstrated more formally and in greater detail by Xavier Défago, et al. [2]
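
The converse reduction can be sketched as follows, with a deterministic placeholder standing in for a real black-box consensus primitive such as the Chandra–Toueg algorithm:

    # Atomic broadcast from repeated consensus (sketch). `consensus` is a
    # placeholder for a real black-box consensus primitive; it is made
    # deterministic here only so the sketch runs on its own.
    def consensus(candidates):
        # A real protocol would let the processes agree on one candidate;
        # determinism here mimics "all processes decide the same value".
        return sorted(candidates)[0]

    def delivery_order(pending):
        """Agree on the messages one delivery slot at a time."""
        delivered = []
        while len(delivered) < len(pending):
            remaining = [m for m in pending if m not in delivered]
            delivered.append(consensus(remaining))
        return delivered

    # Processes holding the same pending messages compute the same
    # delivery order, because each consensus instance decides identically.
    assert delivery_order({"m3", "m1", "m2"}) == ["m1", "m2", "m3"]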

A fundamental result in distributed computing is that achieving consensus in asynchronous systems in which even one crash failure can occur is impossible in the most general case. This was shown in 1985 by Michael J. Fischer, Nancy Lynch, and Mike Paterson, and is sometimes called the FLP result. [7] Since consensus and atomic broadcast are equivalent, FLP applies also to atomic broadcast. [5] The FLP result does not prohibit implementing atomic broadcast in practice, but doing so requires assumptions beyond the purely asynchronous model of FLP, for example about processor and communication timings.

Algorithms

The Chandra–Toueg algorithm [6] is a consensus-based solution to atomic broadcast. Another solution has been put forward by Rodrigues and Raynal. [3]

The ZooKeeper Atomic Broadcast (ZAB) protocol is the basic building block for Apache ZooKeeper, a fault-tolerant distributed coordination service which underpins Hadoop and many other important distributed systems. [8] [9]

Ken Birman has proposed the virtual synchrony execution model for distributed systems, the idea of which is that all processes observe the same events in the same order. A total ordering of the messages being received, as in atomic broadcast, is one (though not the only) method for attaining virtually synchronous message receipt.

References

  1. Kshemkalyani, Ajay; Singhal, Mukesh (2008). Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press. pp. 583–585. ISBN 9781139470315.
  2. Défago, Xavier; Schiper, André; Urbán, Péter (2004). "Total order broadcast and multicast algorithms" (PDF). ACM Computing Surveys. 36 (4): 372–421. doi:10.1145/1041680.1041682.
  3. Rodrigues, L.; Raynal, M. (2000). "Atomic Broadcast in Asynchronous Crash-Recovery Distributed Systems". ICDCS '00: Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000).
  4. Ekwall, R.; Schiper, A. (2006). "Solving Atomic Broadcast with Indirect Consensus" (PDF). International Conference on Dependable Systems and Networks (DSN'06). pp. 156–165. doi:10.1109/dsn.2006.65. ISBN 0-7695-2607-1.
  5. Dermot Kelly. "Group Communication".
  6. Chandra, Tushar Deepak; Toueg, Sam (1996). "Unreliable failure detectors for reliable distributed systems". Journal of the ACM. 43 (2): 225–267. doi:10.1145/226643.226647.
  7. Fischer, Michael J.; Lynch, Nancy A.; Paterson, Michael S. (1985). "Impossibility of Distributed Consensus with One Faulty Process" (PDF). Journal of the ACM. 32 (2): 374–382. doi:10.1145/3149.214121.
  8. Junqueira, Flavio P.; Reed, Benjamin C.; Serafini, Marco (2011). "Zab: High-performance broadcast for primary-backup systems". 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN). pp. 245–256. doi:10.1109/DSN.2011.5958223. ISBN 978-1-4244-9233-6. S2CID 206611670.
  9. André Medeiros (March 20, 2012). "ZooKeeper's atomic broadcast protocol: Theory and practice" (PDF). Helsinki University of Technology - Laboratory of Theoretical Computer Science.