State machine replication


In computer science, state machine replication (SMR) or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols. [1]


Problem definition

Distributed service

In terms of clients and services, each service comprises one or more servers and exports operations that clients invoke by making requests. Although using a single, centralized server is the simplest way to implement a service, the resulting service can only be as fault tolerant as the processor executing that server. If this level of fault tolerance is unacceptable, then multiple servers that fail independently can be used. Usually, replicas of a single server are executed on separate processors of a distributed system, and protocols are used to coordinate client interactions with these replicas.

State machine

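For the subsequent discussion, a State Machine is defined as the following tuple of values [2] (see also Mealy machine and Moore machine): a set of States, a set of Inputs, a set of Outputs, a transition function (Input × State → State), an output function (Input × State → Output), and a distinguished State called Start.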

A State Machine begins at the State labeled Start. Each Input received is passed through the transition and output function to produce a new State and an Output. The State is held stable until a new Input is received, while the Output is communicated to the appropriate receiver.

This discussion requires a State Machine to be deterministic: multiple copies of the same State Machine that begin in the Start state and receive the same Inputs in the same order will arrive at the same State, having generated the same Outputs.
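The determinism requirement can be illustrated with a minimal sketch. The StateMachine class, the make_account helper, and the running-balance example are hypothetical names chosen for illustration; they are not taken from the cited literature.

```python
class StateMachine:
    """Deterministic state machine: same Inputs in the same order give the same result."""

    def __init__(self, start, transition, output):
        self.state = start             # the distinguished Start state
        self._transition = transition  # (input, state) -> new state
        self._output = output          # (input, state) -> output

    def step(self, inp):
        out = self._output(inp, self.state)
        self.state = self._transition(inp, self.state)
        return out


def make_account():
    # Hypothetical example machine: the State is a balance, Inputs are signed amounts.
    return StateMachine(
        start=0,
        transition=lambda amount, balance: balance + amount,
        output=lambda amount, balance: balance + amount,
    )


inputs = [5, -2, 7]
copy_a, copy_b = make_account(), make_account()
outputs_a = [copy_a.step(i) for i in inputs]
outputs_b = [copy_b.step(i) for i in inputs]
```

Both copies end in the same State and produce the same sequence of Outputs, because neither function consults anything other than the Input and the current State (no clocks, random numbers, or other hidden inputs).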

Typically, systems based on State Machine Replication voluntarily restrict their implementations to use finite-state machines to simplify error recovery.

Fault Tolerance

Determinism is an ideal characteristic for providing fault-tolerance. Intuitively, if multiple copies of a system exist, a fault in one would be noticeable as a difference in the State or Output from the others.

A little deduction shows the minimum number of copies needed for fault-tolerance is three: one which has a fault, and two others against which its State and Output are compared. Two copies are not enough, as there is no way to tell which copy is the faulty one.

Further deduction shows a three-copy system can support at most one failure (after which it must repair or replace the faulty copy). If more than one of the copies were to fail, all three States and Outputs might differ, and there would be no way to choose which is the correct one.

In general, a system which supports F failures must have 2F+1 copies (also called replicas). [3] The extra copies are used as evidence to decide which of the copies are correct and which are faulty. Special cases can improve these bounds. [4]

All of this deduction presupposes that replicas experience only random, independent faults such as memory errors or hard-drive crashes. Failures caused by replicas which attempt to lie, deceive, or collude can also be handled by the State Machine Approach, with isolated changes.

Failed replicas are not required to stop; they may continue operating, including generating spurious or incorrect Outputs.

Special Case: Fail-Stop

Theoretically, if a failed replica is guaranteed to stop without generating outputs, only F+1 replicas are required, and clients may accept the first output generated by the system. No existing systems achieve this limit, but it is often used when analyzing systems built on top of a fault-tolerant layer (since the fault-tolerant layer provides fail-stop semantics to all layers above it).

Special Case: Byzantine Failure

Faults where a replica sends different values in different directions (for instance, the correct Output to some of its fellow replicas and incorrect Outputs to others) are called Byzantine Failures. [5] Byzantine failures may be random, spurious faults, or malicious, intelligent attacks. 2F+1 replicas with non-cryptographic hashes suffice to survive all non-malicious Byzantine failures (with high probability). Malicious attacks require cryptographic primitives to achieve 2F+1 (using message signatures); alternatively, non-cryptographic techniques can be applied, but the number of replicas must be increased to 3F+1. [5]
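The replica counts quoted for the different fault models (F+1 for fail-stop, 2F+1 for random independent faults or signed Byzantine messages, 3F+1 for Byzantine faults without signatures) can be collected into a small calculator. This is only a sketch; the model names and function names are assumptions made for the example.

```python
# Replicas required to survive f failures under each fault model (names are illustrative).
REPLICAS_REQUIRED = {
    "fail_stop": lambda f: f + 1,
    "independent": lambda f: 2 * f + 1,
    "byzantine_signed": lambda f: 2 * f + 1,
    "byzantine_unsigned": lambda f: 3 * f + 1,
}


def replicas_needed(model, f):
    """Minimum number of replicas needed to survive f failures."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return REPLICAS_REQUIRED[model](f)


def faults_tolerated(model, n):
    """Largest f such that n replicas are enough under the given model."""
    if n < 1:
        raise ValueError("at least one replica is required")
    f = 0
    while replicas_needed(model, f + 1) <= n:
        f += 1
    return f
```

For example, a fourth replica adds no tolerance under the 2F+1 bound: both three and four replicas survive a single failure, and five are needed to survive two.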

The State Machine Approach

The preceding intuitive discussion implies a simple technique for implementing a fault-tolerant service in terms of a State Machine:

  1. Place copies of the State Machine on multiple, independent servers.
  2. Receive client requests, interpreted as Inputs to the State Machine.
  3. Choose an ordering for the Inputs.
  4. Execute Inputs in the chosen order on each server.
  5. Respond to clients with the Output from the State Machine.
  6. Monitor replicas for differences in State or Output.
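The six steps above can be sketched end to end in a single process. This is a toy model under stated assumptions: the Replica class and serve function are hypothetical, arrival order stands in for a real ordering protocol, and a faulty replica is modelled simply as one that reports a wrong Output.

```python
from collections import Counter


class Replica:
    """Hypothetical replica whose state machine keeps a running total."""

    def __init__(self, faulty=False):
        self.total = 0
        self.faulty = faulty

    def execute(self, amount):
        self.total += amount
        # A faulty replica is modelled as reporting an incorrect Output.
        return self.total + 1 if self.faulty else self.total


def serve(replicas, requests):
    ordered = list(requests)  # step 3: arrival order stands in for an ordering protocol
    responses, suspects = [], set()
    for inp in ordered:
        outputs = [r.execute(inp) for r in replicas]          # step 4
        value, votes = Counter(outputs).most_common(1)[0]
        has_majority = votes > len(replicas) // 2
        responses.append(value if has_majority else "FAIL")   # step 5
        if has_majority:                                      # step 6
            suspects.update(i for i, o in enumerate(outputs) if o != value)
    return responses, suspects


replicas = [Replica(), Replica(), Replica(faulty=True)]  # step 1
responses, suspects = serve(replicas, [5, -2, 7])        # step 2
```

With three replicas and one fault, the two correct replicas outvote the faulty one on every request, and the monitoring step singles out the replica at index 2.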

The remainder of this article develops the details of this technique.

The appendix contains discussion on typical extensions used in real-world systems such as Logging, Checkpoints, Reconfiguration, and State Transfer.

Ordering Inputs

The critical step in building a distributed system of State Machines is choosing an order for the Inputs to be processed. Since all non-faulty replicas will arrive at the same State and Output if given the same Inputs, it is imperative that the Inputs are submitted in an equivalent order at each replica. Many solutions have been proposed in the literature. [2] [6] [7] [8] [9]

A Visible Channel is a communication path between two entities actively participating in the system (such as clients and servers). Examples: client to server, server to server.

A Hidden Channel is a communication path which is not revealed to the system. Example: client-to-client channels are usually hidden, such as users communicating over a telephone, or a process writing files to disk which are read by another process.

When all communication paths are visible channels and no hidden channels exist, a partial global order (Causal Order) may be inferred from the pattern of communications. [8] [10] Causal Order may be derived independently by each server. Inputs to the State Machine may be executed in Causal Order, guaranteeing consistent State and Output for all non-faulty replicas.
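One well-known way to obtain timestamps consistent with causality is the logical clock of reference [10]. The sketch below illustrates that idea only; it is not claimed to be the mechanism used by any particular replication system, and the Process class and its method names are hypothetical.

```python
class Process:
    """Participant that timestamps events with a Lamport-style logical clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        # Sending is an event; the returned timestamp travels with the message.
        self.clock += 1
        return self.clock

    def receive(self, message_timestamp):
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.clock = max(self.clock, message_timestamp) + 1
        return self.clock


client, server = Process("client"), Process("server")
for _ in range(3):
    server.local_event()              # the server has already seen three events
sent_at = client.send()               # client clock becomes 1
received_at = server.receive(sent_at) # server clock becomes max(3, 1) + 1
```

Because every message travels over a visible channel and carries its timestamp, a send always receives a smaller timestamp than the matching receive. Hidden channels carry no timestamps, which is why this inference breaks down when they exist.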

In open systems, hidden channels are common and a weaker form of ordering must be used. An order of Inputs may be defined using a voting protocol whose results depend only on the visible channels.

The problem of voting for a single value by a group of independent entities is called Consensus. By extension, a series of values may be chosen by a series of consensus instances. This problem becomes difficult when the participants or their communication medium may experience failures. [3]

Inputs may be ordered by their position in the series of consensus instances (Consensus Order). [7] Consensus Order may be derived independently by each server. Inputs to the State Machine may be executed in Consensus Order, guaranteeing consistent State and Output for all non-faulty replicas.
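Executing in Consensus Order can be sketched as follows, assuming each consensus instance is numbered from 0 and that decisions may reach a replica out of order. The OrderedExecutor class is a hypothetical helper; the consensus protocol itself is outside the sketch.

```python
class OrderedExecutor:
    """Applies decided Inputs strictly in consensus-instance order."""

    def __init__(self, apply):
        self.apply = apply       # function that feeds one Input to the State Machine
        self.decided = {}        # instance number -> decided Input, not yet applied
        self.next_instance = 0   # lowest instance that has not been applied

    def on_decision(self, instance, value):
        if instance >= self.next_instance:   # ignore duplicates of applied instances
            self.decided[instance] = value
        outputs = []
        # Apply the longest contiguous run of decided instances.
        while self.next_instance in self.decided:
            outputs.append(self.apply(self.decided.pop(self.next_instance)))
            self.next_instance += 1
        return outputs


applied = []


def apply_input(value):
    applied.append(value)
    return value


executor = OrderedExecutor(apply_input)
first = executor.on_decision(1, "b")    # instance 0 is still unknown, so nothing runs
second = executor.on_decision(0, "a")   # now instances 0 and 1 run, in that order
third = executor.on_decision(2, "c")
```

A replica that learns of instance 1 before instance 0 simply waits, so every replica applies the Inputs in the same sequence regardless of the order in which decisions arrive.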

Optimizing Causal & Consensus Ordering

In some cases additional information is available (such as real-time clocks). In these cases, it is possible to achieve more efficient causal or consensus ordering for the Inputs, with a reduced number of messages, fewer message rounds, or smaller message sizes. See the references for details. [1] [4] [6] [11]

Further optimizations are available when the semantics of State Machine operations are accounted for (such as Read vs Write operations). See Generalized Paxos. [2] [12]

Sending Outputs

Client requests are interpreted as Inputs to the State Machine, and processed into Outputs in the appropriate order. Each replica will generate an Output independently. Non-faulty replicas will always produce the same Output. Before the client response can be sent, faulty Outputs must be filtered out. Typically, a majority of the Replicas will return the same Output, and this Output is sent as the response to the client.

System Failure

If there is no majority of replicas with the same Output, or if fewer than a majority of replicas return an Output, a system failure has occurred. The client response must be the unique Output: FAIL.
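The output filter, including the FAIL case, can be sketched as a single function. The function name filter_outputs and the use of None for a replica that did not answer are assumptions of this example; Outputs are assumed to be hashable values.

```python
from collections import Counter

FAIL = "FAIL"  # the unique Output reported on system failure


def filter_outputs(outputs, n_replicas):
    """Return the Output shared by a majority of all replicas, or FAIL.

    `outputs` holds one entry per replica; None marks a replica that returned nothing.
    """
    received = [o for o in outputs if o is not None]
    if not received:
        return FAIL
    value, votes = Counter(received).most_common(1)[0]
    # The majority is measured against all replicas, not only those that answered.
    return value if votes > n_replicas // 2 else FAIL
```

Counting against the total number of replicas rather than the number of answers is what makes a lone responder insufficient: one answer out of three is never a majority.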

Auditing and Failure Detection

The permanent, unplanned compromise of a replica is called a Failure. Proof of failure is difficult to obtain, as the replica may simply be slow to respond, [13] or even lie about its status. [5]

Non-faulty replicas will always contain the same State and produce the same Outputs. This invariant enables failure detection by comparing States and Outputs of all replicas. Typically, a replica with State or Output which differs from the majority of replicas is declared faulty.

A common implementation is to pass checksums of the current replica State and recent Outputs among servers. An Audit process at each server restarts the local replica if a deviation is detected. [14] Cryptographic security is not required for checksums.
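A checksum-based audit might look like the sketch below. It assumes the State can be serialized to JSON with sorted keys so that equal States always yield equal bytes, and it uses CRC-32 from the standard zlib module as the non-cryptographic checksum. The function names are hypothetical.

```python
import json
import zlib
from collections import Counter


def state_checksum(state):
    """Non-cryptographic checksum over a canonical encoding of the State."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return zlib.crc32(encoded)


def should_restart(local_id, checksums):
    """True if the local replica's checksum deviates from the majority checksum.

    `checksums` maps replica id -> checksum reported by that replica.
    """
    value, votes = Counter(checksums.values()).most_common(1)[0]
    if votes <= len(checksums) // 2:
        return False  # no majority, so the audit cannot name a faulty replica
    return checksums[local_id] != value


good, bad = {"x": 1, "y": 2}, {"x": 1, "y": 3}
reported = {
    "a": state_checksum(good),
    "b": state_checksum(good),
    "c": state_checksum(bad),
}
```

Here the Audit process on replica "c" would restart its local replica, while those on "a" and "b" would not.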

It is possible that the local server is compromised, or that the Audit process is faulty, and the replica continues to operate incorrectly. This case is handled safely by the Output filter described previously (see Sending Outputs).

Appendix: Extensions

Input Log

In a system with no failures, the Inputs may be discarded after being processed by the State Machine. Realistic deployments must compensate for transient non-failure behaviors of the system such as message loss, network partitions, and slow processors. [14]

One technique is to store the series of Inputs in a log. During times of transient behavior, replicas may request copies of a log entry from another replica in order to fill in missing Inputs. [7]

In general the log is not required to be persistent (it may be held in memory). A persistent log may compensate for extended transient periods, or support additional system features such as Checkpoints, and Reconfiguration.
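Filling gaps from a peer's log can be sketched as follows. The InputLog class is hypothetical and held in memory; a real deployment would fetch the entries over the network rather than reading a peer object directly.

```python
class InputLog:
    """In-memory log of Inputs keyed by their position in the agreed order."""

    def __init__(self):
        self.entries = {}

    def record(self, index, inp):
        self.entries[index] = inp

    def missing(self):
        """Positions below the highest known entry that have no Input yet."""
        if not self.entries:
            return []
        return [i for i in range(max(self.entries) + 1) if i not in self.entries]

    def fill_from(self, peer):
        """Copy any missing entries that the peer's log still holds."""
        for index in self.missing():
            if index in peer.entries:
                self.record(index, peer.entries[index])


healthy, lagging = InputLog(), InputLog()
for index, inp in enumerate(["a", "b", "c", "d"]):
    healthy.record(index, inp)
lagging.record(0, "a")
lagging.record(3, "d")   # entries 1 and 2 were lost, for example to dropped messages
gaps_before = lagging.missing()
lagging.fill_from(healthy)
```

Note that a replica can only detect gaps below the highest position it has seen; learning about later entries still depends on the ordering protocol.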

Checkpoints

If left unchecked, a log will grow until it exhausts all available storage resources. For continued operation, it is necessary to forget log entries. In general a log entry may be forgotten when its contents are no longer relevant (for instance if all replicas have processed an Input, the knowledge of the Input is no longer needed).

A common technique to control log size is to store a duplicate State (called a Checkpoint), then discard any log entries which contributed to the checkpoint. This saves space when the duplicated State is smaller than the size of the log.

Checkpoints may be added to any State Machine by supporting an additional Input called CHECKPOINT. Each replica maintains a checkpoint in addition to the current State value. When the log grows large, a replica submits the CHECKPOINT command just like a client request. The system will ensure non-faulty replicas process this command in the same order, after which all log entries before the checkpoint may be discarded.
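The CHECKPOINT Input can be sketched on a toy key-value replica. The Replica class, the (key, value) Input format, and the in-memory log are assumptions made for the example.

```python
import copy

CHECKPOINT = "CHECKPOINT"


class Replica:
    """Key-value replica that truncates its log whenever it processes CHECKPOINT."""

    def __init__(self):
        self.state = {}
        self.log = []                # (index, Input) pairs not covered by the checkpoint
        self.checkpoint = ({}, -1)   # (copy of State, index at which it was taken)
        self.next_index = 0

    def execute(self, inp):
        index = self.next_index
        self.next_index += 1
        if inp == CHECKPOINT:
            # Every logged entry precedes this Input, so all of them can be discarded.
            self.checkpoint = (copy.deepcopy(self.state), index)
            self.log.clear()
        else:
            key, value = inp
            self.state[key] = value
            self.log.append((index, inp))


replica = Replica()
for inp in [("x", 1), ("y", 2), CHECKPOINT, ("x", 3)]:
    replica.execute(inp)
```

Because CHECKPOINT is ordered like any other Input, every non-faulty replica takes its checkpoint at the same position and therefore stores the same duplicate State.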

In a system with checkpoints, requests for log entries occurring before the checkpoint are ignored. Replicas which cannot locate copies of a needed log entry are faulty and must re-join the system (see Reconfiguration).

Reconfiguration

Reconfiguration allows replicas to be added to and removed from a system while client requests continue to be processed. Planned maintenance and replica failure are common examples of reconfiguration. Reconfiguration involves Quitting and Joining.

Quitting

When a server detects its State or Output is faulty (see Auditing and Failure Detection), it may selectively exit the system. Likewise, an administrator may manually execute a command to remove a replica for maintenance.

A new Input is added to the State Machine called QUIT. [2] [6] A replica submits this command to the system just like a client request. All non-faulty replicas remove the quitting replica from the system upon processing this Input. During this time, the replica may ignore all protocol messages. If a majority of non-faulty replicas remain, the quit is successful. If not, there is a System Failure.

Joining

After quitting, a failed server may selectively restart or re-join the system. Likewise, an administrator may add a new replica to the group for additional capacity.

A new Input is added to the State Machine called JOIN. A replica submits this command to the system just like a client request. All non-faulty replicas add the joining node to the system upon processing this Input. A new replica must be up-to-date on the system's State before joining (see State Transfer).
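Treating QUIT and JOIN as ordinary Inputs means the membership list is itself replicated state. The sketch below assumes Inputs of the form (command, replica id); the Membership class is hypothetical and omits the majority check described under Quitting.

```python
class Membership:
    """Replicated view of which replicas currently belong to the system."""

    def __init__(self, members):
        self.members = set(members)

    def execute(self, inp):
        command, replica_id = inp
        if command == "JOIN":
            self.members.add(replica_id)
        elif command == "QUIT":
            self.members.discard(replica_id)
        return sorted(self.members)


commands = [("QUIT", "r2"), ("JOIN", "r4")]
view_a = Membership(["r1", "r2", "r3"])
view_b = Membership(["r1", "r2", "r3"])
results_a = [view_a.execute(c) for c in commands]
results_b = [view_b.execute(c) for c in commands]
```

Since both views process the same commands in the same order, they agree on the membership after every step, which is the same determinism argument that applies to client requests.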

State Transfer

When a new replica is made available or an old replica is restarted, it must be brought up to the current State before processing Inputs (see Joining). Logically, this requires applying every Input from the dawn of the system in the appropriate order.

Typical deployments short-circuit the logical flow by performing a State Transfer of the most recent Checkpoint (see Checkpoints). This involves directly copying the State of one replica to another using an out-of-band protocol.

A checkpoint may be large, requiring an extended transfer period. During this time, new Inputs may be added to the log. If this occurs, the new replica must also receive the new Inputs and apply them after the checkpoint is received. Typical deployments add the new replica as an observer to the ordering protocol before beginning the state transfer, allowing the new replica to collect Inputs during this period.
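The observer technique can be sketched as a replica that buffers Inputs while the checkpoint is in transit and replays the ones the checkpoint does not cover. The JoiningReplica class and the (key, value) Input format are assumptions; the sketch also assumes the observer sees every Input from the checkpoint position onward.

```python
class JoiningReplica:
    """New replica that observes the ordering protocol during state transfer."""

    def __init__(self):
        self.state = None           # None until the checkpoint has been installed
        self.applied_through = -1   # index of the last Input reflected in the State
        self.buffer = {}            # index -> Input observed during the transfer

    def observe(self, index, inp):
        if self.state is None:
            self.buffer[index] = inp
        else:
            self._apply(index, inp)

    def install_checkpoint(self, state, last_index):
        self.state = dict(state)
        self.applied_through = last_index
        for index in sorted(self.buffer):
            if index > last_index:   # earlier Inputs are already in the checkpoint
                self._apply(index, self.buffer[index])
        self.buffer.clear()

    def _apply(self, index, inp):
        if index != self.applied_through + 1:
            raise ValueError("gap in observed Inputs; the replica must re-join")
        key, value = inp
        self.state[key] = value
        self.applied_through = index


newcomer = JoiningReplica()
newcomer.observe(1, ("y", 2))   # arrives during the transfer
newcomer.observe(2, ("x", 3))
newcomer.observe(3, ("z", 4))
newcomer.install_checkpoint({"x": 1, "y": 2}, last_index=1)
newcomer.observe(4, ("y", 0))   # applied directly once the replica is caught up
```

Registering as an observer before the transfer starts is what guarantees that the buffered Inputs overlap with, or directly follow, the checkpoint.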

Optimizing State Transfer

Common deployments reduce state transfer times by sending only State components which differ. This requires knowledge of the State Machine internals. Since state transfer is usually an out-of-band protocol, this assumption is not difficult to achieve.
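Sending only the differing components can be sketched for a State represented as a flat dictionary. That representation, and the function names state_delta and apply_delta, are assumptions of the example; how the sender learns the receiver's State (for instance by exchanging per-key checksums) is left out.

```python
_MISSING = object()  # sentinel distinguishing "key absent" from any stored value


def state_delta(source, target):
    """Describe what must change in `target` to make it equal to `source`."""
    changed = {k: v for k, v in source.items() if target.get(k, _MISSING) != v}
    removed = [k for k in target if k not in source]
    return changed, removed


def apply_delta(target, changed, removed):
    for key in removed:
        del target[key]
    target.update(changed)


up_to_date = {"a": 1, "b": 2, "c": 3}
stale = {"a": 1, "b": 9, "d": 4}
changed, removed = state_delta(up_to_date, stale)
apply_delta(stale, changed, removed)
```

Only the two changed entries and one deletion cross the wire here, instead of the whole State.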

Compression is another feature commonly added to state transfer protocols, reducing the size of the total transfer.

Leader Election (for Paxos)

Paxos [7] is a protocol for solving consensus, and may be used as the protocol for implementing Consensus Order.

Paxos requires a single leader to ensure liveness. [7] That is, one of the replicas must remain leader long enough to achieve consensus on the next operation of the state machine. System behavior is unaffected if the leader changes after every instance, or if the leader changes multiple times per instance. The only requirement is that one replica remains leader long enough to move the system forward.

Conflict Resolution

In general, a leader is necessary only when there is disagreement about which operation to perform, [11] and if those operations conflict in some way (for instance, if they do not commute). [12]

When conflicting operations are proposed, the leader acts as the single authority to set the record straight, defining an order for the operations, allowing the system to make progress.

With Paxos, multiple replicas may believe they are leaders at the same time. This property makes Leader Election for Paxos very simple, and any algorithm which guarantees an 'eventual leader' will work.
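One simple rule that yields an eventual leader is for each replica to pick the lowest identifier among the replicas it does not currently suspect of having failed. This is a sketch of that rule only, under the assumption that failure suspicions eventually stabilize and agree; the function name and the identifiers are hypothetical.

```python
def pick_leader(replica_ids, suspected):
    """Lowest-numbered replica not currently suspected, or None if all are suspected."""
    candidates = [r for r in sorted(replica_ids) if r not in suspected]
    return candidates[0] if candidates else None


replicas = {"r1", "r2", "r3"}
view_of_r2 = pick_leader(replicas, suspected={"r1"})  # r2 wrongly or rightly suspects r1
view_of_r3 = pick_leader(replicas, suspected=set())   # r3 suspects nobody yet
agreed = pick_leader(replicas, suspected={"r1"})      # r3 after its suspicions catch up
```

While the views differ, two replicas may each believe a different leader is in charge, which Paxos tolerates. Once the suspicion sets converge, all replicas name the same leader.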

Historical background

A number of researchers published articles on the replicated state machine approach in the early 1980s. Anita Borg described an implementation of a fault tolerant operating system based on replicated state machines in a 1983 paper "A message system supporting fault tolerance". Leslie Lamport also proposed the state machine approach, in his 1984 paper on "Using Time Instead of Timeout In Distributed Systems". Fred Schneider later elaborated the approach in his paper "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial".

Ken Birman developed the virtual synchrony model in a series of papers published between 1985 and 1987. The primary reference to this work is "Exploiting Virtual Synchrony in Distributed Systems", which describes the Isis Toolkit, a system that was used to build the New York and Swiss Stock Exchanges, French Air Traffic Control System, US Navy AEGIS Warship, and other applications.

Recent work by Miguel Castro and Barbara Liskov used the state machine approach in what they call a "Practical Byzantine fault tolerance" architecture that replicates especially sensitive services using a version of Lamport's original state machine approach, but with optimizations that substantially improve performance.

Most recently, there has also been the creation of the BFT-SMaRt library, [15] a high-performance Byzantine fault-tolerant state machine replication library developed in Java. This library implements a protocol very similar to PBFT's, plus complementary protocols which offer state transfer and on-the-fly reconfiguration of hosts (i.e., JOIN and LEAVE operations). BFT-SMaRt is the most recent effort to implement state machine replication, still being actively maintained.

Raft, a consensus-based algorithm, was developed in 2013.

Motivated by PBFT, Tendermint BFT [16] was introduced for partially synchronous networks, and it is mainly used for Proof of Stake blockchains.

References

  1. Schneider, Fred (1990). "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial" (PS). ACM Computing Surveys. 22 (4): 299–319. CiteSeerX 10.1.1.69.1536. doi:10.1145/98163.98167. S2CID 678818.
  2. Lamport, Leslie (1978). "The Implementation of Reliable Distributed Multiprocess Systems". Computer Networks. 2 (2): 95–114. doi:10.1016/0376-5075(78)90045-4. Retrieved 2008-03-13.
  3. Lamport, Leslie (2004). "Lower Bounds for Asynchronous Consensus".
  4. Lamport, Leslie; Mike Massa (2004). "Cheap Paxos". International Conference on Dependable Systems and Networks, 2004. pp. 307–314. doi:10.1109/DSN.2004.1311900. ISBN 978-0-7695-2052-0. S2CID 1325830.
  5. Lamport, Leslie; Robert Shostak; Marshall Pease (July 1982). "The Byzantine Generals Problem". ACM Transactions on Programming Languages and Systems. 4 (3): 382–401. CiteSeerX 10.1.1.64.2312. doi:10.1145/357172.357176. S2CID 55899582. Retrieved 2007-02-02.
  6. Lamport, Leslie (1984). "Using Time Instead of Timeout for Fault-Tolerant Distributed Systems". ACM Transactions on Programming Languages and Systems. 6 (2): 254–280. CiteSeerX 10.1.1.71.1078. doi:10.1145/2993.2994. S2CID 402171. Retrieved 2008-03-13.
  7. Lamport, Leslie (May 1998). "The Part-Time Parliament". ACM Transactions on Computer Systems. 16 (2): 133–169. doi:10.1145/279227.279229. S2CID 421028. Retrieved 2007-02-02.
  8. Birman, Kenneth; Thomas Joseph (1987). "Exploiting virtual synchrony in distributed systems". ACM SIGOPS Operating Systems Review. 21 (5): 123–138. doi:10.1145/37499.37515. hdl:1813/6651.
  9. Lampson, Butler (1996). "How to Build a Highly Available System Using Consensus". Retrieved 2008-03-13.
  10. Lamport, Leslie (July 1978). "Time, Clocks and the Ordering of Events in a Distributed System". Communications of the ACM. 21 (7): 558–565. doi:10.1145/359545.359563. S2CID 215822405. Retrieved 2007-02-02.
  11. Lamport, Leslie (2005). "Fast Paxos".
  12. Lamport, Leslie (2005). "Generalized Consensus and Paxos".
  13. Fischer, Michael J.; Nancy A. Lynch; Michael S. Paterson (1985). "Impossibility of Distributed Consensus with One Faulty Process". Journal of the Association for Computing Machinery. 32 (2): 347–382. doi:10.1145/3149.214121. S2CID 207660233. Retrieved 2008-03-13.
  14. Chandra, Tushar; Robert Griesemer; Joshua Redstone (2007). "Paxos made live". Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing (PDF). pp. 398–407. doi:10.1145/1281100.1281103. ISBN 9781595936165. S2CID 207164635.
  15. BFT-SMaRt. Google Code repository for the BFT-SMaRt replication library.
  16. Buchman, E.; Kwon, J.; Milosevic, Z. (2018). "The latest gossip on BFT consensus". arXiv:1807.04938 [cs.DC].