Self-stabilization

Last updated

Self-stabilization is a concept of fault-tolerance in distributed systems. Given any initial state, a self-stabilizing distributed system will end up in a correct state in a finite number of execution steps.

Contents

At first glance, the guarantee of self stabilization may seem less promising than that of the more traditional fault-tolerance of algorithms, that aim to guarantee that the system always remains in a correct state under certain kinds of state transitions. However, that traditional fault tolerance cannot always be achieved. For example, it cannot be achieved when the system is started in an incorrect state or is corrupted by an intruder. Moreover, because of their complexity, it is very hard to debug and to analyze distributed systems. Hence, it is very hard to prevent a distributed system from reaching an incorrect state. Indeed, some forms of self-stabilization are incorporated into many modern computer and telecommunications networks, since it gives them the ability to cope with faults that were not foreseen in the design of the algorithm.

Many years after the seminal paper of Edsger Dijkstra in 1974, this concept remains important as it presents an important foundation for self-managing computer systems and fault-tolerant systems. As a result, Dijkstra's paper received the 2002 ACM PODC Influential-Paper Award, one of the highest recognitions in the distributed computing community. [1] Moreover, after Dijkstra's death, the award was renamed and is now called the Dijkstra Award.

History

E.W. Dijkstra in 1974 presented the concept of self-stabilization, prompting further research in this area. [2] His demonstration involved the presentation of self-stabilizing mutual exclusion algorithms. [3] It also showed the first self-stabilizing algorithms that did not rely on strong assumptions on the system. Some previous protocols used in practice did actually stabilize, but only assuming the existence of a clock that was global to the system, and assuming a known upper bound on the duration of each system transition. It was only ten years later when Leslie Lamport pointed out the importance of Dijkstra's work at a 1983 conference called Symposium on Principles of Distributed Computing that researchers [4] directed their attention to this elegant fault-tolerance concept. In his talk, Lamport stated:

I regard this as Dijkstra's most brilliant work - at least, his most brilliant published paper. It's almost completely unknown. I regard it to be a milestone in work on fault tolerance... I regard self-stabilization to be a very important concept in fault tolerance and to be a very fertile field for research. [3]

Afterwards, Dijkstra's work was awarded ACM-PODC influential paper award, which then became ACM's (the Association for computing Machinery) Dijkstra Prize in Distributed Computing given at the annual ACM-PODC symposium. [5]

Overview

A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legitimate state and remain in a legitimate set of states thereafter. A state is legitimate if, starting from this state, the algorithm satisfies its specification. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its nature. Moreover, a self-stabilizing algorithm does not have to be initialized as it eventually starts to behave correctly regardless of its initial state.

Dijkstra's paper, which introduces the concept of self-stabilization, presents an example in the context of a "token ring"a network of computers ordered in a circle. Here, each computer or processor can "see" the whole state of one processor that immediately precedes it and that this state may imply that the processor "has a token" or it "does not have a token." [5] [6] One of the requirements is that exactly one of them must "hold a token" at any given time. The second requirement prescribes that each node "passes the token" to the computer/processor succeeding it so that the token eventually circulates the ring. [5] [6]

The first self-stabilizing algorithms did not detect errors explicitly in order to subsequently repair them. Instead, they constantly pushed the system towards a legitimate state. Since traditional methods for detecting an error [7] were often very difficult and time-consuming, such a behavior was considered desirable. (The method described in the paper cited above collects a huge amount of information from the whole network to one place; after that, it attempts to determine whether the collected global state is correct; even that determination alone can be a hard task).

Efficiency improvements

More recently, researchers have presented newer methods for light-weight error detection for self-stabilizing systems using local checking. [8] [9] and for general tasks. [10]

The term local refers to a part of a computer network. When local detection is used, a computer in a network is not required to communicate with the entire network in order to detect an errorthe error can be detected by having each computer communicate only with its nearest neighbors. These local detection methods simplified the task of designing self-stabilizing algorithms considerably. This is because the error detection mechanism and the recovery mechanism can be designed separately. Newer algorithms based on these detection methods also turned out to be much more efficient. Moreover, these papers suggested rather efficient general transformers to transform non self stabilizing algorithms to become self stabilizing. The idea is to,

  1. Run the non self stabilizing protocol, at the same time,
  2. detect faults (during the execution of the given protocol) using the above-mentioned detection methods,
  3. then, apply a (self stabilizing) "reset" protocol to return the system to some predetermined initial state, and, finally,
  4. restart the given (non- self stabilizing) protocol.

The combination of these 4 parts is self stabilizing (as long as there is no trigger of fault during the correction fault phases, e.g., [11] ). Initial self stabilizing protocols were also presented in the above papers. More efficient reset protocols were presented later, e.g. [12]

Additional efficiency was introduced with the notion of time-adaptive protocols. [13] The idea behind these is that when only a small number of errors occurs, the recovery time can (and should) be made short. Dijkstra's original self-stabilization algorithms do not have this property.

A useful property of self-stabilizing algorithms is that they can be composed of layers if the layers do not exhibit any circular dependencies. The stabilization time of the composition is then bounded by the sum of the individual stabilization times of each layer. [6]

New approaches to Dijkstra's work emerged later on such as the case of Krzysztof Apt and Ehsan Shoja's proposition, which demonstrated how self-stabilization can be naturally formulated using the standard concepts of strategic games, particularly the concept of an improvement path. [14] This particular work sought to demonstrate the link between self-stabilization and game theory.

Time complexity

The time complexity of a self-stabilizing algorithm is measured in (asynchronous) rounds or cycles.

To measure the output stabilization time, a subset of the state variables is defined to be externally visible (the output). Certain states of outputs are defined to be correct (legitimate). The set of the outputs of all the components of the system is said to have stabilized at the time that it starts to be correct, provided it stays correct indefinitely, unless additional faults occur. The output stabilization time is the time (the number of (asynchronous) rounds) until the output stabilizes. [8]

Definition

A system is self-stabilizing if and only if:

  1. Starting from any state, it is guaranteed that the system will eventually reach a correct state (convergence).
  2. Given that the system is in a correct state, it is guaranteed to stay in a correct state, provided that no fault happens (closure).

A system is said to be randomized self-stabilizing if and only if it is self-stabilizing and the expected number of rounds needed to reach a correct state is bounded by some constant . [15]

Design of self-stabilization in the above-mentioned sense is well known to be a difficult job. In fact, a class of distributed algorithms do not have the property of local checking: the legitimacy of the network state cannot be evaluated by a single process. The most obvious case is Dijkstra's token-ring defined above: no process can detect whether the network state is legitimate or not in the case where more than one token is present in non-neighboring processes. This suggests that self-stabilization of a distributed system is a sort of collective intelligence where each component is taking local actions, based on its local knowledge but eventually this guarantees global convergence at the end.

To help overcome the difficulty of designing self-stabilization as defined above, other types of stabilization were devised. For instance, weak stabilization is the property that a distributed system has a possibility to reach its legitimate behavior from every possible state. [16] Weak stabilization is easier to design as it just guarantees a possibility of convergence for some runs of the distributed system rather than convergence for every run.

A self-stabilizing algorithm is silent if and only if it converges to a global state where the values of communication registers used by the algorithm remain fixed. [17]

An extension of the concept of self-stabilization is that of superstabilization. [18] The intent here is to cope with dynamic distributed systems that undergo topological changes. In classical self-stabilization theory, arbitrary changes are viewed as errors where no guarantees are given until the system has stabilized again. With superstabilizing systems, there is a passage predicate that is always satisfied while the system's topology is reconfigured.

Related Research Articles

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. Distributed computing is a field of computer science that studies distributed systems.

<span class="mw-page-title-main">Edsger W. Dijkstra</span> Dutch computer scientist (1930–2002)

Edsger Wybe Dijkstra was a Dutch computer scientist, programmer, software engineer, and science essayist.

<span class="mw-page-title-main">Leslie Lamport</span> American computer scientist and mathematician

Leslie B. Lamport is an American computer scientist and mathematician. Lamport is best known for his seminal work in distributed systems, and as the initial developer of the document preparation system LaTeX and the author of its first manual.

A Byzantine fault is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine generals problem", developed to describe a situation in which, to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.

Fault tolerance is the resilient property that enables a system to continue operating properly in the event of failure or major dysfunction in one or more of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.

Nancy Ann Lynch is a computer scientist affiliated with the Massachusetts Institute of Technology. She is the NEC Professor of Software Science and Engineering in the EECS department and heads the "Theory of Distributed Systems" research group at MIT's Computer Science and Artificial Intelligence Laboratory.

Superstabilization is a concept of fault-tolerance in distributed computing. Superstabilizing distributed algorithms combine the features of self-stabilizing algorithms and dynamic algorithms. A superstabilizing algorithm – just like any other self-stabilizing algorithm – can be started in an arbitrary state, and it will eventually converge to a legitimate state. Additionally, a superstabilizing algorithm will recover rapidly from a single change in the network topology.

<span class="mw-page-title-main">Shlomi Dolev</span>

Shlomi Dolev is a Rita Altura Trust Chair Professor in Computer Science at Ben-Gurion University of the Negev (BGU) and the head of the BGU Negev Hi-Tech Faculty Startup Accelerator.

ACM SIGACT or SIGACT is the Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory, whose purpose is support of research in theoretical computer science. It was founded in 1968 by Patrick C. Fischer.

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts. Real-world applications often requiring consensus include cloud computing, clock synchronization, PageRank, opinion formation, smart power grids, state estimation, control of UAVs, load balancing, blockchain, and others.

In computer science, state machine replication (SMR) or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols.

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task has begun, all network nodes are either unaware which node will serve as the "leader" of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.

Fault Tolerant Messaging in the context of computer systems and networks, refers to a design approach and set of techniques aimed at ensuring reliable and continuous communication between components or nodes even in the presence of errors or failures. This concept is especially critical in distributed systems, where components may be geographically dispersed and interconnected through networks, making them susceptible to various potential points of failure.

Keith Marzullo is the inventor of Marzullo's algorithm, which is part of the basis of the Network Time Protocol and the Windows Time Service. On August 1, 2016 he became the Dean of the University of Maryland College of Information Studies after serving as the Director of the NITRD National Coordination Office. Prior to this he was a Professor in the Department of Computer Science and Engineering at University of California, San Diego. In 2011 he was inducted as a Fellow of the Association for Computing Machinery.

The ACM Symposium on Principles of Distributed Computing (PODC) is an academic conference in the field of distributed computing organised annually by the Association for Computing Machinery.

<span class="mw-page-title-main">Cynthia Dwork</span> American computer scientist

Cynthia Dwork is an American computer scientist best known for her contributions to cryptography, distributed computing, and algorithmic fairness. She is one of the inventors of differential privacy and proof-of-work.

The Brooks–Iyengar algorithm or FuseCPA Algorithm or Brooks–Iyengar hybrid algorithm is a distributed algorithm that improves both the precision and accuracy of the interval measurements taken by a distributed sensor network, even in the presence of faulty sensors. The sensor network does this by exchanging the measured value and accuracy value at every node with every other node, and computes the accuracy range and a measured value for the whole network from all of the values collected. Even if some of the data from some of the sensors is faulty, the sensor network will not malfunction. The algorithm is fault-tolerant and distributed. It could also be used as a sensor fusion method. The precision and accuracy bound of this algorithm have been proved in 2016.

A distributed operating system is system software over a collection of independent software, networked, communicating, and physically separate computational nodes. They handle jobs which are serviced by multiple CPUs. Each individual node holds a specific software subset of the global aggregate operating system. Each subset is a composite of two distinct service provisioners. The first is a ubiquitous minimal kernel, or microkernel, that directly controls that node's hardware. Second is a higher-level collection of system management components that coordinate the node's individual and collaborative activities. These components abstract microkernel functions and support user applications.

Daniel (Danny) Dolev is an Israeli computer scientist known for his research in cryptography and distributed computing. He holds the Berthold Badler Chair in Computer Science at the Hebrew University of Jerusalem and is a member of the scientific council of the European Research Council.

Michel Raynal is a French informatics scientist, professor at IRISA, University of Rennes, France. He is known for his contributions in the fields of algorithms, computability, and fault-tolerance in the context of concurrent and distributed systems. Michel Raynal is also Distinguished Chair professor at the Hong Kong Polytechnic University and editor of the “Synthesis Lectures on Distributed Computing Theory” published by Morgan & Claypool. He is a senior member of Institut Universitaire de France and a member of Academia Europaea.

References

  1. "PODC Influential Paper Award: 2002", ACM Symposium on Principles of Distributed Computing, retrieved 2009-09-01
  2. Dijkstra, Edsger W. (1974), "Self-stabilizing systems in spite of distributed control" (PDF), Communications of the ACM , 17 (11): 643–644, doi:10.1145/361179.361202, S2CID   11101426 .
  3. 1 2 Dolev, Shlomi (2000). Self-stabilization. Cambridge, MA: The MIT Press. p. 3. ISBN   978-0262041782.
  4. Lamport, Leslie (1985), "Solved problems, unsolved problems, and non-problems in concurrency" (PDF), ACM Special Interest Group Operating Systems Review, 19 (4): 34–44, doi:10.1145/858336.858339, S2CID   228819 .
  5. 1 2 3 Chaudhuri, Soma; Das, Samir; Paul, Himadri; Tirthapura, Srikanta (2007). Distributed Computing and Networking: 8th International Conference, ICDCN 2006, Guwahati, India, December 27-30, 2006, Proceedings. Berlin: Springer. p. 108. ISBN   978-3540681397.
  6. 1 2 3 Shlomi Dolev, Shlomo Moran, Amos Israeli: Self-Stabilization of Dynamic Systems Assuming only Read/Write Atomicity. Distributed Computing, volume 7, pages3–16(1993).
  7. Katz, Shmuel; Perry, Kenneth J. (1993), "Self-stabilizing extensions for meassage-passing systems", Distributed Computing , 7 (1): 17–26, doi:10.1007/BF02278852, S2CID   37245790 .
  8. 1 2 Awerbuch, Baruch; Patt-Shamir, Boaz; Varghese, George (1991), "Self-stabilization by local checking and correction", Proc. 32nd Symposium on Foundations of Computer Science (FOCS), pp. 268–277, CiteSeerX   10.1.1.211.8704 , doi:10.1109/SFCS.1991.185378, ISBN   978-0-8186-2445-2, S2CID   8320293 .
  9. Afek, Yehuda; Kutten, Shay; Yung, Moti (1997), "The local detection paradigm and its applications to self-stabilization", Theoretical Computer Science , 186 (1–2): 199–229, doi: 10.1016/S0304-3975(96)00286-1 , MR   1478668 .
  10. Shlomi Dolev, Yehuda Afek: Local Stabilizer. Journal of Parallel and Distributed Computing Volume 62, Issue 5, May 2002, Pages 745-765.
  11. Baruch Awerbuch, Boaz Patt-Shamir, George Varghese, Shlomi Dolev. Self-Stabilization by Local Checking and Global Reset. WDAG 1994: 326-339.
  12. [Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-Shamir, George Varghese. Time optimal self-stabilizing synchronization. ACM STOC 1993: 652-661.]
  13. Shay Kutten, Boaz Patt-Shamir: Stabilizing Time-Adaptive Protocols. Theor. Comput. Sci. 220(1): 93-111 (1999).
  14. de Boer, Frank; Bonsangue, Marcello; Rutten, Jan (2018). Apt. Cham: Springer. p. 22. ISBN   9783319900889.
  15. Dolev, Shlomi (2000), Self-Stabilization, MIT Press, ISBN   978-0-262-04178-2 .
  16. Gouda, Mohamed (1995), The Triumph and Tribulation of System Stabilization, Proceedings of the 9th international workshop on distributed algorithms..
  17. Shlomi Dolev, Mohamed G. Gouda, and Marco Schneider. Memory requirements for silent stabilization. In PODC '96: Proceedings of the fifteenth annual ACM Symposium on Principles of Distributed Computing, pages 27--34, New York, NY, USA, 1996. ACM Press. Online extended abstract.
  18. Dolev, Shlomi; Herman, Ted (1997), "Superstabilizing protocols for dynamic distributed systems", Chicago Journal of Theoretical Computer Science, 3: 1–40, doi: 10.4086/cjtcs.1997.004 , article 4.