Reachability analysis

Last updated January 15, 2021

Reachability analysis is a solution to the reachability problem in the particular context of distributed systems. It is used to determine which global states can be reached by a distributed system which consists of a certain number of local entities that communicated by the exchange of messages.

Overview

Reachability analysis was introduced in a paper of 1978 for the analysis and verification of communication protocols.^[1] This paper was inspired by a paper by Bartlett et al. of 1968 ^[2] which presented the alternating bit protocol using finite-state modeling of the protocol entities, and also pointed out that a similar protocol described earlier had a design flaw. This protocol belongs to the Link layer and, under certain assumptions, provides as service the correct data delivery without loss nor duplication, despite the occasional presence of message corruption or loss.

For reachability analysis, the local entities are modeled by their states and transitions. An entity changes state when it sends a message, consumes a received message, or performs an interaction at its local service interface. The global state $s=(s_{1},s_{2},...,s_{n},medium)$ of a system with n entities ^[3] is determined by the states $s_{i}$ (i=1, ... n) of the entities and the state of the communication $medium$ . In the simplest case, the medium between two entities is modeled by two FIFO queues in opposite directions, which contain the messages in transit (that are sent, but not yet consumed). Reachability analysis considers the possible behavior of the distributed system by analyzing all possible sequences of state transitions of the entities, and the corresponding global states reached.^[4]

The result of reachability analysis is a global state transition graph (also called reachability graph) which shows all global states of the distributed system that are reachable from the initial global state, and all possible sequences of send, consume and service interactions performed by the local entities. However, in many cases this transition graph is unbounded and can not be explored completely. The transition graph can be used for checking general design flaws of the protocol (see below), but also for verifying that the sequences of service interactions by the entities correspond to the requirements given by the global service specification of the system.^[1]

Protocol properties

Boundedness: The global state transition graph is bounded if the number of messages that may be in transit is bounded and the number states of all entities is bounded. The question whether the number of messages remains bounded in the case of finite state entities is in general not decidable.^[5] One usually truncates the exploration of the transition graph when the number of messages in transit reaches a given threshold.

The following are design flaws:

Global deadlock: The system is in a global deadlock if all entities wait for the consumption of a message and no message is in transit. Absence of global deadlocks can be verified by checking that no state in the reachability graph is a global deadlock.
Partial deadlocks: An entity is in a deadlocked state if it waits for the consumption of a message and the system is in a global state where such a message is not in transit and will never be sent in any global state that can be reached in the future. Such a non-local property can be verified by performing model checking on the reachability graph.
Unspecified reception: An entity has an unspecified reception if the next message to be consumed is not expected by the behavior specification of the entity in its current state. The absence of this condition can be verified by checking all states in the reachability graph.

An example

As an example, we consider the system of two protocol entities that exchange the messages ma, mb, mc and md with one another, as shown in the first diagram. The protocol is defined by the behavior of the two entities, which is given in the second diagram in the form of two state machines. Here the symbol "!" means sending a message, and "?" means consuming a received message. The initial states are the states "1".

The third diagram shows the result of the reachability analysis for this protocol in the form of a global state machine. Each global state has four components: the state of protocol entity A (left), the state of the entity B (right) and the messages in transit in the middle (upper part: from A to B; lower part: from B to A). Each transition of this global state machine corresponds to one transition of protocol entity A or entity B. The initial state is [1, - - , 1] (no messages in transit).

One sees that this example has a bounded global state space - the maximum number of messages that may be in transit at the same time is two. This protocol has a global deadlock, which is the state [2, - - , 3]. If one removes the transition of A in state 2 for consuming message mb, there will be an unspecified reception in the global states [2, ma mb ,3] and [2, - mb ,3].

Message transmission

The design of a protocol has to be adapted to the properties of the underlying communication medium, to the possibility that the communication partner fails, and to the mechanism used by an entity to select the next message for consumption. The communication medium for protocols at the Link level is normally not reliable and allows for erroneous reception and message loss (modeled as a state transition of the medium). Protocols using the Internet IP service should also deal with the possibility of out-of-order delivery. Higher-level protocols normally use a session-oriented Transport service which means that the medium provides reliable FIFO transmission of messages between any pair of entities. However, in the analysis of distributed algorithms, one often takes into account the possibility that some entity fails completely, which is normally detected (like a loss of message in the medium) by a timeout mechanism when an expected message does not arrive.

Different assumptions have been made about whether an entity can select a particular message for consumption when several messages have arrived and are ready for consumption. The basic models are the following:

Single input queue: Each entity has a single FIFO queue where incoming messages are stored until they are consumed. Here the entity has no selection power and has to consume the first message in the queue.
Multiple queues: Each entity has multiple FIFO queues, one for each communicating partner. Here the entity has the possibility to decide, depending on its state, from which queue (or queues) the next input message should be consumed.
Reception pool: Each entity has a single pool where received messages are stored until they are consumed. Here the entity has the power to decide, depending on its state, which type of message should be consumed next (and wait for a message if none has been received yet), or possibly consume one from a set of message types (in order to deal with alternatives).

The original paper identifying the problem of unspecified receptions,^[6] and much of the subsequent work, assumed a single input queue.^[7] Sometimes, unspecified receptions are introduced by a race condition, which means that two messages are received and their order is not defined (which is often the case if they come from different partners). Many of these design flaws disappear when multiple queues or reception pools are used.^[8] With the systematic use of reception pools, reachability analysis should check for partial deadlocks and messages remaining forever in the pool (without being consumed by the entity) ^[9]

Practical issues

Most of the work on protocol modeling use finite-state machines (FSM) to model the behavior of the distributed entities (see also Communicating finite-state machines). However, this model is not powerful enough to model message parameters and local variables. Therefore often so-called extended FSM models are used, such as supported by languages like SDL or UML state machines. Unfortunately, reachability analysis becomes much more complex for such models.

A practical issue of reachability analysis is the so-called ″state space explosion″. If the two entities of a protocol have 100 states each, and the medium may include 10 types of messages, up to two in each direction, then the number of global states in the reachability graph is bound by the number 100 x 100 x (10 x 10) x (10 x 10) which is 100 million. Therefore a number of tools have been developed to automatically perform reachability analysis and model checking on the reachability graph. We mention only two examples: The SPIN model checker and a toolbox for the construction and analysis of distributed processes.

References and notes

1 2 Bochmann, G.v. "Finite State Description of Communication Protocols, Computer Networks, Vol. 2 (1978), pp. 361-372".Cite journal requires |journal= (help)
↑ K.A. Bartlett, R.A. Scantlebury and P.T. Wilkinson, A note on reliable full-duplex transmission over half- duplex links, C.ACM 12, 260 (1969).
↑ Note: In the case of protocol analysis, there are normally only two entities.
↑ Note: The corruption or loss of a message is modeled as a state transition of the $medium$ .
↑ M.G.Gouda, E.G.Manning, Y.T.Yu: On the progress of communication between two finite state machines, doi
↑ P. Zafiropulo, C. West, H. Rudin, D. Cowan, D. Brand : Towards analyzing and synthesizing protocols, IEEE Transactions on Communications ( Volume: 28, Issue: 4, Apr 1980 )
↑ Note: The SAVE construct of SDL can be used to indicate that certain types of messages should not be consumed in the current state, but saved for future processing.
↑ M.F. Al-hammouri and G.v. Bochmann : Realizability of service specifications, Proc. System Analysis and Modelling (SAM) conference 2018, Copenhagen, LNCS, Springer
↑ C. Fournet, T. Hoare, S. K. Rajamani, and J. Rehof : Stuck-free Conformance, Proc. 16th Intl. Conf. on Computer Aided Verification (CAV’04), LNCS, vol. 3114, Springer, 2004

Related Research Articles

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

FIFO – an acronym for first in, first out – in computing and in systems theory, is a method for organising the manipulation of a data structure – often, specifically a data buffer – where the oldest (first) entry, or 'head' of the queue, is processed first.

OSI model Model of communication of seven abstraction layers

The Open Systems Interconnection model is a conceptual model that characterises and standardises the communication functions of a telecommunication or computing system without regard to its underlying internal structure and technology. Its goal is the interoperability of diverse communication systems with standard communication protocols.

A Petri net, also known as a place/transition (PT) net, is one of several mathematical modeling languages for the description of distributed systems. It is a class of discrete event dynamic system. A Petri net is a directed bipartite graph that has two types of elements, places and transitions, depicted as white circles and rectangles, respectively. A place can contain any number of tokens, depicted as black circles. A transition is enabled if all places connected to it as inputs contain at least one token.. Some sources state that Petri nets were invented in August 1939 by Carl Adam Petri—at the age of 13—for the purpose of describing chemical processes.

In computer science, model checking or property checking is a method for checking whether a finite-state model of a system meets a given specification. This is typically associated with hardware or software systems, where the specification contains liveness requirements as well as safety requirements.

The actor model in computer science is a mathematical model of concurrent computation that treats actor as the universal primitive of concurrent computation. In response to a message it receives, an actor can: make local decisions, create more actors, send more messages, and determine how to respond to the next message received. Actors may modify their own private state, but can only affect each other indirectly through messaging.

Kahn process networks is a distributed model of computation where a group of deterministic sequential processes are communicating through unbounded FIFO channels. The resulting process network exhibits deterministic behavior that does not depend on the various computation or communication delays. The model was originally developed for modeling distributed systems but has proven its convenience for modeling signal processing systems. As such, KPNs have found many applications in modeling embedded systems, high-performance computing systems, and other computational tasks. KPNs were first introduced by Gilles Kahn.

Commitment ordering (CO) is a class of interoperable serializability techniques in concurrency control of databases, transaction processing, and related applications. It allows optimistic (non-blocking) implementations. With the proliferation of multi-core processors, CO has been also increasingly utilized in concurrent programming, transactional memory, and especially in software transactional memory (STM) for achieving serializability optimistically. CO is also the name of the resulting transaction schedule (history) property, which was originally defined in 1988 with the name dynamic atomicity. In a CO compliant schedule the chronological order of commitment events of transactions is compatible with the precedence order of the respective transactions. CO is a broad special case of conflict serializability, and effective means to achieve global serializability across any collection of database systems that possibly use different concurrency control mechanisms.

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts. Real-world applications often requiring consensus include cloud computing, clock synchronization, PageRank, opinion formation, smart power grids, state estimation, control of UAVs, load balancing, blockchain, and others.

In computing, the producer–consumer problem is a classic example of a multi-process synchronization problem, the first version of which was proposed by Edsger W. Dijkstra in 1965 in his unpublished manuscript, in which the buffer was unbounded, and subsequently published with a bounded buffer in 1972. In the first version of the problem, there are two cyclic processes, a producer and a consumer, which share a common, fixed-size buffer used as a queue. The producer repeatedly generates data and writes it into the buffer. The consumer repeatedly reads the data in the buffer, removing it in the course of reading it, and using that data in some way. In the first version of the problem, with an unbounded buffer, the problem is how to design the producer and consumer code so that, in their exchange of data, no data is lost or duplicated, data is read by the consumer in the order it is written by the producer, and both processes make as much progress as possible. In the later formulation of the problem, Dijkstra proposed multiple producers and consumers sharing a finite collection of buffers. This added the additional problem of preventing producers from trying to write into buffers when all were full, and trying to prevent consumers from reading a buffer when all were empty.

PROMELA is a verification modeling language introduced by Gerard J. Holzmann. The language allows for the dynamic creation of concurrent processes to model, for example, distributed systems. In PROMELA models, communication via message channels can be defined to be synchronous, or asynchronous. PROMELA models can be analyzed with the SPIN model checker, to verify that the modeled system produces the desired behavior. An implementation verified with Isabelle/HOL is also available, as part of the Computer Aided Verification of Automata project. Files written in Promela traditionally have a .pml file extension.

Mihalis Yannakakis is Professor of Computer Science at Columbia University. He is noted for his work in computational complexity, databases, and other related fields. He won the Donald E. Knuth Prize in 2005.

CADP is a toolbox for the design of communication protocols and distributed systems. CADP is developed by the CONVECS team at INRIA Rhone-Alpes and connected to various complementary tools. CADP is maintained, regularly improved, and used in many industrial projects.

A communication protocol is a system of rules that allow two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchronization of communication and possible error recovery methods. Protocols may be implemented by hardware, software, or a combination of both.

Input/output automata provide a formal model, applicable in describing most types of asynchronous concurrent system. On its own, the I/O automaton model contains a very basic structure that enables it to model various types of distributed systems. To describe specific types of asynchronous systems, additional structure must be added to this basic model. The model presents an explicit method for describing and reasoning about system components such as processes and message channels that interact with one another, operating at arbitrary relative speeds. The I/O automata were first introduced by Nancy A. Lynch and Mark R. Tuttle in "Hierarchical correctness proofs for distributed algorithms", 1987.

In computer science, a communicating finite-state machine is a finite state machine labeled with "receive" and "send" operations over some alphabet of channels. They were introduced by Brand and Zafiropulo, and can be used as a model of concurrent processes like Petri nets. Communicating finite state machines are used frequently for modeling a communication protocol since they make it possible to detect major protocol design errors, including boundedness, deadlocks, and unspecified receptions.

In concurrent computing, liveness refers to a set of properties of concurrent systems, that require a system to make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections, parts of the program that cannot be simultaneously run by multiple processes. Liveness guarantees are important properties in operating systems and distributed systems.

Protocol engineering is the application of systematic methods to the development of communication protocols. It uses many of the principles of software engineering, but it is specific to the development of distributed systems.

Gregor von Bochmann is a German-Canadian computer scientist and emeritus professor of the Université de Montréal and the University of Ottawa. He is known for his work in the area of protocol engineering and distributed applications.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Boch78-1] 1 2 Bochmann, G.v. "Finite State Description of Communication Protocols, Computer Networks, Vol. 2 (1978), pp. 361-372".Cite journal requires |journal= (help)

[2] K.A. Bartlett, R.A. Scantlebury and P.T. Wilkinson, A note on reliable full-duplex transmission over half- duplex links, C.ACM 12, 260 (1969).

[3] Note: In the case of protocol analysis, there are normally only two entities.

[4] Note: The corruption or loss of a message is modeled as a state transition of the $medium$ .

[5] M.G.Gouda, E.G.Manning, Y.T.Yu: On the progress of communication between two finite state machines, doi

[6] P. Zafiropulo, C. West, H. Rudin, D. Cowan, D. Brand : Towards analyzing and synthesizing protocols, IEEE Transactions on Communications ( Volume: 28, Issue: 4, Apr 1980 )

[7] Note: The SAVE construct of SDL can be used to indicate that certain types of messages should not be consumed in the current state, but saved for future processing.

[8] M.F. Al-hammouri and G.v. Bochmann : Realizability of service specifications, Proc. System Analysis and Modelling (SAM) conference 2018, Copenhagen, LNCS, Springer

[9] C. Fournet, T. Hoare, S. K. Rajamani, and J. Rehof : Stuck-free Conformance, Proc. 16th Intl. Conf. on Computer Aided Verification (CAV’04), LNCS, vol. 3114, Springer, 2004

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]