Terminating Reliable Broadcast

Terminating Reliable Broadcast (TRB) is a problem in distributed computing that encapsulates the task of broadcasting a message to a set of receiving processes in the presence of faults. [1] In particular, the sender and any other process might fail ("crash") at any time.

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

In computing, a process is the instance of a computer program that is being executed. It contains the program code and its activity. Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently.

Problem description

A TRB protocol typically organizes the system into a sending process and a set of receiving processes, which may include the sender itself. A process is called "correct" if it does not fail at any point during its execution. The goal of the protocol is to transfer data (the "message") from the sender to the set of receiving processes. A process may perform many I/O operations during protocol execution, but eventually "delivers" a message by passing it to the application on that process that invoked the TRB protocol.
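
The shape of this interface can be sketched in a few lines. The following Python fragment is a hedged illustration only; the names (TRB, on_deliver, SF) are assumptions made for exposition, not part of any standard API.

```python
SF = object()  # unique sentinel standing in for the "sender faulty" message

class TRB:
    """One instance per broadcast; exactly one process acts as the sender."""

    def __init__(self, process_id, sender_id, on_deliver):
        self.process_id = process_id
        self.sender_id = sender_id
        self.on_deliver = on_deliver   # callback into the invoking application
        self.delivered = False

    def broadcast(self, message):
        """Called only on the sender to start the protocol."""
        assert self.process_id == self.sender_id
        # protocol-specific sends to the receiving processes go here

    def _deliver(self, message):
        """Fires at most once; message is the sender's data or SF."""
        if not self.delivered:
            self.delivered = True
            self.on_deliver(message)
```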

In computing, input/output or I/O is the communication between an information processing system, such as a computer, and the outside world, possibly a human or another information processing system. Inputs are the signals or data received by the system and outputs are the signals or data sent from it. The term can also be used as part of an action; to "perform I/O" is to perform an input or output operation.

The protocol must provide important guarantees to the receiving processes. All correct receiving processes, for example, must deliver the sender's message if the sender is also correct. A receiving process may deliver a special message, SF ("sender faulty"), if the sender failed, but either all correct processes will deliver SF or none will. A correct process is therefore guaranteed that data delivered to it was also delivered to all other correct processes.

More precisely, a TRB protocol must satisfy the four formal properties below:

- Termination: every correct process eventually delivers exactly one message.
- Validity: if the sender is correct and broadcasts a message m, then all correct processes eventually deliver m.
- Integrity: a correct process delivers a message at most once, and if it delivers some message m ≠ SF, then m was previously broadcast by the sender.
- Agreement: if a correct process delivers a message m, then all correct processes eventually deliver m.

The presence of faults in the system makes these properties more difficult to satisfy. A simple but invalid TRB protocol might have the sender broadcast the message to all processes, and have receiving processes deliver the message as soon as it is received. This protocol, however, does not satisfy agreement if faults can occur: if the sender crashes after sending the message to some processes, but before sending it to others, then the first set of processes may deliver the message while the second set delivers SF.
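
This scenario is easy to reproduce in a toy simulation. The sketch below is illustrative only: the sender crashes after reaching two of four receivers, and the remaining receivers eventually give up and deliver SF.

```python
def naive_broadcast(message, receivers, crash_after):
    """Sender sends in order but crashes after crash_after sends;
    each receiver delivers immediately on receipt."""
    delivered = {}
    for i, r in enumerate(receivers):
        if i == crash_after:
            break                       # the sender crashes here
        delivered[r] = message
    for r in receivers:
        delivered.setdefault(r, "SF")   # the rest give up and deliver SF
    return delivered

print(naive_broadcast("m", ["p1", "p2", "p3", "p4"], crash_after=2))
# {'p1': 'm', 'p2': 'm', 'p3': 'SF', 'p4': 'SF'} -- agreement is violated
```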

TRB is closely related, but not identical, to the fundamental distributed computing problem of consensus.
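
One direction of that relationship can be made concrete: given a consensus primitive and a perfect failure detector, TRB is solvable under crash faults. The single-machine simulation below is a hedged sketch of that textbook reduction; all names are illustrative, and the consensus step is modeled as a deterministic pick among the proposals.

```python
SF = "SENDER_FAULTY"

def simulate_trb(n, message, sender_crashes_after):
    # Step 1: the sender (process 0) sends DATA to processes in order,
    # possibly crashing partway through its send loop.
    inbox = {p: None for p in range(n)}
    for p in range(n):
        if p >= sender_crashes_after:
            break
        inbox[p] = message

    # Step 2: each process proposes what it saw -- the message, or SF once
    # the (perfect) failure detector reports that the sender crashed.
    proposals = {}
    for p in range(n):
        if inbox[p] is not None:
            proposals[p] = inbox[p]
        else:
            proposals[p] = SF          # failure detector suspected the sender

    # Step 3: a consensus step (modeled as deterministically picking the
    # proposal of the lowest-numbered process) makes every process deliver
    # the same value: the sender's message or SF.
    decision = proposals[min(proposals)]
    return {p: decision for p in range(n)}

print(simulate_trb(4, "m", sender_crashes_after=2))
# {0: 'm', 1: 'm', 2: 'm', 3: 'm'} -- all processes deliver the same value
```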

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation. Examples of applications of consensus include deciding whether to commit a transaction to a database, agreeing on the identity of a leader, state machine replication, and atomic broadcasts. Real-world applications include clock synchronization, PageRank, opinion formation, smart power grids, state estimation, control of UAVs, load balancing, and others.

Related Research Articles

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, and file transfer rely on TCP. Applications that do not require reliable data stream service may use the User Datagram Protocol (UDP), which provides a connectionless datagram service that emphasizes reduced latency over reliability.

In computer networking, the User Datagram Protocol (UDP) is one of the core members of the Internet protocol suite. The protocol was designed by David P. Reed in 1980 and formally defined in RFC 768. With UDP, computer applications can send messages, in this case referred to as datagrams, to other hosts on an Internet Protocol (IP) network. Prior communications are not required in order to set up communication channels or data paths.
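
The connectionless nature is visible at the socket level. For example, in Python a datagram can be sent with no prior handshake (the address and port below are placeholders):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # a UDP socket
sock.sendto(b"hello", ("127.0.0.1", 9999))  # fire-and-forget: no connection setup
```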

A commitment scheme is a cryptographic primitive that allows one to commit to a chosen value while keeping it hidden from others, with the ability to reveal the committed value later. Commitment schemes are designed so that a party cannot change the value or statement after they have committed to it: that is, commitment schemes are binding. Commitment schemes have important applications in a number of cryptographic protocols including secure coin flipping, zero-knowledge proofs, and secure computation.
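
As a rough illustration, a commitment can be sketched from a cryptographic hash. The fragment below assumes SHA-256 via Python's hashlib and is a toy, not a formally analyzed scheme; its hiding property rests on idealized assumptions about the hash.

```python
import hashlib, os

def commit(value: bytes):
    nonce = os.urandom(32)                    # randomness that hides the value
    digest = hashlib.sha256(nonce + value).hexdigest()
    return digest, nonce                      # publish digest; keep nonce secret

def verify(digest: str, nonce: bytes, value: bytes) -> bool:
    return hashlib.sha256(nonce + value).hexdigest() == digest

c, n = commit(b"heads")                       # coin-flipping style usage
assert verify(c, n, b"heads") and not verify(c, n, b"tails")
```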

Secure multi-party computation is a subfield of cryptography with the goal of creating methods for parties to jointly compute a function over their inputs while keeping those inputs private. Unlike traditional cryptographic tasks, where cryptography assures security and integrity of communication or storage and the adversary is outside the system of participants, the adversary in this model controls actual participants. These types of tasks started in the late 1970s with the work on mental poker, cryptographic work that simulates game playing and computational tasks over distances without requiring a trusted third party. Note that traditionally, cryptography was about concealing content, while this new type of computation and protocol is about concealing partial information about data while computing with the data from many sources, and correctly producing outputs.

A Byzantine fault is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals' Problem", developed to describe this condition, where actors must agree on a concerted strategy to avoid catastrophic system failure, but some of the actors are unreliable.

Selective Repeat ARQ/Selective Reject ARQ is a specific instance of the automatic repeat-request (ARQ) protocol used to solve the sequence-number dilemma in communications.

In distributed computing, the bully algorithm is a method for dynamically electing a coordinator or leader from a group of distributed computer processes. The process with the highest process ID number from amongst the non-failed processes is selected as the coordinator.
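
The selection rule itself is a one-liner; the toy fragment below (hypothetical names, with no ELECTION/COORDINATOR message exchange or timeouts) illustrates only that rule.

```python
def elect_coordinator(process_ids, failed):
    alive = [p for p in process_ids if p not in failed]
    return max(alive)     # the highest non-failed process ID becomes coordinator

assert elect_coordinator([1, 2, 3, 4, 5], failed={5}) == 4
```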

In computer science, Luby transform codes are the first class of practical fountain codes that are near-optimal erasure correcting codes. They were invented by Michael Luby in 1998 and published in 2002. Like some other fountain codes, LT codes depend on sparse bipartite graphs to trade reception overhead for encoding and decoding speed. The distinguishing characteristic of LT codes is in employing a particularly simple algorithm based on the exclusive or operation to encode and decode the message.
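
A toy version of that XOR mechanism fits in a few lines. The sketch below is a deliberate simplification: it draws symbol degrees uniformly instead of from the robust soliton distribution that real LT codes use, and it decodes with the standard peeling process.

```python
import random

def encode_symbol(blocks, rng):
    degree = rng.randint(1, len(blocks))              # toy: uniform degree
    idx = set(rng.sample(range(len(blocks)), degree))
    value = 0
    for i in idx:
        value ^= blocks[i]                            # XOR the chosen source blocks
    return idx, value

def peel_decode(symbols, k):
    symbols = [[set(idx), value] for idx, value in symbols]
    out = [None] * k
    changed = True
    while changed:
        changed = False
        for idx, value in symbols:
            if len(idx) == 1 and out[next(iter(idx))] is None:
                out[next(iter(idx))] = value          # degree-1 symbol reveals a block
                changed = True
        for sym in symbols:                           # XOR recovered blocks out of the rest
            for i in [i for i in sym[0] if out[i] is not None and len(sym[0]) > 1]:
                sym[0].discard(i)
                sym[1] ^= out[i]
    return out

rng = random.Random(42)
blocks = [0x5, 0x9, 0xC]
symbols = [encode_symbol(blocks, rng) for _ in range(8)]
print(peel_decode(symbols, 3))   # typically [5, 9, 12]; None marks unrecovered blocks
```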

In computer networking, a reliable protocol is a protocol which notifies the sender whether or not the delivery of data to intended recipients was successful. Reliability is a synonym for assurance, which is the term used by the ITU and ATM Forum.

A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors. Distributed algorithms are used in many varied application areas of distributed computing, such as telecommunications, scientific computing, distributed information processing, and real-time process control. Standard problems solved by distributed algorithms include leader election, consensus, distributed search, spanning tree generation, mutual exclusion, and resource allocation.

In computer science, state machine replication or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols.
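
The core idea, deterministic replicas applying the same commands in the same order, can be shown in miniature (the names here are illustrative):

```python
class CounterReplica:
    """A deterministic state machine: same log in, same state out."""
    def __init__(self):
        self.state = 0
    def apply(self, command):
        op, arg = command
        if op == "add":
            self.state += arg

log = [("add", 2), ("add", 3)]               # the agreed-upon command order
replicas = [CounterReplica() for _ in range(3)]
for cmd in log:
    for r in replicas:
        r.apply(cmd)
assert all(r.state == 5 for r in replicas)   # every replica ends identical
```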

Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task is begun, all network nodes are either unaware which node will serve as the "leader" of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.

In cryptography, a secret sharing scheme is publicly verifiable (PVSS) if it is a verifiable secret sharing scheme and if any party involved can verify the validity of the shares distributed by the dealer.

In verifiable secret sharing (VSS) the object is to resist malicious players, such as
(i) a dealer sending incorrect shares to some or all of the participants, and
(ii) participants submitting incorrect shares during the reconstruction protocol, cf. [CGMA85].
In publicly verifiable secret sharing (PVSS), as introduced by Stadler [Sta96], it is an explicit goal that not just the participants can verify their own shares, but that anybody can verify that the participants received correct shares. Hence, it is explicitly required that the validity of the shares can be verified publicly.
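
(P)VSS schemes layer verification machinery on top of a plain secret-sharing scheme. The sketch below shows only that underlying layer, a toy Shamir t-of-n sharing over a small prime field; the commitments and proofs that make the shares (publicly) verifiable are omitted entirely.

```python
import random

P = 2**31 - 1    # a small prime field, fine for a toy

def share(secret, t, n, rng=random):
    coeffs = [secret] + [rng.randrange(P) for _ in range(t - 1)]
    f = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]   # one point per participant

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers f(0) = secret
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = share(1234, t=3, n=5)
assert reconstruct(shares[:3]) == 1234    # any t shares suffice
```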

Virtual synchrony is an interprocess message passing technology. Virtual synchrony systems allow programs running in a network to organize themselves into process groups, and to send messages to groups. Each message is delivered to all the group members, in the identical order, and this is true even when two messages are transmitted simultaneously by different senders. Application design and implementation is greatly simplified by this property: every group member sees the same events in the same order.

In fault-tolerant distributed computing, an atomic broadcast or total order broadcast is a broadcast where all correct processes in a system of multiple processes receive the same set of messages in the same order; that is, the same sequence of messages. The broadcast is termed "atomic" because it either eventually completes correctly at all participants, or all participants abort without side effects. Atomic broadcasts are an important distributed computing primitive.
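
One simple route to a total order, shown below under the strong (and unrealistic) assumption of a never-failing sequencer, is to have one process stamp each message with a sequence number while receivers deliver strictly in stamp order; surviving sequencer failure is where the real difficulty of atomic broadcast lies.

```python
class Sequencer:
    def __init__(self):
        self.next_seq = 0
    def stamp(self, msg):
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped

class Receiver:
    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.log = []                            # the totally ordered delivery log
    def on_receive(self, seq, msg):
        self.pending[seq] = msg
        while self.expected in self.pending:     # deliver strictly in stamp order
            self.log.append(self.pending.pop(self.expected))
            self.expected += 1

seq = Sequencer()
r = Receiver()
for stamped in reversed([seq.stamp("a"), seq.stamp("b"), seq.stamp("c")]):
    r.on_receive(*stamped)                       # messages arrive out of order
print(r.log)   # ['a', 'b', 'c'] -- the same order at every receiver
```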

The framework of universal composability (UC) is a general-purpose model for the analysis of cryptographic protocols. It guarantees very strong security properties: protocols remain secure even if arbitrarily composed with other instances of the same or other protocols. Security is defined in the sense of protocol emulation. Intuitively, a protocol is said to emulate another one if no environment (observer) can distinguish the executions. The notion of security is derived by implication: if a protocol emulates a protocol that is secure by definition, such that no environment can tell the emulation apart from an execution of the original, then the emulating protocol is as secure as the original.

Byzantine fault tolerant protocols are algorithms that are robust to arbitrary types of failures in distributed algorithms. With the advent and popularity of the Internet, there is a need to develop algorithms that do not require any centralized control and that have some guarantee of always working correctly. Byzantine agreement protocols are an essential part of this task. A quantum version of the Byzantine protocol works in constant time.

Gbcast is a reliable multicast protocol that provides ordered, fault-tolerant (all-or-none) message delivery in a group of receivers within a network of machines that experience crash failure. The protocol is capable of solving Consensus in a network of unreliable processors, and can be used to implement state machine replication. Gbcast can be used in a standalone manner, or can support the virtual synchrony execution model, in which case Gbcast is normally used for group membership management while other, faster, protocols are often favored for routine communication tasks.

References

  1. Alvisi, Lorenzo (2006). "Consensus and Reliable Broadcast" (PDF). Retrieved 2006-05-21.