SWIM Protocol

Last updated March 28, 2024

SWIM "Outsourced Heartbeats" SwimOutsourced.png — SWIM "Outsourced Heartbeats"

The Scalable Weakly Consistent Infection-style Process Group Membership (SWIM) Protocol is a group membership protocol based on "outsourced heartbeats"^[1] used in distributed systems, first introduced by Indranil Gupta in 2001.^[2]^[3] It is a hybrid algorithm which combines failure detection with group membership dissemination.

Protocol

The protocol has two components, the Failure Detector Component and the Dissemination Component.

The Failure Detector Component functions as follows:

Every T' time units, each node ( $N_{1}$ ) sends a ping to random other node ( $N_{2}$ ) in its membership list.
If $N_{1}$ receives a response from $N_{2}$ , $N_{2}$ is decided to be healthy and N1 updates its "last heard from" timestamp for $N_{2}$ to be the current time.
If $N_{1}$ does not receive a response, $N_{1}$ contacts k other nodes on its list ( $\{N_{3},...,N_{3+k}\}$ ), and requests that they ping $N_{2}$ .
If after T' units of time: if no successful response is received, $N_{1}$ marks $N_{2}$ as failed.

The Dissemination Component functions as follows:

Upon $N_{1}$ detecting a failed node $N_{2}$ , $N_{1}$ sends a multicast message to the rest of the nodes in its membership list, with information about the failed node.
Voluntary requests for a node to enter/leave the group are also sent via multicast.

Properties

The protocol provides the following guarantees:

Strong Completeness: Full completeness is guaranteed (e.g. the crash-failure of any node in the group is eventually detected by all live nodes).
Detection Time: The expected value of detection time (from node failure to detection) is $T'{\dot {}}{\frac {1}{1-e^{-q_{f}}}}$ , where $T'$ is the length of the protocol period, and $q_{f}$ is the fraction of non-faulty nodes in the group.^[3]

Extensions

The original SWIM paper lists the following extensions to make the protocol more robust:^[2]

Suspicion: Nodes that are unresponsive to ping messages are not initially marked as failed. Instead, they are marked as "suspicious"; nodes which discover a "suspicious" node still send a multicast to all other nodes including this mechanism. If a "suspicious" node responds to a ping before some time-out threshold, an "alive" message is sent via multicast to remove the "suspicious" label from the node.
Infection-Style Dissemination: Instead of propagating node failure information via multicast, protocol messages are piggybacked on the ping messages used to determine node liveness. This is equivalent to gossip dissemination.
Round-Robin Probe Target Selection: Instead of randomly picking a node to probe during each protocol time step, the protocol is modified so that each node performs a round-robin selection of probe target. This bounds the worst-case detection time of the protocol, without degrading the average detection time.

Related Research Articles

In chemistry, the mole fraction or molar fraction, also called mole proportion or molar proportion, is a quantity defined as the ratio between the amount of a constituent substance, n_i, and the total amount of all constituents in a mixture, n_tot :

A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT, first proposed by Nasir Ahmed in 1972, is a widely used transformation technique in signal processing and data compression. It is used in most digital media, including digital images, digital video, digital audio, digital television, digital radio, and speech coding. DCTs are also important to numerous other applications in science and engineering, such as digital signal processing, telecommunication devices, reducing network bandwidth usage, and spectral methods for the numerical solution of partial differential equations.

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

In electrical engineering, the Y-Δ transform, also written wye-delta and also known by many other names, is a mathematical technique to simplify the analysis of an electrical network. The name derives from the shapes of the circuit diagrams, which look respectively like the letter Y and the Greek capital letter Δ. This circuit transformation theory was published by Arthur Edwin Kennelly in 1899. It is widely used in analysis of three-phase electric power circuits.

<span class="mw-page-title-main">Linking number</span> Numerical invariant that describes the linking of two closed curves in three-dimensional space

In mathematics, the linking number is a numerical invariant that describes the linking of two closed curves in three-dimensional space. Intuitively, the linking number represents the number of times that each curve winds around the other. In Euclidean space, the linking number is always an integer, but may be positive or negative depending on the orientation of the two curves.

In distributed computing, the bully algorithm is a method for dynamically electing a coordinator or leader from a group of distributed computer processes. The process with the highest process ID number from amongst the non-failed processes is selected as the coordinator.

<span class="mw-page-title-main">Broadcasting (networking)</span> Network messaging to multiple recipients simultaneously

In computer networking, telecommunication and information theory, broadcasting is a method of transferring a message to all recipients simultaneously. Broadcasting can be performed as a high-level operation in a program, for example, broadcasting in Message Passing Interface, or it may be a low-level networking operation, for example broadcasting on Ethernet.

Pastry is an overlay network and routing network for the implementation of a distributed hash table (DHT) similar to Chord. The key–value pairs are stored in a redundant peer-to-peer network of connected Internet hosts. The protocol is bootstrapped by supplying it with the IP address of a peer already in the network and from then on via the routing table which is dynamically built and repaired. It is claimed that because of its redundant and decentralized nature there is no single point of failure and any single node can leave the network at any time without warning and with little or no chance of data loss. The protocol is also capable of using a routing metric supplied by an outside program, such as ping or traceroute, to determine the best routes to store in its routing table.

Internet Control Message Protocol version 6 (ICMPv6) is the implementation of the Internet Control Message Protocol (ICMP) for Internet Protocol version 6 (IPv6). ICMPv6 is an integral part of IPv6 and performs error reporting and diagnostic functions.

A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts. Real-world applications often requiring consensus include cloud computing, clock synchronization, PageRank, opinion formation, smart power grids, state estimation, control of UAVs, load balancing, blockchain, and others.

In mathematics, the Bianchi classification provides a list of all real 3-dimensional Lie algebras. The classification contains 11 classes, 9 of which contain a single Lie algebra and two of which contain a continuum-sized family of Lie algebras. The classification is important in geometry and physics, because the associated Lie groups serve as symmetry groups of 3-dimensional Riemannian manifolds. It is named for Luigi Bianchi, who worked it out in 1898.

JGroups is a library for reliable one-to-one or one-to-many communication written in the Java language.

In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task has begun, all network nodes are either unaware which node will serve as the "leader" of the task, or unable to communicate with the current coordinator. After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.

A gossip protocol or epidemic protocol is a procedure or process of computer peer-to-peer communication that is based on the way epidemics spread. Some distributed systems use peer-to-peer gossip to ensure that data is disseminated to all members of a group. Some ad-hoc networks have no central registry and the only way to spread common data is to rely on each member to pass it along to their neighbors.

In computer vision, maximally stable extremal regions (MSER) are used as a method of blob detection in images. This technique was proposed by Matas et al. to find correspondences between image elements from two images with different viewpoints. This method of extracting a comprehensive number of corresponding image elements contributes to the wide-baseline matching, and it has led to better stereo matching and object recognition algorithms.

In mathematics, Tucker decomposition decomposes a tensor into a set of matrices and one small core tensor. It is named after Ledyard R. Tucker although it goes back to Hitchcock in 1927. Initially described as a three-mode extension of factor analysis and principal component analysis it may actually be generalized to higher mode analysis, which is also called higher-order singular value decomposition (HOSVD).

IEEE 802.1aq is an amendment to the IEEE 802.1Q networking standard which adds support for Shortest Path Bridging (SPB). This technology is intended to simplify the creation and configuration of Ethernet networks while enabling multipath routing.

In a distributed computing system, a failure detector is a computer application or a subsystem that is responsible for the detection of node failures or crashes. Failure detectors were first introduced in 1996 by Chandra and Toueg in their book Unreliable Failure Detectors for Reliable Distributed Systems. The book depicts the failure detector as a tool to improve consensus and atomic broadcast in the distributed system. In other words, failure detectors seek errors in the process, and the system will maintain a level of reliability. In practice, after failure detectors spot crashes, the system will ban the processes that are making mistakes to prevent any further serious crashes or errors.

Gbcast is a reliable multicast protocol that provides ordered, fault-tolerant (all-or-none) message delivery in a group of receivers within a network of machines that experience crash failure. The protocol is capable of solving Consensus in a network of unreliable processors, and can be used to implement state machine replication. Gbcast can be used in a standalone manner, or can support the virtual synchrony execution model, in which case Gbcast is normally used for group membership management while other, faster, protocols are often favored for routine communication tasks.

In computer science, a heartbeat is a periodic signal generated by hardware or software to indicate normal operation or to synchronize other parts of a computer system. Heartbeat mechanism is one of the common techniques in mission critical systems for providing high availability and fault tolerance of network services by detecting the network or systems failures of nodes or daemons which belongs to a network cluster—administered by a master server—for the purpose of automatic adaptation and rebalancing of the system by using the remaining redundant nodes on the cluster to take over the load of failed nodes for providing constant services. Usually a heartbeat is sent between machines at a regular interval in the order of seconds; a heartbeat message. If the endpoint does not receive a heartbeat for a time—usually a few heartbeat intervals—the machine that should have sent the heartbeat is assumed to have failed. Heartbeat messages are typically sent non-stop on a periodic or recurring basis from the originator's start-up until the originator's shutdown. When the destination identifies a lack of heartbeat messages during an anticipated arrival period, the destination may determine that the originator has failed, shutdown, or is generally no longer available.

References

↑ Petrov, Alex (2019). Database Internals. O'Reilly Media.
1 2 Gupta, Indranil; Chandra, Tushar D.; Goldszmidt, Germán S. (August 1, 2001). "On scalable and efficient distributed failure detectors". Proceedings of the twentieth annual ACM symposium on Principles of distributed computing. PODC '01. Newport, Rhode Island, US: Association for Computing Machinery. pp. 170–179. doi:10.1145/383962.384010. ISBN 978-1-58113-383-7. S2CID 216594.
1 2 Das, A.; Gupta, I.; Motivala, A. (June 23, 2002). "SWIM: Scalable weakly-consistent infection-style process group membership protocol". Proceedings International Conference on Dependable Systems and Networks. pp. 303–312. doi:10.1109/DSN.2002.1028914. ISBN 0-7695-1597-5. S2CID 11094028.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Petrov, Alex (2019). Database Internals. O'Reilly Media.

[:0-2] 1 2 Gupta, Indranil; Chandra, Tushar D.; Goldszmidt, Germán S. (August 1, 2001). "On scalable and efficient distributed failure detectors". Proceedings of the twentieth annual ACM symposium on Principles of distributed computing. PODC '01. Newport, Rhode Island, US: Association for Computing Machinery. pp. 170–179. doi:10.1145/383962.384010. ISBN 978-1-58113-383-7. S2CID 216594.

[:1-3] 1 2 Das, A.; Gupta, I.; Motivala, A. (June 23, 2002). "SWIM: Scalable weakly-consistent infection-style process group membership protocol". Proceedings International Conference on Dependable Systems and Networks. pp. 303–312. doi:10.1109/DSN.2002.1028914. ISBN 0-7695-1597-5. S2CID 11094028.

[1]

[2]

[3]

v t e Computer science
Note: This template roughly follows the 2012 ACM Computing Classification System.
Hardware	Printed circuit board Peripheral Integrated circuit Very Large Scale Integration Systems on Chip (SoCs) Energy consumption (Green computing) Electronic design automation Hardware acceleration
Computer systems organization	Computer architecture Embedded system Real-time computing Dependability
Networks	Network architecture Network protocol Network components Network scheduler Network performance evaluation Network service
Software organization	Interpreter Middleware Virtual machine Operating system Software quality
Software notations and tools	Programming paradigm Programming language Compiler Domain-specific language Modeling language Software framework Integrated development environment Software configuration management Software library Software repository
Software development	Control variable Software development process Requirements analysis Software design Software construction Software deployment Software engineering Software maintenance Programming team Open-source model
Theory of computation	Model of computation Formal language Automata theory Computability theory Computational complexity theory Logic Semantics
Algorithms	Algorithm design Analysis of algorithms Algorithmic efficiency Randomized algorithm Computational geometry
Mathematics of computing	Discrete mathematics Probability Statistics Mathematical software Information theory Mathematical analysis Numerical analysis Theoretical computer science
Information systems	Database management system Information storage systems Enterprise information system Social information systems Geographic information system Decision support system Process control system Multimedia information system Data mining Digital library Computing platform Digital marketing World Wide Web Information retrieval
Security	Cryptography Formal methods Security hacker Security services Intrusion detection system Hardware security Network security Information security Application security
Human–computer interaction	Interaction design Social computing Ubiquitous computing Visualization Accessibility
Concurrency	Concurrent computing Parallel computing Distributed computing Multithreading Multiprocessing
Artificial intelligence	Natural language processing Knowledge representation and reasoning Computer vision Automated planning and scheduling Search methodology Control method Philosophy of artificial intelligence Distributed artificial intelligence
Machine learning	Supervised learning Unsupervised learning Reinforcement learning Multi-task learning Cross-validation
Graphics	Animation Rendering Photograph manipulation Graphics processing unit Mixed reality Virtual reality Image compression Solid modeling
Applied computing	Quantum Computing E-commerce Enterprise software Computational mathematics Computational physics Computational chemistry Computational biology Computational social science Computational engineering Differentiable computing Computational healthcare Digital art Electronic publishing Cyberwarfare Electronic voting Video games Word processing Operations research Educational technology Document management
Category Outline WikiProject Commons