Reliable multicast

Last updated January 06, 2025

A reliable multicast is any computer networking protocol that provides a reliable sequence of packets to multiple recipients simultaneously, making it suitable for applications such as multi-receiver file transfer.

Overview

Multicast is a network addressing method for the delivery of information to a group of destinations simultaneously using the most efficient strategy to deliver the messages over each link of the network only once, creating copies only when the links to the multiple destinations split (typically network switches and routers). However, like the User Datagram Protocol, multicast does not guarantee the delivery of a message stream. Messages may be dropped, delivered multiple times, or delivered out of order. A reliable multicast protocol adds the ability for receivers to detect lost and/or out-of-order messages and take corrective action (similar in principle to TCP), resulting in a gap-free, in-order message stream.

Reliability

The exact meaning of reliability depends on the specific protocol instance. A minimal definition of reliable multicast is eventual delivery of all the data to all the group members, without enforcing any particular delivery order.^[1] However, not all reliable multicast protocols ensure this level of reliability; many of them trade efficiency for reliability, in different ways. For example, while TCP makes the sender responsible for transmission reliability, multicast NAK-based protocols shift the responsibility to receivers: the sender never knows for sure that all the receivers have in fact received all the data.^[2] RFC- 2887 explores the design space for bulk data transfer, with a brief discussion on the various issues and some hints at the possible different meanings of reliable.

Reliable Group Data Delivery

Reliable Group Data Delivery (RGDD) is a form of multicasting where an object is to be moved from a single source to a fixed set of receivers known before transmission begins.^[3]^[4] A variety of applications may need such delivery: Hadoop Distributed File System (HDFS) replicates any chunk of data two additional times to specific servers, VM replication to multiple servers may be required for scale out of applications and data replication to multiple servers may be necessary for load balancing by allowing multiple servers to serve the same data from their local cached copies. Such delivery is frequent within datacenters due to plethora of servers communicating while running highly distributed applications.

RGDD may also occur across datacenters and is sometimes referred to as inter-datacenter Point to Multipoint (P2MP) Transfers.^[5] Such transfers deliver huge volumes of data from one datacenter to multiple datacenters for various applications: search engines distribute search index updates periodically (e.g. every 24 hours), social media applications push new content to many cache locations across the world (e.g. YouTube and Facebook), and backup services make several geographically dispersed copies for increased fault tolerance. To maximize bandwidth utilization and reduce completion times of bulk transfers, a variety of techniques have been proposed for selection of multicast forwarding trees.^[5]^[6]

Virtual synchrony

Modern systems like the Spread Toolkit, Quicksilver, and Corosync can achieve data rates of 10,000 multicasts per second or more, and can scale to large networks with huge numbers of groups or processes.

Most distributed computing platforms support one or more of these models. For example, the widely supported object-oriented CORBA platforms all support transactions and some CORBA products support transactional replication in the one-copy-serializability model. The "CORBA Fault Tolerant Objects standard" is based on the virtual synchrony model.^[7] Virtual synchrony was also used in developing the New York Stock Exchange fault-tolerance architecture, the French Air Traffic Control System, the US Navy AEGIS system, IBM's Business Process replication architecture for WebSphere and Microsoft's Windows Clustering architecture for Windows Longhorn enterprise servers.^[8]

Systems that support virtual synchrony

Virtual synchrony was first supported by the Cornell University and was called the "Isis Toolkit".^[9] Cornell's most current version, Vsync was released in 2013 under the name Isis2 (the name was changed from Isis2 to Vsync in 2015 in the wake of a terrorist attack in Paris by an extremist organization called ISIS), with periodic updates and revisions since that time. The most current stable release is V2.2.2020; it was released on November 14, 2015; the V2.2.2048 release is currently available in Beta form.^[10] Vsync aims at the massive data centers that support cloud computing.

Other such systems include the Horus system^[11] the Transis system, the Totem system, an IBM system called Phoenix, a distributed security key management system called Rampart, the "Ensemble system",^[12] the Quicksilver system, "The OpenAIS project",^[13] its derivative the Corosync Cluster Engine and a number of products (including the IBM and Microsoft ones mentioned earlier).

Other existing or proposed protocols

Library support

JGroups (Java API)
Spread: C/C++ API, Java API
RMF (C# API)
hmbdc open source (headers only) C++ middleware, ultra-low latency/high throughput, scalable and reliable inter-thread, IPC and network messaging

Related Research Articles

<span class="mw-page-title-main">Multicast</span> Computer networking technique

In computer networking, multicast is a type of group communication where data transmission is addressed to a group of destination computers simultaneously. Multicast can be one-to-many or many-to-many distribution. Multicast differs from physical layer point-to-multipoint communication.

In distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to execute in a different address space, which is written as if it were a normal (local) procedure call, without the programmer explicitly writing the details for the remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. This is a form of client–server interaction, typically implemented via a request–response message passing system. In the object-oriented programming paradigm, RPCs are represented by remote method invocation (RMI). The RPC model implies a level of location transparency, namely that calling procedures are largely the same whether they are local or remote, but usually, they are not identical, so local calls can be distinguished from remote calls. Remote calls are usually orders of magnitude slower and less reliable than local calls, so distinguishing them is important.

In computer networking, the User Datagram Protocol (UDP) is one of the core communication protocols of the Internet protocol suite used to send messages to other hosts on an Internet Protocol (IP) network. Within an IP network, UDP does not require prior communication to set up communication channels or data paths.

Message-oriented middleware (MOM) is software or hardware infrastructure supporting sending and receiving messages between distributed systems. Message-oriented middleware is in contrast to streaming-oriented middleware where data is communicated as a sequence of bytes with no explicit message boundaries. Note that streaming protocols are almost always built above protocols using discrete messages such as frames (Ethernet), datagrams (UDP), packets (IP), cells (ATM), et al.

A content delivery network or content distribution network (CDN) is a geographically distributed network of proxy servers and their data centers. The goal is to provide high availability and performance ("speed") by distributing the service spatially relative to end users. CDNs came into existence in the late 1990s as a means for alleviating the performance bottlenecks of the Internet as the Internet was starting to become a mission-critical medium for people and enterprises. Since then, CDNs have grown to serve a large portion of the Internet content today, including web objects, downloadable objects, applications, live streaming media, on-demand streaming media, and social media sites.

An overlay network is a computer network that is layered on top of another network. The concept of overlay networking is distinct from the traditional model of OSI layered networks, and almost always assumes that the underlay network is an IP network of some kind.

In software architecture, publish–subscribe is a messaging pattern where publishers categorize messages into classes that are received by subscribers. This is contrasted to the typical messaging pattern model where publishers send messages directly to subscribers.

Push technology, also known as server Push, refers to a communication method, where the communication is initiated by a server rather than a client. This approach is different from the "pull" method where the communication is initiated by a client.

IP multicast is a method of sending Internet Protocol (IP) datagrams to a group of interested receivers in a single transmission. It is the IP-specific form of multicast and is used for streaming media and other network applications. It uses specially reserved multicast address blocks in IPv4 and IPv6.

In computer networking, a reliable protocol is a communication protocol that notifies the sender whether or not the delivery of data to intended recipients was successful. Reliability is a synonym for assurance, which is the term used by the ITU and ATM Forum, and leads to fault-tolerant messaging.

Replication in computing refers to maintaining multiple copies of data, processes, or resources to ensure consistency across redundant components. This fundamental technique spans databases, file systems, and distributed systems, serving to improve availability, fault-tolerance, accessibility, and performance. Through replication, systems can continue operating when components fail (failover), serve requests from geographically distributed locations, and balance load across multiple machines. The challenge lies in maintaining consistency between replicas while managing the fundamental tradeoffs between data consistency, system availability, and network partition tolerance – constraints known as the CAP theorem.

In computer science, state machine replication (SMR) or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols.

Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communications may experience failures.

A message broker is an intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver. Message brokers are elements in telecommunication or computer networks where software applications communicate by exchanging formally-defined messages. Message brokers are a building block of message-oriented middleware (MOM) but are typically not a replacement for traditional middleware like MOM and remote procedure call (RPC).

A gossip protocol or epidemic protocol is a procedure or process of computer peer-to-peer communication that is based on the way epidemics spread. Some distributed systems use peer-to-peer gossip to ensure that data is disseminated to all members of a group. Some ad-hoc networks have no central registry and the only way to spread common data is to rely on each member to pass it along to their neighbors.

Live distributed object refers to a running instance of a distributed multi-party protocol, viewed from the object-oriented perspective, as an entity that has a distinct identity, may encapsulate internal state and threads of execution, and that exhibits a well-defined externally visible behavior.

Software-defined networking (SDN) is an approach to network management that uses abstraction to enable dynamic and programmatically efficient network configuration to create grouping and segmentation while improving network performance and monitoring in a manner more akin to cloud computing than to traditional network management. SDN is meant to improve the static architecture of traditional networks and may be employed to centralize network intelligence in one network component by disassociating the forwarding process of network packets from the routing process. The control plane consists of one or more controllers, which are considered the brains of the SDN network, where the whole intelligence is incorporated. However, centralization has certain drawbacks related to security, scalability and elasticity.

Gbcast is a reliable multicast protocol that provides ordered, fault-tolerant (all-or-none) message delivery in a group of receivers within a network of machines that experience crash failure. The protocol is capable of solving Consensus in a network of unreliable processors, and can be used to implement state machine replication. Gbcast can be used in a standalone manner, or can support the virtual synchrony execution model, in which case Gbcast is normally used for group membership management while other, faster, protocols are often favored for routine communication tasks.

Kenneth P. Birman is a professor in the Department of Computer Science at Cornell University. He currently holds the N. Rama Rao Chair in Computer Science.

The Vsync software library is a BSD-licensed open source library written in C# for the .NET platform, providing a wide variety of primitives for fault-tolerant distributed computing, including: state machine replication, virtual synchrony process groups, atomic broadcast with several levels of ordering and durability, a distributed lock manager, persistent replicated data, a distributed key-value store, and scalable aggregation. The system implements the virtual synchrony execution model, and includes an implementation of Leslie Lamport's Paxos Protocol.

References

↑ Floyd, S.; Jacobson, V.; Liu, C. -G.; McCanne, S.; Zhang, L. (December 1997). "A reliable multicast framework for light-weight sessions and application level framing". IEEE/ACM Transactions on Networking. 5 (6): 784–803. doi:10.1109/90.650139. S2CID 221634489.
↑ Diot, C.; Dabbous, W.; Crowcroft, J. (April 1997). "Multipoint communication: A survey of protocols, functions, and mechanisms" (PDF). IEEE Journal on Selected Areas in Communications. 15 (3): 277–290. doi:10.1109/49.564128.
↑ C. Guo; et al. (November 1, 2012). "Datacast: A Scalable and Efficient Reliable Group Data Delivery Service For Data Centers". ACM. Retrieved July 26, 2017.
↑ T. Zhu; et al. (Oct 18, 2016). "MCTCP: Congestion-aware and robust multicast TCP in Software-Defined networks". 2016 IEEE/ACM 24th International Symposium on Quality of Service (IWQoS). IEEE. pp. 1–10. doi:10.1109/IWQoS.2016.7590433. ISBN 978-1-5090-2634-0. S2CID 28159768.
1 2 M. Noormohammadpour; et al. (July 10, 2017). "DCCast: Efficient Point to Multipoint Transfers Across Datacenters". USENIX. Retrieved July 26, 2017.
↑ M. Noormohammadpour; et al. (2018). "QuickCast: Fast and Efficient Inter-Datacenter Transfers using Forwarding Tree Cohorts" . Retrieved January 23, 2018.
↑ "Catalog of CORBA/IIOP Specifications". 2004-10-09. Archived from the original on 2004-10-09. Retrieved 2024-09-19.
↑ K. P. Birman (July 1999). "A Review of Experiences with Reliable Multicast". Software: Practice and Experience. 29 (9): 741–774. doi:10.1002/(SICI)1097-024X(19990725)29:9<741::AID-SPE259>3.0.CO;2-I. hdl: 1813/7380 .
↑ "Isis Toolkit"
↑ "Vsync Cloud Computing Library".
↑ "Horus system"
↑ "Ensemble system"
↑ "The OpenAIS project"