Transparent Inter-process Communication

Transparent Inter-process Communication (TIPC) is an inter-process communication (IPC) service in Linux designed for cluster-wide operation. It is sometimes presented as Cluster Domain Sockets, in contrast to the well-known Unix Domain Socket service, which works only within a single kernel.

Features

Some features of TIPC:

Figure: Examples of Service Addressing and Tracking

Implementations

The TIPC protocol is available as a module in the mainstream Linux kernel, and hence in most Linux distributions. The TIPC project also provides open source implementations of the protocol for other operating systems including Wind River's VxWorks and Sun Microsystems' Solaris. TIPC applications are typically written in C (or C++) and utilize sockets of the AF_TIPC address family. Support for Go, D, Perl, Python, and Ruby is also available.

Service addressing

A TIPC application may use three types of addresses.

Figure: TIPC Service Addressing

A socket can be bound to several different service addresses or ranges, just as different sockets can be bound to the same service address or range. Bindings are also qualified with a visibility scope, i.e., node local or cluster global visibility.
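The binding rules above can be sketched with Python's socket module, which represents AF_TIPC addresses as tuples of the form (addr_type, v1, v2, v3[, scope]). This is a minimal sketch, not a reference implementation: the service type 18888, the instance ranges, and the helper names service_address, service_range and bind_all are hypothetical, and the numeric fallbacks are assumed to mirror the constants in linux/tipc.h.

```python
import socket

# Address-type and scope constants. Python's socket module exposes them on
# Linux builds with TIPC support; the numeric fallbacks are assumed to mirror
# <linux/tipc.h> and are only used when the attributes are absent.
TIPC_ADDR_NAMESEQ = getattr(socket, "TIPC_ADDR_NAMESEQ", 1)  # service range
TIPC_ADDR_NAME = getattr(socket, "TIPC_ADDR_NAME", 2)        # service address
TIPC_CLUSTER_SCOPE = getattr(socket, "TIPC_CLUSTER_SCOPE", 2)
TIPC_NODE_SCOPE = getattr(socket, "TIPC_NODE_SCOPE", 3)


def service_address(service_type, instance):
    """Return the AF_TIPC tuple for a single {type, instance} service address."""
    return (TIPC_ADDR_NAME, service_type, instance, 0)


def service_range(service_type, lower, upper, scope=TIPC_CLUSTER_SCOPE):
    """Return the AF_TIPC tuple for a {type, lower, upper} range with a scope."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return (TIPC_ADDR_NAMESEQ, service_type, lower, upper, scope)


def bind_all(bindings):
    """Bind one socket to several ranges; return None if TIPC is unavailable."""
    try:
        sock = socket.socket(socket.AF_TIPC, socket.SOCK_RDM)
    except (AttributeError, OSError):
        # No AF_TIPC in this Python build, or the tipc kernel module is not loaded.
        return None
    try:
        for binding in bindings:
            sock.bind(binding)
    except OSError:
        sock.close()
        return None
    return sock


# Hypothetical service type 18888: one cluster-visible and one node-local range
# bound to the same socket.
sock = bind_all([
    service_range(18888, 0, 99),
    service_range(18888, 100, 199, scope=TIPC_NODE_SCOPE),
])
print("bound" if sock else "TIPC not available on this host")
```

A second process calling bind_all with the same ranges would illustrate the other half of the statement: several sockets may share one service address or range.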

Datagram messaging

Datagram messages are discrete data units between 1 and 66,000 bytes in length, transmitted between non-connected sockets. Like UDP datagrams, TIPC datagrams are not guaranteed to reach their destination, but their chances of being delivered are much better than with UDP. Because of the link layer delivery guarantee, the only limiting factor for datagram delivery is the socket receive buffer size. The sender can also increase the chances of success by giving its socket an appropriate delivery importance priority. Datagrams can be transmitted in three different ways.
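The size limit and the importance setting described above can be illustrated as follows. This is a hedged sketch: the helper names check_payload and send_datagram are hypothetical, SOCK_RDM is chosen here as an assumption about the socket type, and the default importance value 2 is assumed to correspond to TIPC_HIGH_IMPORTANCE in linux/tipc.h.

```python
import socket

MAX_DATAGRAM = 66000  # upper size limit quoted in the text


def check_payload(payload):
    """Reject payloads outside the 1..66,000 byte datagram range."""
    if not 1 <= len(payload) <= MAX_DATAGRAM:
        raise ValueError("TIPC datagrams must be 1 to 66,000 bytes long")
    return payload


def send_datagram(payload, service_type, instance, importance=2):
    """Send one datagram to a service address.

    Returns True if the message was handed to TIPC, False if TIPC is not
    available or the send failed (for example, nothing is bound to the address).
    importance=2 is assumed to equal TIPC_HIGH_IMPORTANCE.
    """
    check_payload(payload)
    try:
        with socket.socket(socket.AF_TIPC, socket.SOCK_RDM) as sock:
            # Raise the delivery importance of everything sent on this socket.
            sock.setsockopt(socket.SOL_TIPC, socket.TIPC_IMPORTANCE, importance)
            sock.sendto(payload, (socket.TIPC_ADDR_NAME, service_type, instance, 0))
            return True
    except (AttributeError, OSError):
        return False


# Hypothetical service address {18888, 17}.
print(send_datagram(b"hello", 18888, 17))
```

Validating the size before touching the socket keeps the failure mode explicit; the kernel would otherwise reject an oversized message at send time.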

Connection-oriented messaging

Connections can be established the same way as with TCP, by means of accept() and connect() on SOCK_STREAM sockets. However, in TIPC the client and server use service addresses or ranges instead of port numbers and IP addresses. TIPC also provides two alternatives to this standard setup scenario.

The most distinguishing property of TIPC connections is still their ability to react promptly to loss of contact with the peer socket, without resorting to active neighbor heart-beating.

Group messaging

Group messaging is similar to datagram messaging, as described above, but with end-to-end flow control, and hence with delivery guarantee. There are however a few notable differences.

When joining a group, a member may indicate if it wants to receive join or leave events for other members of the group. This feature leverages the service tracking feature, and the group member will receive the events in the member socket proper.

Service tracking

An application accesses the tracking service by opening a connection to the TIPC internal topology server, using a reserved service address. It can then send one or more service subscription messages to the tracking service, indicating the service address or range it wants to track. In return, the topology service sends service event messages back to the application whenever matching addresses are bound or unbound by sockets within the cluster. A service event contains the found matching service range, plus the port and node number of the bound/unbound socket. There are two special cases of service tracking:

Although most service subscriptions are directed towards the node-local topology server, it is possible to establish connections to other nodes' servers and observe their local bindings. This can be useful if, for example, a connectivity subscriber wants to build a matrix of all connectivity across the cluster, not limited to what can be seen from the local node.
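The subscription and event exchange described in this section can be sketched by packing and unpacking the two message formats. The layouts and constant values below are assumptions taken to mirror struct tipc_subscr and struct tipc_event in linux/tipc.h, native byte order is assumed, and the helper names pack_subscription, unpack_event and subscribe are hypothetical.

```python
import socket
import struct

# Values assumed to mirror <linux/tipc.h>.
TIPC_TOP_SRV = 1                # reserved service type of the topology server
TIPC_SUB_PORTS = 0x01           # one event per matching binding
TIPC_SUB_SERVICE = 0x02         # events only for first bind / last unbind
TIPC_WAIT_FOREVER = 0xFFFFFFFF  # subscription never times out
EVENT_NAMES = {1: "published", 2: "withdrawn", 3: "timeout"}

# Subscription: type, lower, upper, timeout, filter, 8-byte user handle.
SUBSCR = struct.Struct("=5I8s")
# Event: event, found_lower, found_upper, port, node, then the subscription.
EVENT = struct.Struct("=5I5I8s")


def pack_subscription(service_type, lower, upper,
                      timeout=TIPC_WAIT_FOREVER, filt=TIPC_SUB_PORTS, handle=b""):
    """Build a subscription message for the range {type, lower, upper}."""
    return SUBSCR.pack(service_type, lower, upper, timeout, filt, handle)


def unpack_event(data):
    """Decode a service event into a dict of the fields named in the text."""
    (event, found_lower, found_upper, port, node,
     s_type, s_lower, s_upper, _timeout, _filt, handle) = EVENT.unpack(data)
    return {
        "event": EVENT_NAMES.get(event, event),
        "found_lower": found_lower,
        "found_upper": found_upper,
        "port": port,
        "node": node,
        "subscription": (s_type, s_lower, s_upper),
        "handle": handle.rstrip(b"\0"),
    }


def subscribe(service_type, lower, upper):
    """Connect to the node-local topology server and send one subscription.

    Returns the connected socket, or None if TIPC is unavailable.
    """
    try:
        sock = socket.socket(socket.AF_TIPC, socket.SOCK_SEQPACKET)
    except (AttributeError, OSError):
        return None
    try:
        sock.connect((socket.TIPC_ADDR_NAME, TIPC_TOP_SRV, TIPC_TOP_SRV, 0))
        sock.send(pack_subscription(service_type, lower, upper))
    except (AttributeError, OSError):
        sock.close()
        return None
    return sock
```

On a host with TIPC loaded, events for a hypothetical service type 18888 could then be read with unpack_event(sock.recv(EVENT.size)) on the socket returned by subscribe(18888, 0, 99).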

Cluster

A TIPC network consists of individual processing elements or nodes. Nodes can be physical processors, virtual machines or network namespaces, e.g., in the form of Docker containers. Those nodes are arranged into a cluster according to their assigned cluster identity. All nodes having the same cluster identity will establish links to each other, provided the network is set up to allow mutual neighbor discovery between them. It is only necessary to change the cluster identity from its default value if nodes in different clusters might discover each other, e.g., if they are attached to the same subnet. Nodes in different clusters cannot communicate with each other using TIPC.

Figure: Two physically interconnected, but logically separate, TIPC clusters.

Before Linux 4.17, each node had to be configured with a unique 32-bit node number or address, which had to comply with certain restrictions. As of Linux 4.17, each node has a 128-bit node identity which must be unique within the node's cluster. The node number is then calculated as a guaranteed unique hash from that identity.

If the node will be part of a cluster, the user can either rely on the auto-configuration capability of the node, where the identity is generated when the first interface is attached, or set the identity explicitly, e.g., from the node's host name or a UUID. If a node will not be part of a cluster, its identity can remain at the default value, zero.

Neighbor discovery is performed by UDP multicast or L2 broadcast, when available. If broadcast/multicast support is missing in the infrastructure, discovery can be performed by explicitly configured IP addresses.

A cluster consists of nodes interconnected with one or two links. A link constitutes a reliable packet transport service, sometimes referred to as an "L2.5" data link layer.

Cluster scalability

Since Linux 4.7, TIPC comes with a patent-pending, auto-adaptive hierarchical neighbor monitoring algorithm. This Overlapping Ring Monitoring algorithm, in reality a combination of ring monitoring and the Gossip protocol, makes it possible to establish full-mesh clusters of up to 1000 nodes with a failure discovery time of 1.5 seconds. In smaller clusters the discovery time can be made much shorter.

Performance

TIPC performs well, especially regarding round-trip latency. Inter-node, it is typically 33% faster than TCP; intra-node, it is 2 times faster for small messages and 7 times faster for large messages. Inter-node, its maximal throughput is 10–30% lower than TCP's, while its intra-node throughput is 25–30% higher. The TIPC team is currently studying how to add GSO/GRO support for intra-node messaging, in order to match TCP even here.

Transport media

While designed to be able to use all kinds of transport media, as of May 2018 implementations support UDP, Ethernet and InfiniBand. The VxWorks implementation also supports shared memory which can be accessed by multiple instances of the operating system, running simultaneously on the same hardware.

Security

Security must currently be provided by the transport media carrying TIPC. When running across UDP, IPsec can be used; when running on Ethernet, MACsec is the best option. The TIPC team is currently looking into how to support TLS or DTLS, either natively or by an addition to OpenSSL.

History

This protocol was originally developed by Jon Paul Maloy at Ericsson during 1996–2005 and was used by that company in cluster applications for several years, before subsequently being released to the open source community and integrated in the mainstream Linux kernel. It has since then undergone numerous improvements and upgrades, all performed by a dedicated TIPC project team with participants from various companies. The management tool for TIPC is part of the iproute2 tool package which comes as standard with all Linux distributions.
