Remote direct memory access

Last updated

In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.

Contents

Overview

RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.

However, this strategy presents several problems related to the fact that the target node is not notified of the completion of the request (single-sided communications).

Acceptance

As of 2018 RDMA had achieved broader acceptance as a result of implementation enhancements that enable good performance over ordinary networking infrastructure. [1] For example RDMA over Converged Ethernet (RoCE) now is able to run over either lossy or lossless infrastructure. In addition iWARP enables an Ethernet RDMA implementation at the physical layer using TCP/IP as the transport, combining the performance and latency advantages of RDMA with a low-cost, standards-based solution. [2] The RDMA Consortium and the DAT Collaborative [3] have played key roles in the development of RDMA protocols and APIs for consideration by standards groups such as the Internet Engineering Task Force and the Interconnect Software Consortium. [4]

Hardware vendors have started working on higher-capacity RDMA-based network adapters, with rates of 100 Gbit/s reported. [5] [6] Software vendors, such as IBM, [7] Red Hat and Oracle Corporation, support these APIs in their latest products, [8] and as of 2013 engineers have started developing network adapters that implement RDMA over Ethernet. [9] Both Red Hat Enterprise Linux and Red Hat Enterprise MRG [10] have support for RDMA. Microsoft supports RDMA in Windows Server 2012 via SMB Direct. VMware's ESXi product also supports RDMA as of 2015.

Common RDMA implementations include the Virtual Interface Architecture, RDMA over Converged Ethernet (RoCE), InfiniBand, Omni-Path and iWARP.

Using RDMA

Applications access control structures using well-defined APIs originally designed for the InfiniBand Protocol (although the APIs can be used for any of the underlying RDMA implementations). Using send and completion queues, applications perform RDMA operations by submitting work queue entries (WQEs) into the submission queue (SQ) and getting notified of responses from the completion queue (CQ). [11]

Transport Types

RDMA can transport data reliably or unreliably over the Reliably Connected (RC) and Unreliable Datagram (UD) transport protocols, respectively. The former has the benefit of preserving requests (no requests are lost), while the latter requires fewer queue pairs when handling multiple connections. This is due to the fact that UD is connection-less, allowing a single host to communicate with any other using a single queue. [12]


Related Research Articles

Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer network much like local storage is accessed. NFS, like many other protocols, builds on the Open Network Computing Remote Procedure Call system. NFS is an open IETF standard defined in a Request for Comments (RFC), allowing anyone to implement the protocol.

<span class="mw-page-title-main">InfiniBand</span> Network standard

InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. It is designed to be scalable and uses a switched fabric network topology. Between 2014 and June 2016, it was the most commonly used interconnect in the TOP500 list of supercomputers.

<span class="mw-page-title-main">Inter-process communication</span> How computer operating systems enable data sharing

In computer science, inter-process communication (IPC), also spelled interprocess communication, are the mechanisms provided by an operating system for processes to manage shared data. Typically, applications can use IPC, categorized as clients and servers, where the client requests data and the server responds to client requests. Many applications are both clients and servers, as commonly seen in distributed computing.

<span class="mw-page-title-main">Network interface controller</span> Hardware component that connects a computer to a network

A network interface controller is a computer hardware component that connects a computer to a computer network.

<span class="mw-page-title-main">Server Message Block</span> Network communication protocol for providing shared access to resources

Server Message Block (SMB) is a communication protocol used to share files, printers, serial ports, and miscellaneous communications between nodes on a network. On Microsoft Windows, the SMB implementation consists of two vaguely named Windows services: "Server" and "Workstation". It uses NTLM or Kerberos protocols for user authentication. It also provides an authenticated inter-process communication (IPC) mechanism.

TCP offload engine (TOE) is a technology used in some network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant. TOEs are often used as a way to reduce the overhead associated with Internet Protocol (IP) storage protocols such as iSCSI and Network File System (NFS).

In computer networking, DECserver initially referred to a highly successful family of asynchronous console server / terminal server / print server products introduced by Digital Equipment Corporation (DEC) and later referred to a class of UNIX-variant application and file server products based upon the MIPS processor. In February 1998, DEC sold its Network Products Business to Cabletron, which then spun out as its own company, Digital Networks, in September 2000.

iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Contrary to some accounts, iWARP is not an acronym.

The iSCSI Extensions for RDMA (iSER) is a computer network protocol that extends the Internet Small Computer System Interface (iSCSI) protocol to use Remote Direct Memory Access (RDMA). RDMA can be provided by the Transmission Control Protocol (TCP) with RDMA services (iWARP), which uses an existing Ethernet setup and therefore has lower hardware costs, RoCE, which does not need the TCP layer and therefore provides lower latency, or InfiniBand. iSER permits data to be transferred directly into and out of SCSI computer memory buffers without intermediate data copies and with minimal CPU involvement.

<span class="mw-page-title-main">OpenFabrics Alliance</span> Organization

The OpenFabrics Alliance is a non-profit organization that promotes remote direct memory access (RDMA) switched fabric technologies for server and storage connectivity. These high-speed data-transport technologies are used in high-performance computing facilities, in research and various industries.

In computing the SCSI RDMA Protocol (SRP) is a protocol that allows one computer to access SCSI devices attached to another computer via remote direct memory access (RDMA). The SRP protocol is also known as the SCSI Remote Protocol. The use of RDMA makes higher throughput and lower latency possible than what is generally available through e.g. the TCP/IP communication protocol.

<span class="mw-page-title-main">Storage area network</span> Network which provides access to consolidated, block-level data storage

A storage area network (SAN) or storage network is a computer network which provides access to consolidated, block-level data storage. SANs are primarily used to access data storage devices, such as disk arrays and tape libraries from servers so that the devices appear to the operating system as direct-attached storage. A SAN typically is a dedicated network of storage devices not accessible through the local area network (LAN).

In computer science, memory virtualization decouples volatile random access memory (RAM) resources from individual systems in the data centre, and then aggregates those resources into a virtualized memory pool available to any computer in the cluster. The memory pool is accessed by the operating system or applications running on top of the operating system. The distributed memory pool can then be utilized as a high-speed cache, a messaging layer, or a large, shared memory resource for a CPU or a GPU application.

<span class="mw-page-title-main">LIO (SCSI target)</span> Open-source version of SCSI target

In computing, Linux-IO (LIO) Target is an open-source implementation of the SCSI target that has become the standard one included in the Linux kernel. Internally, LIO does not initiate sessions, but instead provides one or more Logical Unit Numbers (LUNs), waits for SCSI commands from a SCSI initiator, and performs required input/output data transfers. LIO supports common storage fabrics, including FCoE, Fibre Channel, IEEE 1394, iSCSI, iSCSI Extensions for RDMA (iSER), SCSI RDMA Protocol (SRP) and USB. It is included in most Linux distributions; native support for LIO in QEMU/KVM, libvirt, and OpenStack makes LIO also a storage option for cloud deployments.

<span class="mw-page-title-main">Chelsio Communications</span> American technology company

Chelsio Communications is a privately held technology company headquartered in Sunnyvale, California with a design center in Bangalore, India. Early venture capital funding came from Horizons Ventures, Invesco, Investor Growth Capital, NTT Finance, Vendanta Capital, Abacus Capital Group, Pacesetter Capital Group, and New Enterprise Associates. A third round of funding raised $25 million in late 2004. LSI Corporation was added as investor in 2006 in the series D round. By January 2008, a $25M financing round was announced as series E. In 2009, an additional $17M was raised from previous investors plus Mobile Internet Capital.

RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE) is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.

<span class="mw-page-title-main">Mellanox Technologies</span> Israeli-American multinational supplier of computer networking products

Mellanox Technologies Ltd. was an Israeli-American multinational supplier of computer networking products based on InfiniBand and Ethernet technology. Mellanox offered adapters, switches, software, cables and silicon for markets including high-performance computing, data centers, cloud computing, computer data storage and financial services.

RTP-MIDI is a protocol to transport MIDI messages within Real-time Transport Protocol (RTP) packets over Ethernet and WiFi networks. It is completely open and free, and is compatible both with LAN and WAN application fields. Compared to MIDI 1.0, RTP-MIDI includes new features like session management, device synchronization and detection of lost packets, with automatic regeneration of lost data. RTP-MIDI is compatible with real-time applications, and supports sample-accurate synchronization for each MIDI message.

<span class="mw-page-title-main">SoftEther VPN</span> Open-source VPN client and server software

SoftEther VPN is free open-source, cross-platform, multi-protocol VPN client and VPN server software, developed as part of Daiyuu Nobori's master's thesis research at the University of Tsukuba. VPN protocols such as SSL VPN, L2TP/IPsec, OpenVPN, and Microsoft Secure Socket Tunneling Protocol are provided in a single VPN server. It was released using the GPLv2 license on January 4, 2014. The license was switched to Apache License 2.0 on January 21, 2019.

SHMEM is a family of parallel programming libraries, providing one-sided, RDMA, parallel-processing interfaces for low-latency distributed-memory supercomputers. The SHMEM acronym was subsequently reverse engineered to mean "Symmetric Hierarchical MEMory”. Later it was expanded to distributed memory parallel computer clusters, and is used as parallel programming interface or as low-level interface to build partitioned global address space (PGAS) systems and languages. “Libsma”, the first SHMEM library, was created by Richard Smith at Cray Research in 1993 as a set of thin interfaces to access the CRAY T3D's inter-processor-communication hardware. SHMEM has been implemented by Cray Research, SGI, Cray Inc., Quadrics, HP, GSHMEM, IBM, QLogic, Mellanox, Universities of Houston and Florida; there is also open-source OpenSHMEM.

References

  1. RoCE Rocks over Lossy Network: https://dl.acm.org/citation.cfm?id=3098588&dl=ACM&coll=DL
  2. "Understanding iWARP" (PDF). Intel Corporation. Retrieved 16 May 2018.
  3. "DAT Collaborative website". Archived from the original on 17 January 2015. Retrieved 14 October 2014.
  4. The Interconnect Software Consortium website Archived 2005-08-30 at the Wayback Machine
  5. "Microsoft Based Solutions - Mellanox Technologies" . Retrieved 14 October 2014.
  6. "40Gbe SMB Direct RDMA Over Ethernet For Windows Server 2012 - Chelsio Communications". 2 April 2013. Retrieved 14 October 2014.
  7. "SOFA-STORAGE: CREATING A VENDOR AGNOSTIC FRAMEWORK TO ENABLE SEAMLESS STORAGE OFFLOAD USING SMARTNICS" (PDF).
  8. "What RDMA hardware is supported in Red Hat Enterprise Linux?". 2 June 2016.
  9. "40Gbe SMB Direct RDMA Over Ethernet For Windows Server 2012 - Chelsio Communications". Chelsio Communications. 2013-04-02. Retrieved 2016-07-15. The demonstration will show Microsoft's Windows Server 2012 SMB Direct running at line-rate 40Gb using RDMA over Ethernet (iWARP).
  10. "Red Hat Enterprise MRG 2.0 Now Available". Archived from the original on 25 August 2016. Retrieved 23 June 2011.
  11. Storm: a fast transactional dataplane for remote data structures: https://dl.acm.org/doi/abs/10.1145/3319647.3325827
  12. Storm: a fast transactional dataplane for remote data structures: https://dl.acm.org/doi/pdf/10.1145/3319647.3325827