Remote direct memory access

In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.

Overview

RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.
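
To expose its memory for such direct transfers, an application first registers a buffer with the RDMA network adapter, which pins the pages and gives the adapter the keys it needs to access them. The sketch below shows this step using the libibverbs API; it is a minimal illustration that assumes an RDMA-capable device is present, and the buffer size and access flags are arbitrary choices.

    /* Minimal sketch: register application memory with an RDMA NIC (libibverbs).
     * Assumes an RDMA-capable device is installed; build with -libverbs. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num_devices = 0;
        struct ibv_device **devs = ibv_get_device_list(&num_devices);
        if (!devs || num_devices == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* protection domain */

        size_t len = 4096;
        void *buf = malloc(len);                 /* ordinary application memory */

        /* Registration pins the pages and hands the NIC a translation, so it
         * can move data between the wire and this buffer without CPU copies. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }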

However, this strategy presents several problems related to the fact that the target node is not notified of the completion of the request (single-sided communications).

Acceptance

As of 2018, RDMA had achieved broader acceptance as a result of implementation enhancements that enable good performance over ordinary networking infrastructure.[1] For example, RDMA over Converged Ethernet (RoCE) is now able to run over either lossy or lossless infrastructure. In addition, iWARP implements RDMA over Ethernet using TCP/IP as the transport, combining the performance and latency advantages of RDMA with a low-cost, standards-based solution.[2] The RDMA Consortium and the DAT Collaborative[3] have played key roles in the development of RDMA protocols and APIs for consideration by standards groups such as the Internet Engineering Task Force and the Interconnect Software Consortium.[4]

Hardware vendors have started working on higher-capacity RDMA-based network adapters, with rates of 100 Gbit/s reported.[5][6] Software vendors, such as IBM,[7] Red Hat and Oracle Corporation, support these APIs in their latest products,[8] and as of 2013 engineers had started developing network adapters that implement RDMA over Ethernet.[9] Both Red Hat Enterprise Linux and Red Hat Enterprise MRG[10] have support for RDMA. Microsoft supports RDMA in Windows Server 2012 via SMB Direct. VMware ESXi has also supported RDMA since 2015.

Common RDMA implementations include the Virtual Interface Architecture, RDMA over Converged Ethernet (RoCE), InfiniBand, Omni-Path and iWARP.

Using RDMA

Applications access control structures using well-defined APIs originally designed for the InfiniBand protocol (although the APIs can be used with any of the underlying RDMA implementations). Using send and completion queues, applications perform RDMA operations by submitting work queue entries (WQEs) to the send queue (SQ) and getting notified of completions from the completion queue (CQ).[11]
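
This flow can be illustrated with the libibverbs API: a WQE describing a one-sided RDMA write is posted to the send queue, and the application later polls the completion queue for the result. The following is a minimal sketch rather than a complete program; it assumes qp is already connected (for example via librdmacm), mr is a registered local buffer, and the remote address and rkey were exchanged out of band.

    /* Sketch of the WQE/SQ/CQ flow (libibverbs). Assumes an already-connected
     * queue pair `qp`, a registered memory region `mr`, and a peer buffer whose
     * address and rkey were exchanged beforehand. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_mr *mr, uint64_t remote_addr, uint32_t rkey)
    {
        /* Scatter/gather element describing the local buffer. */
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,
            .length = (uint32_t)mr->length,
            .lkey   = mr->lkey,
        };

        /* Work queue entry: a one-sided RDMA write to the remote buffer. */
        struct ibv_send_wr wr;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 42;                 /* returned in the completion */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* request a CQ entry */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr = NULL;
        if (ibv_post_send(qp, &wr, &bad_wr))         /* submit to the send queue */
            return -1;

        /* Poll the completion queue until the WQE completes. */
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }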

Transport types

RDMA can transport data reliably or unreliably over the Reliably Connected (RC) and Unreliable Datagram (UD) transport protocols, respectively. The former has the benefit of preserving requests (no requests are lost), while the latter requires fewer queue pairs when handling multiple connections. This is because UD is connectionless, allowing a single host to communicate with any number of peers through a single queue pair, as sketched below.
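
In terms of the verbs API, the transport is selected by the queue pair type chosen at creation time: RC needs one connected queue pair per peer, while a UD queue pair can address any number of peers through per-destination address handles. The sketch below illustrates the choice; the helper name and queue capacities are illustrative assumptions, and pd and cq are assumed to exist already.

    /* Sketch: selecting the RDMA transport by queue pair type (libibverbs). */
    #include <infiniband/verbs.h>
    #include <string.h>

    /* Illustrative helper: create a QP of the requested transport type. */
    static struct ibv_qp *make_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                  enum ibv_qp_type type)
    {
        struct ibv_qp_init_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.send_cq          = cq;
        attr.recv_cq          = cq;
        attr.qp_type          = type;      /* IBV_QPT_RC or IBV_QPT_UD */
        attr.cap.max_send_wr  = 64;
        attr.cap.max_recv_wr  = 64;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;
        return ibv_create_qp(pd, &attr);
    }

    static void choose_transport(struct ibv_pd *pd, struct ibv_cq *cq, int n_peers)
    {
        /* RC: acknowledged, in-order delivery and one-sided read/write,
         * but one connected queue pair is needed per peer. */
        struct ibv_qp *rc_qps[n_peers];
        for (int i = 0; i < n_peers; i++)
            rc_qps[i] = make_qp(pd, cq, IBV_QPT_RC);

        /* UD: datagrams may be dropped and only send/receive is supported,
         * but this single queue pair can reach every peer, each addressed
         * by its own address handle (ibv_ah) in the work request. */
        struct ibv_qp *ud_qp = make_qp(pd, cq, IBV_QPT_UD);
        (void)ud_qp;
    }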

Related Research Articles

Internet Small Computer Systems Interface or iSCSI is an Internet Protocol-based storage networking standard for linking data storage facilities. iSCSI provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network. iSCSI facilitates data transfers over intranets and the management of storage over long distances. It can be used to transmit data over local area networks (LANs), wide area networks (WANs), or the Internet and can enable location-independent data storage and retrieval.

<span class="mw-page-title-main">InfiniBand</span> Network standard

InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. It is designed to be scalable and uses a switched fabric network topology. Between 2014 and June 2016, it was the most commonly used interconnect in the TOP500 list of supercomputers.

<span class="mw-page-title-main">Network interface controller</span> Hardware component that connects a computer to a network

A network interface controller is a computer hardware component that connects a computer to a computer network.

<span class="mw-page-title-main">Server Message Block</span> Network communication protocol for providing shared access to resources

Server Message Block (SMB) is a communication protocol used to provide shared access to files, printers, and serial ports, as well as miscellaneous communications, between nodes on a network. On Microsoft Windows, the SMB implementation consists of two vaguely named Windows services: "Server" and "Workstation". It uses the NTLM or Kerberos protocols for user authentication. It also provides an authenticated inter-process communication (IPC) mechanism.

TCP offload engine (TOE) is a technology used in some network interface cards (NICs) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as Gigabit Ethernet and 10 Gigabit Ethernet, where the processing overhead of the network stack becomes significant. TOEs are often used as a way to reduce the overhead associated with Internet Protocol (IP) storage protocols such as iSCSI and the Network File System (NFS).

iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Contrary to some accounts, iWARP is not an acronym.

The Virtual Interface Architecture (VIA) is an abstract model of a user-level zero-copy network, and is the basis for InfiniBand, iWARP and RoCE. Created by Microsoft, Intel, and Compaq, the original VIA sought to standardize the interface for high-performance network technologies known as System Area Networks.

The iSCSI Extensions for RDMA (iSER) is a computer network protocol that extends the Internet Small Computer System Interface (iSCSI) protocol to use Remote Direct Memory Access (RDMA). RDMA can be provided by iWARP (the Transmission Control Protocol (TCP) with RDMA services), which uses an existing Ethernet setup and therefore has lower hardware costs; by RoCE, which does not need the TCP layer and therefore provides lower latency; or by InfiniBand. iSER permits data to be transferred directly into and out of SCSI computer memory buffers without intermediate data copies and with minimal CPU involvement.

<span class="mw-page-title-main">OpenFabrics Alliance</span> Organization

The OpenFabrics Alliance is a non-profit organization that promotes remote direct memory access (RDMA) switched fabric technologies for server and storage connectivity. These high-speed data-transport technologies are used in high-performance computing facilities, in research and various industries.

In computing, the SCSI RDMA Protocol (SRP) is a protocol that allows one computer to access SCSI devices attached to another computer via remote direct memory access (RDMA). SRP is also known as the SCSI Remote Protocol. The use of RDMA makes higher throughput and lower latency possible than is generally achievable with, for example, the TCP/IP communication protocol.

<span class="mw-page-title-main">Storage area network</span> Network which provides access to consolidated, block-level data storage

A storage area network (SAN) or storage network is a computer network which provides access to consolidated, block-level data storage. SANs are primarily used to access data storage devices, such as disk arrays and tape libraries, from servers so that the devices appear to the operating system as direct-attached storage. A SAN is typically a dedicated network of storage devices not accessible through the local area network (LAN).

In computer science, memory virtualization decouples volatile random access memory (RAM) resources from individual systems in the data center, and then aggregates those resources into a virtualized memory pool available to any computer in the cluster. The memory pool is accessed by the operating system or applications running on top of the operating system. The distributed memory pool can then be utilized as a high-speed cache, a messaging layer, or a large, shared memory resource for a CPU or a GPU application.

<span class="mw-page-title-main">LIO (SCSI target)</span> Open-source version of SCSI target

In computing, Linux-IO (LIO) Target is an open-source implementation of the SCSI target that has become the standard one included in the Linux kernel. Internally, LIO does not initiate sessions, but instead provides one or more Logical Unit Numbers (LUNs), waits for SCSI commands from a SCSI initiator, and performs required input/output data transfers. LIO supports common storage fabrics, including FCoE, Fibre Channel, IEEE 1394, iSCSI, iSCSI Extensions for RDMA (iSER), SCSI RDMA Protocol (SRP) and USB. It is included in some Linux distributions; native support for LIO in QEMU/KVM, libvirt, and OpenStack makes LIO also a storage option for cloud deployments.

<span class="mw-page-title-main">Chelsio Communications</span> American technology company

Chelsio Communications is a privately held technology company headquartered in Sunnyvale, California, with a design center in Bangalore, India. Early venture capital funding came from Horizons Ventures, Invesco, Investor Growth Capital, NTT Finance, Vendanta Capital, Abacus Capital Group, Pacesetter Capital Group, and New Enterprise Associates. A third round of funding raised $25 million in late 2004. LSI Corporation was added as an investor in 2006 in the series D round. By January 2008, a $25M financing round was announced as series E. In 2009, an additional $17M was raised from previous investors plus Mobile Internet Capital.

RDMA over Converged Ethernet (RoCE) is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.

<span class="mw-page-title-main">Mellanox Technologies</span> Israeli-American multinational supplier of computer networking products

Mellanox Technologies Ltd. was an Israeli-American multinational supplier of computer networking products based on InfiniBand and Ethernet technology. Mellanox offered adapters, switches, software, cables and silicon for markets including high-performance computing, data centers, cloud computing, computer data storage and financial services.

<span class="mw-page-title-main">SCST</span>

SCST is a GPL-licensed SCSI target software stack. The design goals of this software stack are high performance, high reliability, strict conformance to existing SCSI standards, and ease of extension and use. SCST supports not only multiple SCSI protocols but also multiple local storage interfaces, as well as storage drivers implemented in user space via the scst_user driver.

SHMEM is a family of parallel programming libraries providing one-sided, RDMA, parallel-processing interfaces for low-latency distributed-memory supercomputers. The SHMEM acronym was subsequently reverse-engineered to mean "Symmetric Hierarchical MEMory". It was later extended to distributed-memory parallel computer clusters, and is used as a parallel programming interface or as a low-level interface with which to build partitioned global address space (PGAS) systems and languages. "Libsma", the first SHMEM library, was created by Richard Smith at Cray Research in 1993 as a set of thin interfaces to access the CRAY T3D's inter-processor-communication hardware. SHMEM has been implemented by Cray Research, SGI, Cray Inc., Quadrics, HP, GSHMEM, IBM, QLogic, Mellanox, and the Universities of Houston and Florida; there is also the open-source OpenSHMEM.
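
A minimal OpenSHMEM program illustrating this one-sided style is sketched below: each processing element writes a value directly into a symmetric variable on its neighbour with a single put, and the target does not post a matching receive. It assumes an OpenSHMEM implementation is installed (compiled, for example, with oshcc and launched with oshrun); the variable names are arbitrary.

    /* Minimal OpenSHMEM sketch: a one-sided put between processing elements. */
    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Static (symmetric) variables exist at the same address on every PE. */
        static long src = 0, dst = -1;
        src = me;

        shmem_barrier_all();

        /* One-sided RDMA-style put: write src into dst on the next PE.
         * The target PE does not post a receive or otherwise participate. */
        shmem_long_put(&dst, &src, 1, (me + 1) % npes);

        shmem_barrier_all();
        printf("PE %d: dst = %ld (written by PE %d)\n",
               me, dst, (me - 1 + npes) % npes);

        shmem_finalize();
        return 0;
    }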

Enterprise Storage OS, also known as ESOS, is a Linux distribution that serves as a block-level storage server in a storage area network (SAN). ESOS is composed of open-source software projects that are required for a Linux distribution and several proprietary build and install time options. The SCST project is the core component of ESOS; it provides the back-end storage functionality.

Omni-Path Architecture (OPA) is a high-performance communication architecture developed by Intel. It aims for low communication latency, low power consumption, and high throughput. It directly competes with InfiniBand. Intel planned to develop technology based on this architecture for exascale computing. The current owner of Omni-Path is Cornelis Networks.

References

  1. "RoCE Rocks over Lossy Network". https://dl.acm.org/citation.cfm?id=3098588&dl=ACM&coll=DL
  2. "Understanding iWARP" (PDF). Intel Corporation. Retrieved 16 May 2018.
  3. "DAT Collaborative website". Archived from the original on 17 January 2015. Retrieved 14 October 2014.
  4. The Interconnect Software Consortium website Archived 2005-08-30 at the Wayback Machine
  5. "Microsoft Based Solutions - Mellanox Technologies" . Retrieved 14 October 2014.
  6. "40Gbe SMB Direct RDMA Over Ethernet For Windows Server 2012 - Chelsio Communications". 2 April 2013. Retrieved 14 October 2014.
  7. "SOFA-STORAGE: CREATING A VENDOR AGNOSTIC FRAMEWORK TO ENABLE SEAMLESS STORAGE OFFLOAD USING SMARTNICS" (PDF).
  8. "What RDMA hardware is supported in Red Hat Enterprise Linux?". 2 June 2016.
  9. "40Gbe SMB Direct RDMA Over Ethernet For Windows Server 2012 - Chelsio Communications". Chelsio Communications. 2013-04-02. Retrieved 2016-07-15. The demonstration will show Microsoft's Windows Server 2012 SMB Direct running at line-rate 40Gb using RDMA over Ethernet (iWARP).
  10. "Red Hat Enterprise MRG 2.0 Now Available". Archived from the original on 25 August 2016. Retrieved 23 June 2011.
  11. "Storm: a fast transactional dataplane for remote data structures". https://dl.acm.org/doi/abs/10.1145/3319647.3325827