RDMA over Converged Ethernet

RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE) [1] is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol, which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, it can also be used on a traditional or non-converged Ethernet network. [2] [3] [4] [5]
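
The version split is visible directly on the wire: RoCE v1 is identified by its Ethertype, RoCE v2 by its UDP destination port. The following minimal Python sketch classifies a raw frame on that basis; the function name is illustrative and the parsing is deliberately simplified (no VLAN tags, no IPv6, no IP options).

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype assigned to RoCE v1
ROCE_V2_UDP_PORT = 4791     # UDP destination port reserved for RoCE v2

def classify_roce(frame: bytes) -> str:
    """Classify a raw Ethernet frame as RoCE v1, RoCE v2, or neither.

    Simplified sketch: ignores VLAN tags, IPv6 and IP options.
    """
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"            # IB packet follows the Ethernet header
    if ethertype == 0x0800:         # IPv4
        ihl = (frame[14] & 0x0F) * 4          # IPv4 header length in bytes
        proto = frame[14 + 9]                 # IPv4 protocol field
        if proto == 17:                       # UDP
            dst_port = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
            if dst_port == ROCE_V2_UDP_PORT:
                return "RoCE v2"    # IB packet follows the UDP header
    return "not RoCE"
```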

Background

Network-intensive applications like networked storage or cluster computing need a network infrastructure with high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth. [6] The RoCE protocol allows lower latencies than the earlier iWARP protocol. [7] There are RoCE HCAs (host channel adapters) with a latency as low as 1.3 microseconds, [8] [9] while the lowest known iWARP HCA latency in 2011 was 3 microseconds. [10]

Figure: RoCE header format

RoCE v1

The RoCE v1 protocol is an Ethernet link layer protocol with Ethertype 0x8915. [2] This means that the frame length limits of the Ethernet protocol apply: a payload of at most 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.
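
Since RoCE v1 is plain Ethernet framing, constructing a frame amounts to prepending an Ethernet header with Ethertype 0x8915 to the InfiniBand packet and respecting the payload limit. A minimal sketch follows; the helper name is invented, and building the actual IB payload (GRH, BTH and so on) is out of scope here:

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915

def roce_v1_frame(dst_mac: bytes, src_mac: bytes, ib_packet: bytes,
                  jumbo: bool = False) -> bytes:
    """Wrap an InfiniBand transport packet in a RoCE v1 Ethernet frame,
    enforcing the payload limits discussed above."""
    limit = 9000 if jumbo else 1500
    if len(ib_packet) > limit:
        raise ValueError(f"IB packet of {len(ib_packet)} bytes exceeds "
                         f"the {limit}-byte Ethernet payload limit")
    return dst_mac + src_mac + struct.pack("!H", ROCE_V1_ETHERTYPE) + ib_packet
```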

RoCE v1.5

RoCE v1.5 is an uncommon, experimental, non-standardized protocol based on the IP protocol. RoCE v1.5 uses the IP protocol field to differentiate its traffic from other IP protocols such as TCP and UDP. The value used for the protocol number is unspecified and is left to the deployment to select.
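
Because the protocol number is left to the deployment, one plausible choice is the experimental range reserved by RFC 3692 (IP protocol numbers 253 and 254). The sketch below illustrates only the classification mechanism; it is not code from any RoCE v1.5 implementation, and the chosen number is an assumption:

```python
import socket

# RFC 3692 reserves IP protocol numbers 253 and 254 for experimentation,
# a plausible (assumed, not specified) choice for a RoCE v1.5 deployment.
EXPERIMENTAL_PROTO = 253

# Raw sockets need root/CAP_NET_RAW. The kernel builds the IP header and
# writes EXPERIMENTAL_PROTO into its protocol field, which is all that
# distinguishes RoCE v1.5 traffic from TCP or UDP at the IP layer.
sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, EXPERIMENTAL_PROTO)
sock.sendto(b"IB transport packet bytes", ("192.0.2.1", 0))  # port is unused
```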

RoCE v2

The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol. [3] The UDP destination port number 4791 has been reserved for RoCE v2. [11] Since RoCEv2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE [12] or RRoCE. [4] Although in general the delivery order of UDP packets is not guaranteed, the RoCEv2 specification requires that packets with the same UDP source port and the same destination address must not be reordered. [4] In addition, RoCEv2 defines a congestion control mechanism that uses the IP ECN bits for marking and CNP [13] frames for the acknowledgment notification. [14] Software support for RoCE v2 arrived gradually: it is available in Mellanox OFED 2.3 and later, and in the Linux kernel since v4.5. [15]
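
The no-reorder rule makes the UDP source port a de facto flow identifier: a sender can give each queue pair its own source port so that ECMP routing spreads flows across paths while order is preserved within each flow. A hedged sketch of that idea plus ECN marking; the port numbers and helper name are invented for illustration:

```python
import socket

ROCE_V2_UDP_PORT = 4791

def open_flow(src_port: int) -> socket.socket:
    """Open a UDP socket whose fixed source port identifies one flow.

    Per the RoCEv2 specification, packets sharing a source port and a
    destination address must not be reordered; giving each queue pair its
    own source port lets ECMP routing spread flows over distinct paths.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", src_port))
    # Mark packets ECN-capable: the low two bits of the TOS byte set to
    # 0b10 is ECT(0), letting congested switches mark CE instead of dropping.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0b10)
    return sock

flow_a = open_flow(49152)   # one queue pair, one network path
flow_b = open_flow(49153)   # another queue pair, possibly another path
flow_a.sendto(b"BTH and payload would go here", ("192.0.2.1", ROCE_V2_UDP_PORT))
```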

RoCE versus InfiniBand

RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric. [16] Others expected that InfiniBand would keep offering higher bandwidth and lower latency than what is possible over Ethernet. [17]

The technical differences between the RoCE and InfiniBand protocols are:

RoCE versus iWARP

While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v1 is limited to a single Ethernet broadcast domain; RoCE v2 and iWARP packets are routable. The memory requirements of a large number of connections, along with TCP's flow and reliability controls, lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (e.g., large-scale enterprises, cloud computing, Web 2.0 applications [21]). Also, multicast is defined in the RoCE specification, while the current iWARP specification does not define how to perform multicast RDMA. [22] [23] [24]

Reliability in iWARP is given by the protocol itself, as TCP is reliable. RoCEv2, on the other hand, utilizes UDP, which has far less overhead and better performance but does not provide inherent reliability; reliability must therefore be implemented alongside RoCEv2. One solution is to use converged Ethernet switches to make the local area network lossless. This requires converged Ethernet support on all the switches in the local area network and prevents RoCEv2 packets from traveling through a wide area network such as the internet, which is not lossless. Another solution is to add reliability to the RoCE protocol (reliable RoCE), which adds handshaking to RoCE to provide reliability at the cost of performance.
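
The cost of that handshaking can be seen in a toy stop-and-wait scheme over UDP: every datagram is retransmitted until a matching acknowledgment arrives, which restores reliability but bounds throughput at one datagram per round trip. This is a deliberately simple model, not the actual reliable-RoCE mechanism:

```python
import socket

def send_reliably(sock, dest, payloads, timeout=0.2, retries=5):
    """Toy stop-and-wait sender: prefix each datagram with a one-byte
    sequence number and resend it until the matching ACK byte arrives."""
    sock.settimeout(timeout)
    for seq, payload in enumerate(payloads):
        seq &= 0xFF
        for _ in range(retries):
            sock.sendto(bytes([seq]) + payload, dest)
            try:
                ack, _ = sock.recvfrom(1)
                if ack and ack[0] == seq:
                    break                  # acknowledged, move on
            except socket.timeout:
                continue                   # lost or delayed: retransmit
        else:
            raise IOError(f"no ACK for sequence number {seq}")
```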

Which protocol is better largely depends on the vendor. Chelsio recommends and exclusively supports iWARP. Mellanox, Xilinx, and Broadcom recommend and exclusively support RoCE/RoCEv2. Intel initially supported iWARP but now supports both iWARP and RoCEv2. [25] Other parties in the network industry support both protocols, such as Marvell, Microsoft, Linux and Kazan. [26] Cisco supports both RoCE [27] and its own VIC RDMA protocol.

Both protocols are standardized, with iWARP being the standard for RDMA over TCP defined by the IETF and RoCE being the standard for RDMA over Ethernet defined by the IBTA. [26]

Criticism

Some aspects that could have been defined in the RoCE specification have been left out. These are:

In addition, any protocol running over IP cannot assume that the underlying network has guaranteed ordering, any more than it can assume that congestion cannot occur.
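
Detecting such misordering is the receiver's job. The InfiniBand transport that RoCE reuses carries a 24-bit packet sequence number (PSN) in its Base Transport Header, so a responder can check continuity along the following lines (the helper name and return values are illustrative only):

```python
PSN_MODULUS = 1 << 24   # the BTH packet sequence number is 24 bits wide

def check_psn(expected_psn: int, received_psn: int) -> str:
    """Classify an arriving packet by its PSN relative to what is expected.

    A receiver that cannot assume in-order delivery needs a check of this
    kind; for reliable service it would answer a gap with a NAK so that
    the sender retransmits from the missing PSN.
    """
    if received_psn == expected_psn:
        return "in order"
    # Compare modulo 2**24: a small forward difference means packets were
    # lost or reordered; a large one means an old duplicate came back.
    diff = (received_psn - expected_psn) % PSN_MODULUS
    return "ahead (gap)" if diff < PSN_MODULUS // 2 else "behind (duplicate)"
```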

It is known that the use of Priority-based Flow Control (PFC) can lead to a network-wide deadlock. [32] [33] [34]
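
A toy model, not taken from the cited papers, shows why such deadlocks can form: treat each switch buffer as a node and draw an edge from a paused buffer to the buffer it is waiting on; any cycle means no buffer in it can ever drain. The topology and names below are invented for illustration:

```python
def has_deadlock(pause_deps: dict) -> bool:
    """Detect a cycle in a PFC pause-dependency graph via depth-first
    search. An edge A -> B means 'buffer A is paused until B drains';
    a cycle is a deadlock, since every buffer in it waits forever."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in pause_deps}

    def dfs(node):
        color[node] = GRAY
        for nxt in pause_deps.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:   # back edge closes a cycle
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in pause_deps)

# Three switches pausing one another in a ring can never drain:
print(has_deadlock({"S1": {"S2"}, "S2": {"S3"}, "S3": {"S1"}}))  # True
```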

Vendors

Some vendors of RoCE-enabled equipment include:

Related Research Articles

<span class="mw-page-title-main">Multicast</span> Computer networking technique

In computer networking, multicast is group communication where data transmission is addressed to a group of destination computers simultaneously. Multicast can be one-to-many or many-to-many distribution. Multicast should not be confused with physical layer point-to-multipoint communication.

The Routing Information Protocol (RIP) is one of the oldest distance-vector routing protocols, and it employs the hop count as a routing metric. RIP prevents routing loops by implementing a limit on the number of hops allowed in a path from source to destination. The largest number of hops allowed for RIP is 15, which limits the size of networks that RIP can support.

<span class="mw-page-title-main">InfiniBand</span> Network standard

InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. It is designed to be scalable and uses a switched fabric network topology. Between 2014 and June 2016, it was the most commonly used interconnect in the TOP500 list of supercomputers.

<span class="mw-page-title-main">Transport layer</span> Layer in the OSI and TCP/IP models providing host-to-host communication services for applications

In computer networking, the transport layer is a conceptual division of methods in the layered architecture of protocols in the network stack in the Internet protocol suite and the OSI model. The protocols of this layer provide end-to-end communication services for applications. It provides services such as connection-oriented communication, reliability, flow control, and multiplexing.

A multilayer switch (MLS) is a computer networking device that switches on OSI layer 2 like an ordinary network switch and provides extra functions on higher OSI layers. The MLS was invented by engineers at Digital Equipment Corporation.

In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.

<span class="mw-page-title-main">NetFlow</span> Communications protocol

NetFlow is a feature that was introduced on Cisco routers around 1996 that provides the ability to collect IP network traffic as it enters or exits an interface. By analyzing the data provided by NetFlow, a network administrator can determine things such as the source and destination of traffic, class of service, and the causes of congestion. A typical flow monitoring setup consists of three main components:

IP multicast is a method of sending Internet Protocol (IP) datagrams to a group of interested receivers in a single transmission. It is the IP-specific form of multicast and is used for streaming media and other network applications. It uses specially reserved multicast address blocks in IPv4 and IPv6.

In computer networking, a reliable protocol is a communication protocol that notifies the sender whether or not the delivery of data to intended recipients was successful. Reliability is a synonym for assurance, which is the term used by the ITU and ATM Forum.

The Sockets Direct Protocol (SDP) is a transport-agnostic protocol to support stream sockets over remote direct memory access (RDMA) network fabrics. SDP was originally defined by the Software Working Group (SWG) of the InfiniBand Trade Association. Originally designed for InfiniBand (IB), SDP is currently maintained by the OpenFabrics Alliance.

iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Contrary to some accounts, iWARP is not an acronym.

The iSCSI Extensions for RDMA (iSER) is a computer network protocol that extends the Internet Small Computer System Interface (iSCSI) protocol to use Remote Direct Memory Access (RDMA). RDMA can be provided by the Transmission Control Protocol (TCP) with RDMA services (iWARP), which uses an existing Ethernet setup and therefore has lower hardware costs, RoCE, which does not need the TCP layer and therefore provides lower latency, or InfiniBand. iSER permits data to be transferred directly into and out of SCSI computer memory buffers without intermediate data copies and with minimal CPU involvement.

<span class="mw-page-title-main">OpenFabrics Alliance</span> Organization

The OpenFabrics Alliance is a non-profit organization that promotes remote direct memory access (RDMA) switched fabric technologies for server and storage connectivity. These high-speed data-transport technologies are used in high-performance computing facilities, in research and various industries.

In packet switching networks, traffic flow, packet flow or network flow is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain. RFC 2722 defines traffic flow as "an artificial logical equivalent to a call or connection." RFC 3697 defines traffic flow as "a sequence of packets sent from a particular source to a particular unicast, anycast, or multicast destination that the source desires to label as a flow. A flow could consist of all packets in a specific transport connection or a media stream. However, a flow is not necessarily 1:1 mapped to a transport connection." Flow is also defined in RFC 3917 as "a set of IP packets passing an observation point in the network during a certain time interval." Packet flow temporal efficiency can be affected by one-way delay (OWD) that is described as a combination of the following components:

Data center bridging (DCB) is a set of enhancements to the Ethernet local area network communication protocol for use in data center environments, in particular for use with clustering and storage area networks.

<span class="mw-page-title-main">LIO (SCSI target)</span> Open-source version of SCSI target

In computing, Linux-IO (LIO) Target is an open-source implementation of the SCSI target that has become the standard one included in the Linux kernel. Internally, LIO does not initiate sessions, but instead provides one or more Logical Unit Numbers (LUNs), waits for SCSI commands from a SCSI initiator, and performs required input/output data transfers. LIO supports common storage fabrics, including FCoE, Fibre Channel, IEEE 1394, iSCSI, iSCSI Extensions for RDMA (iSER), SCSI RDMA Protocol (SRP) and USB. It is included in most Linux distributions; native support for LIO in QEMU/KVM, libvirt, and OpenStack makes LIO also a storage option for cloud deployments.

<span class="mw-page-title-main">Mellanox Technologies</span> Israeli-American multinational supplier of computer networking products

Mellanox Technologies Ltd. was an Israeli-American multinational supplier of computer networking products based on InfiniBand and Ethernet technology. Mellanox offered adapters, switches, software, cables and silicon for markets including high-performance computing, data centers, cloud computing, computer data storage and financial services.

Virtual Extensible LAN (VXLAN) is a network virtualization technology that attempts to address the scalability problems associated with large cloud computing deployments. It uses a VLAN-like encapsulation technique to encapsulate OSI layer 2 Ethernet frames within layer 4 UDP datagrams, using 4789 as the default IANA-assigned destination UDP port number, although many implementations that predate the IANA assignment use port 8472. VXLAN endpoints, which terminate VXLAN tunnels and may be either virtual or physical switch ports, are known as VXLAN tunnel endpoints (VTEPs).

<span class="mw-page-title-main">Broadcast, unknown-unicast and multicast traffic</span> Computer networking concept

Broadcast, unknown-unicast and multicast traffic is network traffic transmitted using one of three methods of sending data link layer network traffic to a destination of which the sender does not know the network address. This is achieved by sending the network traffic to multiple destinations on an Ethernet network. As a concept related to computer networking, it covers three Ethernet delivery modes: broadcast, unknown-unicast and multicast. BUM traffic refers to network traffic that will be forwarded to multiple destinations or that cannot be addressed to the intended destination only.

References

  1. "Roland's Blog » Blog Archive » Two notes on IBoE".
  2. 1 2 "InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE". InfiniBand Trade Association. 13 April 2010. Archived from the original on 9 March 2016. Retrieved 29 April 2015.
  3. 1 2 "InfiniBand™ Architecture Specification Release 1.2.1 Annex A17: RoCEv2". InfiniBand Trade Association. 2 September 2014. Archived from the original on 17 September 2020. Retrieved 19 October 2014.
  4. 1 2 3 Ophir Maor (December 2015). "RoCEv2 Considerations". Mellanox.
  5. Ophir Maor (December 2015). "RoCE and Storage Solutions". Mellanox.
  6. Cameron, Don; Regnier, Greg (2002). Virtual Interface Architecture. Intel Press. ISBN 978-0-9712887-0-6.
  7. Feldman, Michael (22 April 2010). "RoCE: An Ethernet-InfiniBand Love Story". HPC wire.
  8. "End-to-End Lowest Latency Ethernet Solution for Financial Services" (PDF). Mellanox. March 2011.
  9. "RoCE vs. iWARP Competitive Analysis Brief" (PDF). Mellanox. 9 November 2010.
  10. "Low Latency Server Connectivity With New Terminator 4 (T4) Adapter". Chelsio. 25 May 2011.
  11. Diego Crupnicoff (17 October 2014). "Service Name and Transport Protocol Port Number Registry". IANA. Retrieved 14 October 2018.
  12. InfiniBand Trade Association (November 2013). "RoCE Status and Plans" (PDF). IETF.
  13. Ophir Maor (December 2015). "RoCEv2 CNP Packet Format". Mellanox.
  14. Ophir Maor (December 2015). "RoCEv2 Congestion Management". Mellanox.
  15. "Kernel GIT". January 2016.
  16. Merritt, Rick (19 April 2010). "New converged network blends Ethernet, InfiniBand". EE Times.
  17. Kerner, Sean Michael (2 April 2010). "InfiniBand Moving to Ethernet ?". Enterprise Networking Planet.
  18. Mellanox (2 June 2014). "Mellanox Releases New Automation Software to Reduce Ethernet Fabric Installation Time from Hours to Minutes". Mellanox.
  19. "SX1036 - 36-Port 40/56GbE Switch System". Mellanox. Retrieved April 21, 2014.
  20. "IS5024 - 36-Port Non-blocking Unmanaged 40Gb/s InfiniBand Switch System". Mellanox. Retrieved April 21, 2014.
  21. Rashti, Mohammad (2010). "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet" (PDF). International Conference on High Performance Computing (HiPC).
  22. H. Shah; et al. (October 2007). "Direct Data Placement over Reliable Transports". RFC 5041. doi:10.17487/RFC5041. Retrieved May 4, 2011.
  23. C. Bestler; et al. (October 2007). Bestler, C.; Stewart, R. (eds.). "Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation". RFC 5043. doi:10.17487/RFC5043. Retrieved May 4, 2011.
  24. P. Culley; et al. (October 2007). "Marker PDU Aligned Framing for TCP Specification". RFC 5044. doi:10.17487/RFC5044. Retrieved May 4, 2011.
  25. "Intel® Ethernet 800 Series". Intel. May 2021.
  26. T Lustig; F Zhang; J Ko (October 2007). "RoCE vs. iWARP – The Next "Great Storage Debate"". Archived from the original on May 20, 2019. Retrieved August 22, 2018.
  27. "Benefits of Remote Direct Memory Access Over Routed Fabrics" (PDF). Cisco. October 2018.
  28. Dreier, Roland (6 December 2010). "Two notes on IBoE". Roland Dreier's blog.
  29. Cohen, Eli (26 August 2010). "IB/core: Add VLAN support for IBoE". kernel.org.
  30. Cohen, Eli (13 October 2010). "RDMA/cm: Add RDMA CM support for IBoE devices". kernel.org.
  31. Crawford, M. (1998). "RFC 2464 - Transmission of IPv6 Packets over Ethernet Networks". IETF. doi:10.17487/RFC2464.
  32. Hu, Shuihai; Zhu, Yibo; Cheng, Peng; Guo, Chuanxiong; Tan, Kun; Padhye, Jitendra; Chen, Kai (2016). Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them (PDF). 15th ACM Workshop on Hot Topics in Networks. pp. 92–98.
  33. Shpiner, Alex; Zahavi, Eitan; Zdornov, Vladimir; Anker, Tal; Kadosh, Matty (2016). Unlocking credit loop deadlocks. 15th ACM Workshop on Hot Topics in Networks. pp. 85–91.
  34. Mittal, Radhika; Shpiner, Alexander; Panda, Aurojit; Zahavi, Eitan; Krishnamurthy, Arvind; Ratnasamy, Sylvia; Shenker, Scott (21 June 2018). "Revisiting Network Support for RDMA". arXiv: 1806.08159 [cs.NI].
  35. "Nvidia: Mellanox Deal May Not Close Until Early 2020". 14 November 2019.
  36. "Israel's AI Ecosystem Toasts NVIDIA's Proposed Mellanox Acquisition | NVIDIA Blog". 27 March 2019.
  37. "Grovf Inc. Releases Low Latency RDMA RoCE V2 FPGA IP Core for Smart NICs". Yahoo News.