TCP offload engine

Last updated

TCP offload engine (TOE) is a technology used in some network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant. TOEs are often used [1] as a way to reduce the overhead associated with Internet Protocol (IP) storage protocols such as iSCSI and Network File System (NFS).

Contents

Purpose

Originally TCP was designed for unreliable low speed networks (such as early dial-up modems) but with the growth of the Internet in terms of backbone transmission speeds (using Optical Carrier, Gigabit Ethernet and 10 Gigabit Ethernet links) and faster and more reliable access mechanisms (such as DSL and cable modems) it is frequently used in data centers and desktop PC environments at speeds of over 1 Gigabit per second. At these speeds the TCP software implementations on host systems require significant computing power. In the early 2000s, full-duplex gigabit TCP communication could consume more than 80% of a 2.4 GHz Pentium 4 processor, [2] resulting in small or no processing resources left for the applications to run on the system.

TCP is a connection-oriented protocol which adds complexity and processing overhead. These aspects include:

Moving some or all of these functions to dedicated hardware, a TCP offload engine, frees the system's main CPU for other tasks.

Freed-up CPU cycles

A generally accepted rule of thumb is that 1 Hertz of CPU processing is required to send or receive 1  bit/s of TCP/IP. [2] For example, 5 Gbit/s (625 MB/s) of network traffic requires 5 GHz of CPU processing. This implies that 2 entire cores of a 2.5 GHz multi-core processor will be required to handle the TCP/IP processing associated with 5 Gbit/s of TCP/IP traffic. Since Ethernet (10GE in this example) is bidirectional, it is possible to send and receive 10 Gbit/s (for an aggregate throughput of 20 Gbit/s). Using the 1 Hz/(bit/s) rule this equates to eight 2.5 GHz cores.

Many of the CPU cycles used for TCP/IP processing are freed-up by TCP/IP offload and may be used by the CPU (usually a server CPU) to perform other tasks such as file system processing (in a file server) or indexing (in a backup media server). In other words, a server with TCP/IP offload can do more server work than a server without TCP/IP offload NICs.

Reduction of PCI traffic

In addition to the protocol overhead that TOE can address, it can also address some architectural issues that affect a large percentage of host based (server and PC) endpoints. Many older end point hosts are PCI bus based, which provides a standard interface for the addition of certain peripherals such as Network Interfaces to Servers and PCs. PCI is inefficient for transferring small bursts of data from main memory, across the PCI bus to the network interface ICs, but its efficiency improves as the data burst size increases. Within the TCP protocol, a large number of small packets are created (e.g. acknowledgements) and as these are typically generated on the host CPU and transmitted across the PCI bus and out the network physical interface, this impacts the host computer IO throughput.

A TOE solution, located on the network interface, is located on the other side of the PCI bus from the CPU host so it can address this I/O efficiency issue, as the data to be sent across the TCP connection can be sent to the TOE from the CPU across the PCI bus using large data burst sizes with none of the smaller TCP packets having to traverse the PCI bus.

History

One of the first patents in this technology, for UDP offload, was issued to Auspex Systems in early 1990. [3] Auspex founder Larry Boucher and a number of Auspex engineers went on to found Alacritech in 1997 with the idea of extending the concept of network stack offload to TCP and implementing it in custom silicon. They introduced the first parallel-stack full offload network card in early 1999; the company's SLIC (Session Layer Interface Card) was the predecessor to its current TOE offerings. Alacritech holds a number of patents in the area of TCP/IP offload. [4]

By 2002, as the emergence of TCP-based storage such as iSCSI spurred interest, it was said that "At least a dozen newcomers, most founded toward the end of the dot-com bubble, are chasing the opportunity for merchant semiconductor accelerators for storage protocols and applications, vying with half a dozen entrenched vendors and in-house ASIC designs." [5]

In 2005 Microsoft licensed Alacritech's patent base and along with Alacritech created the partial TCP offload architecture that has become known as TCP chimney offload. TCP chimney offload centers on the Alacritech "Communication Block Passing Patent". At the same time, Broadcom also obtained a license to build TCP chimney offload chips.

Types

Instead of replacing the TCP stack with a TOE entirely, there are alternative techniques to offload some operations in co-operation with the operating system's TCP stack. TCP checksum offload and large segment offload are supported by the majority of today's Ethernet NICs. Newer techniques like large receive offload and TCP acknowledgment offload are already implemented in some high-end Ethernet hardware, but are effective even when implemented purely in software. [6] [7]

Parallel-stack full offload

Parallel-stack full offload gets its name from the concept of two parallel TCP/IP Stacks. The first is the main host stack which is included with the host OS. The second or "parallel stack" is connected between the Application Layer and the Transport Layer (TCP) using a "vampire tap". The vampire tap intercepts TCP connection requests by applications and is responsible for TCP connection management as well as TCP data transfer. Many of the criticisms in the following section relate to this type of TCP offload.

HBA full offload

HBA (Host Bus Adapter) full offload is found in iSCSI host adapters which present themselves as disk controllers to the host system while connecting (via TCP/IP) to an iSCSI storage device. This type of TCP offload not only offloads TCP/IP processing but it also offloads the iSCSI initiator function. Because the HBA appears to the host as a disk controller, it can only be used with iSCSI devices and is not appropriate for general TCP/IP offload.

TCP chimney partial offload

TCP chimney offload addresses the major security criticism of parallel-stack full offload. In partial offload, the main system stack controls all connections to the host. After a connection has been established between the local host (usually a server) and a foreign host (usually a client) the connection and its state are passed to the TCP offload engine. The heavy lifting of data transmit and receive is handled by the offload device. Almost all TCP offload engines use some type of TCP/IP hardware implementation to perform the data transfer without host CPU intervention. When the connection is closed, the connection state is returned from the offload engine to the main system stack. Maintaining control of TCP connections allows the main system stack to implement and control connection security.

Large receive offload

Large receive offload (LRO) is a technique for increasing inbound throughput of high-bandwidth network connections by reducing central processing unit (CPU) overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. Linux implementations generally use LRO in conjunction with the New API (NAPI) to also reduce the number of interrupts.

According to benchmarks, even implementing this technique entirely in software can increase network performance significantly. [6] [7] [8] As of April  2007, the Linux kernel supports LRO for TCP in software only. FreeBSD 8 supports LRO in hardware on adapters that support it. [9] [10] [11] [12]

LRO should not operate on machines acting as routers, as it breaks the end-to-end principle and can significantly impact performance. [13] [14]

Generic receive offload

Generic receive offload (GRO) implements a generalised LRO in software that isn't restricted to TCP/IPv4 or have the issues created by LRO. [15] [16]

Large send offload

In computer networking, large send offload (LSO) is a technique for increasing egress throughput of high-bandwidth network connections by reducing CPU overhead. It works by passing a multipacket buffer to the network interface card (NIC). The NIC then splits this buffer into separate packets. The technique is also called TCP segmentation offload (TSO) or generic segmentation offload (GSO) when applied to TCP. LSO and LRO are independent and use of one does not require the use of the other.

When a system needs to send large chunks of data out over a computer network, the chunks first need breaking down into smaller segments that can pass through all the network elements like routers and switches between the source and destination computers. This process is referred to as segmentation . Often the TCP protocol in the host computer performs this segmentation. Offloading this work to the NIC is called TCP segmentation offload (TSO).

For example, a unit of 64 KiB (65,536 bytes) of data is usually segmented to 45 segments of 1460 bytes each before it is sent through the NIC and over the network. With some intelligence in the NIC, the host CPU can hand over the 64 KB of data to the NIC in a single transmit-request, the NIC can break that data down into smaller segments of 1460 bytes, add the TCP, IP, and data link layer protocol headers — according to a template provided by the host's TCP/IP stack — to each segment, and send the resulting frames over the network. This significantly reduces the work done by the CPU. As of 2014 many new NICs on the market support TSO.

Some network cards implement TSO generically enough that it can be used for offloading fragmentation of other transport layer protocols, or for doing IP fragmentation for protocols that don't support fragmentation by themselves, such as UDP.

Support in Linux

Unlike other operating systems, such as FreeBSD, the Linux kernel does not include support for TOE (not to be confused with other types of network offload). [17] While there are patches from the hardware manufacturers such as Chelsio or Qlogic that add TOE support, the Linux kernel developers are opposed to this technology for several reasons: [18]

Suppliers

Much of the current work on TOE technology is by manufacturers of 10 Gigabit Ethernet interface cards, such as Broadcom, Chelsio Communications, Emulex, Mellanox Technologies, QLogic.

See also

Related Research Articles

<span class="mw-page-title-main">SCSI</span> Set of computer and peripheral connection standards

Small Computer System Interface is a set of standards for physically connecting and transferring data between computers and peripheral devices, best known for its use with storage devices such as hard disk drives. SCSI was introduced in the 1980s and has seen widespread use on servers and high-end workstations, with new SCSI standards being published as recently as SAS-4 in 2017.

<span class="mw-page-title-main">Wake-on-LAN</span> Mechanism to wake up computers via a network

Wake-on-LAN is an Ethernet or Token Ring computer networking standard that allows a computer to be turned on or awakened from sleep mode by a network message.

Internet Small Computer Systems Interface or iSCSI is an Internet Protocol-based storage networking standard for linking data storage facilities. iSCSI provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network. iSCSI facilitates data transfers over intranets and to manage storage over long distances. It can be used to transmit data over local area networks (LANs), wide area networks (WANs), or the Internet and can enable location-independent data storage and retrieval.

<span class="mw-page-title-main">InfiniBand</span> Network standard

InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. It is designed to be scalable and uses a switched fabric network topology. By 2014, it was the most commonly used interconnect in the TOP500 list of supercomputers, until about 2016.

<span class="mw-page-title-main">Network interface controller</span> Hardware component that connects a computer to a network

A network interface controller is a computer hardware component that connects a computer to a computer network.

<span class="mw-page-title-main">Quadrics (company)</span>

Quadrics was a supercomputer company formed in 1996 as a joint venture between Alenia Spazio and the technical team from Meiko Scientific. They produced hardware and software for clustering commodity computer systems into massively parallel systems. Their highpoint was in June 2003 when six out of the ten fastest supercomputers in the world were based on Quadrics' interconnect. They officially closed on June 29, 2009.

Netfilter is a framework provided by the Linux kernel that allows various networking-related operations to be implemented in the form of customized handlers. Netfilter offers various functions and operations for packet filtering, network address translation, and port translation, which provide the functionality required for directing packets through a network and prohibiting packets from reaching sensitive locations within a network.

Local Area Transport (LAT) is a non-routable networking technology developed by Digital Equipment Corporation to provide connection between the DECserver terminal servers and Digital's VAX and Alpha and MIPS host computers via Ethernet, giving communication between those hosts and serial devices such as video terminals and printers. The protocol itself was designed in such a manner as to maximize packet efficiency over Ethernet by bundling multiple characters from multiple ports into a single packet for Ethernet transport.

<span class="mw-page-title-main">Link aggregation</span> Using multiple network connections in parallel to increase capacity and reliability

In computer networking, link aggregation is the combining of multiple network connections in parallel by any of several methods. Link aggregation increases total throughput beyond what a single connection could sustain, and provides redundancy where all but one of the physical links may fail without losing connectivity. A link aggregation group (LAG) is the combined collection of physical ports.

ATA over Ethernet (AoE) is a network protocol developed by the Brantley Coile Company, designed for simple, high-performance access of block storage devices over Ethernet networks. It is used to build storage area networks (SANs) with low-cost, standard technologies.

In computer networking, jumbo frames are Ethernet frames with more than 1500 bytes of payload, the limit set by the IEEE 802.3 standard. The payload limit for jumbo frames is variable: while 9000 bytes is the most commonly used limit, smaller and larger limits exist. Many Gigabit Ethernet switches and Gigabit Ethernet network interface controllers and some Fast Ethernet switches and Fast Ethernet network interface cards can support jumbo frames.

A network socket is a software structure within a network node of a computer network that serves as an endpoint for sending and receiving data across the network. The structure and properties of a socket are defined by an application programming interface (API) for the networking architecture. Sockets are created only during the lifetime of a process of an application running in the node.

Ethernet over USB is the use of a USB link as a part of an Ethernet network, resulting in an Ethernet connection over USB.

iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Contrary to some accounts, iWARP is not an acronym.

In computing, Microsoft's Windows Vista and Windows Server 2008 introduced in 2007/2008 a new networking stack named Next Generation TCP/IP stack, to improve on the previous stack in several ways. The stack includes native implementation of IPv6, as well as a complete overhaul of IPv4. The new TCP/IP stack uses a new method to store configuration settings that enables more dynamic control and does not require a computer restart after a change in settings. The new stack, implemented as a dual-stack model, depends on a strong host-model and features an infrastructure to enable more modular components that one can dynamically insert and remove.

<span class="mw-page-title-main">Fibre Channel over Ethernet</span> Computer network technology

Fibre Channel over Ethernet (FCoE) is a computer network technology that encapsulates Fibre Channel frames over Ethernet networks. This allows Fibre Channel to use 10 Gigabit Ethernet networks while preserving the Fibre Channel protocol. The specification was part of the International Committee for Information Technology Standards T11 FC-BB-5 standard published in 2009.

The uIP is an open-source implementation of the TCP/IP network protocol stack intended for use with tiny 8- and 16-bit microcontrollers. It was initially developed by Adam Dunkels of the Networked Embedded Systems group at the Swedish Institute of Computer Science, licensed under a BSD style license, and further developed by a wide group of developers.

<span class="mw-page-title-main">Storage area network</span> Network which provides access to consolidated, block-level data storage

A storage area network (SAN) or storage network is a computer network which provides access to consolidated, block-level data storage. SANs are primarily used to access data storage devices, such as disk arrays and tape libraries from servers so that the devices appear to the operating system as direct-attached storage. A SAN typically is a dedicated network of storage devices not accessible through the local area network (LAN).

<span class="mw-page-title-main">LIO (SCSI target)</span> Open-source version of SCSI target

In computing, Linux-IO (LIO) Target is an open-source implementation of the SCSI target that has become the standard one included in the Linux kernel. Internally, LIO does not initiate sessions, but instead provides one or more Logical Unit Numbers (LUNs), waits for SCSI commands from a SCSI initiator, and performs required input/output data transfers. LIO supports common storage fabrics, including FCoE, Fibre Channel, IEEE 1394, iSCSI, iSCSI Extensions for RDMA (iSER), SCSI RDMA Protocol (SRP) and USB. It is included in most Linux distributions; native support for LIO in QEMU/KVM, libvirt, and OpenStack makes LIO also a storage option for cloud deployments.

<span class="mw-page-title-main">Dell M1000e</span> Server computer

The Dell blade server products are built around their M1000e enclosure that can hold their server blades, an embedded EqualLogic iSCSI storage area network and I/O modules including Ethernet, Fibre Channel and InfiniBand switches.

References

  1. Jeffrey C. Mogul (2003-05-18). TCP Offload Is a Dumb Idea Whose Time Has Come. HotOS. Usenix.
  2. 1 2 Annie P. Foong; Thomas R. Huff; Herbert H. Hum; Jaidev P. Patwardhan; Greg J. Regnier (2003-04-02). TCP performance re-visited (PDF). Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). Austin, Texas.
  3. United States Patent: 5355453 "Parallel I/O network file server architecture category"
  4. United States Patent: 6247060 "Passing a Communication Block from Host to a Local Device such that a message is processed on the Device"
  5. "Newcomers spin storage network silicon ", Rick Merritt, 10/21/2002, EE Times
  6. 1 2 Jonathan Corbet (2007-08-01). "Large receive offload". LWN.net . Retrieved 2007-08-22.
  7. 1 2 Aravind Menon; Willy Zwaenepoel (2008-04-28). Optimizing TCP Receive Performance. USENIX Annual Technical Conference. USENIX.
  8. Andrew Gallatin (2007-07-25). "lro: Generic Large Receive Offload for TCP traffic". linux-kernel (Mailing list). Retrieved 2007-08-22.
  9. "Cxgb". Freebsd.org. Retrieved 12 July 2018.
  10. "Mxge". Freebsd.org. Retrieved 12 July 2018.
  11. "Nxge". Freebsd.org. Retrieved 12 July 2018.
  12. "Poor TCP performance can occur in Linux virtual machines with LRO enabled". VMware. 2011-07-04. Retrieved 2011-08-17.
  13. "Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of Adapters". Intel Corporation. 2013-02-12. Retrieved 2013-04-24.
  14. "Disable LRO for all NICs that have LRO enabled". Red Hat, Inc. 2013-01-10. Retrieved 2013-04-24.
  15. "JLS2009: Generic receive offload". lwn.net .
  16. Huang, Shu; Baldine, Ilia (March 2012). Schmitt, Jens B. (ed.). Performance Evaluation of 10GE NICs with SR-IOV Support: I/O Virtualization and network Stack Optimizations. Measurement, Modeling, and Evaluation of Computing Systems and Dependability and Fault Tolerance: 16th International GI/ITG Conference, MMB & DFT 2012. Lecture Notes in Computer Science. Vol. 7201. Kaiserslautern, Germany: Springer (published 2012). p. 198. ISBN   9783642285400 . Retrieved 2016-10-11. Large-Receive-Offload (LRO) reduces the per-packet processing overhead by aggregating smaller packets into larger ones and passing them up to the network stack. Generic-Receive-Offload (GRO) provides a generalized software version of LRO [...].
  17. "Linux and TCP offload engines", August 22, 2005, LWN.net
  18. Networking:TOE, Linux Foundation.