Virtual Interface Architecture

The Virtual Interface Architecture (VIA) is an abstract model of a user-level zero-copy network, and is the basis for InfiniBand, iWARP and RoCE. Created by Microsoft, Intel, and Compaq, the original VIA sought to standardize the interface for high-performance network technologies known as System Area Networks (SANs; not to be confused with Storage Area Networks).

Networks are a shared resource. With traditional network APIs such as the Berkeley socket API, the kernel is involved in every communication: each send and receive requires a system call, a context switch, and typically a copy through kernel buffers. This overhead becomes a serious bottleneck for latency-sensitive applications.

One of the classic developments in computing systems is virtual memory, a combination of hardware and software that creates the illusion of private memory for each process. Applying the same idea to networking yields a virtual network interface, protected across process boundaries, that can be accessed directly at the user level. With this technology, the "consumer" (the application process) manages its own buffers and communication schedule, while the "provider" (the NIC and operating system together) handles the protection.

Thus, the network interface card (NIC) provides each process with a "private network", and a process may own several of them. The virtual interface (VI) of VIA refers to such a network endpoint: it is the object to which the user posts its communication requests. Communication takes place over a pair of VIs, one on each of the processing nodes involved in the transmission. Because requests are posted directly to the VI, the kernel is bypassed on the data path; hence the term "kernel-bypass" communication, in which the user manages its own buffers.
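VIA itself predates today's hardware, but its VI maps almost directly onto the queue pair (QP) of the InfiniBand verbs API that grew out of it. As a rough illustrative sketch (not part of the VIA specification; the device index, queue depths, and omitted error handling are simplifying assumptions), the user-level setup with Linux libibverbs looks like this:

    /* Minimal kernel-bypass setup with libibverbs (InfiniBand verbs),
     * a modern descendant of VIA's virtual interface model.
     * Error handling is omitted for brevity. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]); /* first RDMA NIC */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);              /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

        /* A queue pair (send queue + receive queue) plays the role of a VI:
         * the process posts work requests to it directly, with no system
         * call on the data path. */
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,  /* reliable connection: one peer per QP */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        printf("QP number: %u\n", qp->qp_num);

        ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }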

Another facet of traditional networks is that arriving data is placed in a pre-allocated buffer and then copied to the user-specified final destination. Copying large messages can take a long time, and so eliminating this step is beneficial. Another classic development in computing systems is direct memory access (DMA), in which a device can access main memory directly while the CPU is free to perform other tasks.

In a network with "remote direct memory access" (RDMA), the sending NIC uses DMA to read data from the user-specified buffer and transmit it as a self-contained message across the network. The receiving NIC then uses DMA to place the data into the user-specified buffer. There is no intermediate copying, and all of these actions occur without involving either CPU, leaving the processors free for other work.
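Using the same libibverbs API for illustration, a one-sided RDMA write can be posted and completed entirely from user space. This is a minimal sketch, assuming the queue pair is already connected and the peer's buffer address and access key (remote_addr, rkey) were exchanged out of band; the helper name rdma_write is ours:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Write `len` bytes from a registered local buffer into the peer's
     * registered memory. The NIC DMAs the data out locally and in remotely;
     * no CPU copies the payload on either side. */
    static int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                          void *local_buf, uint32_t len, struct ibv_mr *mr,
                          uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,  /* registered local buffer */
            .length = len,
            .lkey   = mr->lkey,              /* local protection key */
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED, /* request a completion entry */
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad;
        if (ibv_post_send(qp, &wr, &bad))    /* posted directly to the NIC */
            return -1;

        struct ibv_wc wc;                    /* poll for completion: again,
                                              * no system call involved */
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }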

For the NIC to access the data through DMA, the user's pages must be resident in physical memory. In VIA, the user must therefore "pin down" its buffers before transmission, preventing the OS from swapping the pages out to disk. This action, one of the few that involve the kernel, ties the pages to physical memory. To ensure that only the process that owns the registered memory may access it, VIA NICs require permission keys known as "protection tags" during communication.
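In the verbs API descended from VIA, pin-down and protection tags correspond to memory registration and the lkey/rkey pair it returns. A minimal sketch, assuming an already-allocated protection domain pd (the helper name pin_buffer and the chosen access flags are illustrative):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Registering ("pinning") a buffer: one of the few kernel-mediated
     * steps. The kernel locks the pages into physical memory, the NIC
     * records the virtual-to-physical mapping, and the returned keys
     * (lkey/rkey) act as the protection tags that gate all later access. */
    static struct ibv_mr *pin_buffer(struct ibv_pd *pd, size_t size)
    {
        void *buf = malloc(size);
        if (!buf)
            return NULL;
        return ibv_reg_mr(pd, buf, size,
                          IBV_ACCESS_LOCAL_WRITE |   /* NIC may DMA into it */
                          IBV_ACCESS_REMOTE_WRITE);  /* peers may RDMA-write */
    }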

In essence, then, VIA is a standard that defines kernel bypass and RDMA for a network, along with a C programming API called the Virtual Interface Provider Library (VIPL). It has been implemented commercially, most notably in cLAN from Giganet (now Emulex), but VIA's major contribution has been in providing the basis for the InfiniBand, iWARP and RoCE standards.
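The VIPL call sequence for setting up and using a connection follows the function names from the VIA specification; the argument lists are deliberately abridged here, so treat this as an outline rather than compilable code:

    /* Sketch of a VIPL client's setup sequence, using function names from
     * the VIA specification; consult the spec for exact signatures. */
    VipOpenNic(...);        /* open the NIC and obtain a NIC handle          */
    VipCreatePtag(...);     /* create a protection tag for this process      */
    VipRegisterMem(...);    /* pin and register send/receive buffers         */
    VipCreateVi(...);       /* create a virtual interface (send+recv queues) */
    VipConnectRequest(...); /* connect the VI to a peer VI on a remote node  */
    VipPostRecv(...);       /* pre-post a receive descriptor                 */
    VipPostSend(...);       /* post a send descriptor; NIC DMAs the data out */
    VipSendWait(...);       /* wait for the send descriptor to complete      */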


Related Research Articles

<span class="mw-page-title-main">Operating system</span> Software that manages computer hardware resources

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.

Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).

<span class="mw-page-title-main">InfiniBand</span> Network standard

InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. It is designed to be scalable and uses a switched fabric network topology. Between 2014 and June 2016, it was the most commonly used interconnect in the TOP500 list of supercomputers.

<span class="mw-page-title-main">Network interface controller</span> Hardware component that connects a computer to a network

A network interface controller is a computer hardware component that connects a computer to a computer network.

Memory-mapped I/O (MMIO) and port-mapped I/O (PMIO) are two complementary methods of performing input/output (I/O) between the central processing unit (CPU) and peripheral devices in a computer. An alternative approach is using dedicated I/O processors, commonly known as channels on mainframe computers, which execute their own instructions.

TCP offload engine (TOE) is a technology used in some network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant. TOEs are often used as a way to reduce the overhead associated with Internet Protocol (IP) storage protocols such as iSCSI and Network File System (NFS).

In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.

The Direct Rendering Manager (DRM) is a subsystem of the Linux kernel responsible for interfacing with GPUs of modern video cards. DRM exposes an API that user-space programs can use to send commands and data to the GPU and perform operations such as configuring the mode setting of the display. DRM was first developed as the kernel-space component of the X Server Direct Rendering Infrastructure, but since then it has been used by other graphic stack alternatives such as Wayland and standalone applications and libraries such as SDL2 and Kodi.

"Zero-copy" describes computer operations in which the CPU does not perform the task of copying data from one memory area to another or in which unnecessary data copies are avoided. This is frequently used to save CPU cycles and memory bandwidth in many time consuming tasks, such as when transmitting a file at high speed over a network, etc., thus improving the performance of programs (processes) executed by a computer.

<span class="mw-page-title-main">Input–output memory management unit</span> Configuration in computing

In computing, an input–output memory management unit (IOMMU) is a memory management unit (MMU) connecting a direct-memory-access–capable (DMA-capable) I/O bus to the main memory. Like a traditional MMU, which translates CPU-visible virtual addresses to physical addresses, the IOMMU maps device-visible virtual addresses to physical addresses. Some units also provide memory protection from faulty or malicious devices.

iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Contrary to some accounts, iWARP is not an acronym.

The iSCSI Extensions for RDMA (iSER) is a computer network protocol that extends the Internet Small Computer System Interface (iSCSI) protocol to use Remote Direct Memory Access (RDMA). The RDMA service can be provided by iWARP, which runs over the Transmission Control Protocol (TCP) and therefore works on existing Ethernet infrastructure without major hardware investment; by RoCE, which omits the TCP layer and therefore offers lower latency; or by InfiniBand. iSER permits data to be transferred directly into and out of SCSI computer memory buffers without intermediate data copies and with little CPU intervention.

<span class="mw-page-title-main">OpenFabrics Alliance</span> Organization

The OpenFabrics Alliance is a non-profit organization that promotes remote direct memory access (RDMA) switched fabric technologies for server and storage connectivity. These high-speed data-transport technologies are used in high-performance computing facilities, in research and various industries.

In computing the SCSI RDMA Protocol (SRP) is a protocol that allows one computer to access SCSI devices attached to another computer via remote direct memory access (RDMA). The SRP protocol is also known as the SCSI Remote Protocol. The use of RDMA makes higher throughput and lower latency possible than what is generally available through e.g. the TCP/IP communication protocol.

<span class="mw-page-title-main">LIO (SCSI target)</span> Open-source version of SCSI target

In computing, Linux-IO (LIO) Target is an open-source implementation of the SCSI target that has become the standard one included in the Linux kernel. Internally, LIO does not initiate sessions, but instead provides one or more Logical Unit Numbers (LUNs), waits for SCSI commands from a SCSI initiator, and performs required input/output data transfers. LIO supports common storage fabrics, including FCoE, Fibre Channel, IEEE 1394, iSCSI, iSCSI Extensions for RDMA (iSER), SCSI RDMA Protocol (SRP) and USB. It is included in most Linux distributions; native support for LIO in QEMU/KVM, libvirt, and OpenStack makes LIO also a storage option for cloud deployments.

RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE) is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.

The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller polling-mode drivers for offloading network packet processing from the operating system kernel to processes running in user space. This offloading achieves higher computing efficiency and higher packet throughput than is possible using the interrupt-driven processing provided in the kernel.
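The heart of a DPDK application is a poll-mode receive loop that spins on the NIC's RX ring in user space. A minimal sketch, assuming EAL initialization and port/queue configuration have been done elsewhere (the burst size and helper name are arbitrary choices):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Core of a DPDK poll-mode receive loop: the application polls the
     * NIC's RX ring directly instead of waiting for interrupts. */
    static void rx_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(port_id, 0 /* queue */,
                                          bufs, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... process packet bufs[i] ... */
                rte_pktmbuf_free(bufs[i]);  /* return mbuf to its pool */
            }
        }
    }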

Omni-Path Architecture (OPA) is a high-performance communication architecture developed by Intel. It aims for low communication latency, low power consumption and a high throughput. It directly competes with InfiniBand. Intel planned to develop technology based on this architecture for exascale computing. The current owner of Omni-Path is Cornelis Networks.

XDP is an eBPF-based high-performance data path used to send and receive network packets at high rates by bypassing most of the operating system networking stack. It has been part of the mainline Linux kernel since version 4.8 and is licensed under the GPL. Large technology firms including Amazon, Google and Intel support its development. Microsoft released its free and open-source implementation, XDP for Windows, in May 2022 under the MIT License.
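An XDP program is a small eBPF function attached to a driver's receive path, running before the kernel allocates socket buffers. The toy example below (our illustration, not code from the XDP project) drops UDP packets and passes everything else up the stack; compile with clang targeting BPF:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;            /* truncated frame: let it through */
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;
        return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";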

Compute Express Link (CXL) is an open standard for high-speed, high-capacity central processing unit (CPU)-to-device and CPU-to-memory connections, designed for high-performance data center computers. CXL is built on the serial PCI Express (PCIe) physical and electrical interface and includes a PCIe-based block input/output protocol (CXL.io) and new cache-coherent protocols for accessing system memory (CXL.cache) and device memory (CXL.mem). The serial communication and pooling capabilities allow CXL memory to overcome the performance and socket-packaging limitations of common DIMM memory when implementing high storage capacities.