Express Data Path

Last updated
XDP
Original author(s) Brenden Blanco,
Tom Herbert
Developer(s) Open source community, Google, Amazon, Intel, Microsoft [1]
Initial release2016;8 years ago (2016)
Written in C
Operating system Linux, Windows
Type Packet filtering
License Linux: GPL
Windows: MIT License

XDP (eXpress Data Path) is an eBPF-based high-performance data path used to send and receive network packets at high rates by bypassing most of the operating system networking stack. It is merged in the Linux kernel since version 4.8. [2] This implementation is licensed under GPL. Large technology firms including Amazon, Google and Intel support its development. Microsoft released their free and open source implementation XDP for Windows in May 2022. [1] It is licensed under MIT License. [3]

Contents

Data path

Packet flow paths in the Linux kernel. XDP bypasses the networking stack and memory allocation for packet metadata. Netfilter-packet-flow.svg
Packet flow paths in the Linux kernel. XDP bypasses the networking stack and memory allocation for packet metadata.

The idea behind XDP is to add an early hook in the RX path of the kernel, and let a user supplied eBPF program decide the fate of the packet. The hook is placed in the network interface controller (NIC) driver just after the interrupt processing, and before any memory allocation needed by the network stack itself, because memory allocation can be an expensive operation. Due to this design, XDP can drop 26 million packets per second per core with commodity hardware. [4]

The eBPF program must pass a preverifier test [5] before being loaded, to avoid executing malicious code in kernel space. The preverifier checks that the program contains no out-of-bounds accesses, loops or global variables.

The program is allowed to edit the packet data and, after the eBPF program returns, an action code determines what to do with the packet:

XDP requires support in the NIC driver but, as not all drivers support it, it can fallback to a generic implementation, which performs the eBPF processing in the network stack, though with slower performance. [6]

XDP has infrastructure to offload the eBPF program to a network interface controller which supports it, reducing the CPU load. In 2023, only Netronome [7] cards support it.

Microsoft is partnering with other companies and adding support for XDP in its MsQuic implementation of the QUIC protocol. [1]

AF_XDP

Along with XDP, a new address family entered in the Linux kernel starting 4.18. [8] AF_XDP, formerly known as AF_PACKETv4 (which was never included in the mainline kernel), [9] is a raw socket optimized for high performance packet processing and allows zero-copy between kernel and applications. As the socket can be used for both receiving and transmitting, it supports high performance network applications purely in user space. [10]

See also

Related Research Articles

<span class="mw-page-title-main">Network interface controller</span> Hardware component that connects a computer to a network

A network interface controller is a computer hardware component that connects a computer to a computer network.

IPX/SPX stands for Internetwork Packet Exchange/Sequenced Packet Exchange. IPX and SPX are networking protocols used initially on networks using the Novell NetWare operating systems. They also became widely used on networks deploying Microsoft Windows LANS, as they replaced NetWare LANS, but are no longer widely used. IPX/SPX was also widely used prior to and up to Windows XP, which supported the protocols, while later Windows versions do not, and TCP/IP took over for networking.

TCP offload engine (TOE) is a technology used in some network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant. TOEs are often used as a way to reduce the overhead associated with Internet Protocol (IP) storage protocols such as iSCSI and Network File System (NFS).

<span class="mw-page-title-main">Link aggregation</span> Using multiple network connections in parallel to increase capacity and reliability

In computer networking, link aggregation is the combining of multiple network connections in parallel by any of several methods. Link aggregation increases total throughput beyond what a single connection could sustain, and provides redundancy where all but one of the physical links may fail without losing connectivity. A link aggregation group (LAG) is the combined collection of physical ports.

<span class="mw-page-title-main">Linux kernel interfaces</span> An overview and comparison of the Linux kernal APIs and ABIs.

The Linux kernel provides multiple interfaces to user-space and kernel-mode code that are used for varying purposes and that have varying properties by design. There are two types of application programming interface (API) in the Linux kernel:

  1. the "kernel–user space" API; and
  2. the "kernel internal" API.

"Zero-copy" describes computer operations in which the CPU does not perform the task of copying data from one memory area to another or in which unnecessary data copies are avoided. This is frequently used to save CPU cycles and memory bandwidth in many time consuming tasks, such as when transmitting a file at high speed over a network, etc., thus improving the performance of programs (processes) executed by a computer.

seccomp is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit , sigreturn , read and write to already-open file descriptors. Should it attempt any other system calls, the kernel will either just log the event or terminate the process with SIGKILL or SIGSYS. In this sense, it does not virtualize the system's resources but isolates the process from them entirely.

Netlink is a socket family used for inter-process communication (IPC) between both the kernel and userspace processes, and between different userspace processes, in a way similar to the Unix domain sockets available on certain Unix-like operating systems, including its original incarnation as a Linux kernel interface, as well as in the form of a later implementation on FreeBSD. Similarly to the Unix domain sockets, and unlike INET sockets, Netlink communication cannot traverse host boundaries. However, while the Unix domain sockets use the file system namespace, Netlink sockets are usually addressed by process identifiers (PIDs).

The Berkeley Packet Filter is a network tap and packet filter which permits computer network packets to be captured and filtered at the operating system level. It provides a raw interface to data link layers, permitting raw link-layer packets to be sent and received, and allows a userspace process to supply a filter program that specifies which packets it wants to receive. For example, a tcpdump process may want to receive only packets that initiate a TCP connection. BPF returns only packets that pass the filter that the process supplies. This avoids copying unwanted packets from the operating system kernel to the process, greatly improving performance. The filter program is in the form of instructions for a virtual machine, which are interpreted, or compiled into machine code by a just-in-time (JIT) mechanism and executed, in the kernel.

<span class="mw-page-title-main">SystemTap</span> Scripting language and tool

In computing, SystemTap is a scripting language and tool for dynamically instrumenting running production Linux-based operating systems. System administrators can use SystemTap to extract, filter and summarize data in order to enable diagnosis of complex performance or functional problems.

netsniff-ng Linux networking toolkit

netsniff-ng is a free Linux network analyzer and networking toolkit originally written by Daniel Borkmann. Its gain of performance is reached by zero-copy mechanisms for network packets, so that the Linux kernel does not need to copy packets from kernel space to user space via system calls such as recvmsg . libpcap, starting with release 1.0.0, also supports the zero-copy mechanism on Linux for capturing (RX_RING), so programs using libpcap also use that mechanism on Linux.

<span class="mw-page-title-main">Network scheduler</span> Arbiter on a node in packet switching communication network

A network scheduler, also called packet scheduler, queueing discipline (qdisc) or queueing algorithm, is an arbiter on a node in a packet switching communication network. It manages the sequence of network packets in the transmit and receive queues of the protocol stack and network interface controller. There are several network schedulers available for the different operating systems, that implement many of the existing network scheduling algorithms.

QUIC is a general-purpose transport layer network protocol initially designed by Jim Roskind at Google, implemented, and deployed in 2012, announced publicly in 2013 as experimentation broadened, and described at an IETF meeting. QUIC is used by more than half of all connections from the Chrome web browser to Google's servers. Microsoft Edge, Firefox and Safari support it.

The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller polling-mode drivers for offloading TCP packet processing from the operating system kernel to processes running in user space. This offloading achieves higher computing efficiency and higher packet throughput than is possible using the interrupt-driven processing provided in the kernel.

HTTP/3 is the third major version of the Hypertext Transfer Protocol used to exchange information on the World Wide Web, complementing the widely-deployed HTTP/1.1 and HTTP/2. Unlike previous versions which relied on the well-established TCP, HTTP/3 uses QUIC, a multiplexed transport protocol built on UDP. On 6 June 2022, IETF published HTTP/3 as a Proposed Standard in RFC 9114.

Microsoft, a technology company historically known for its opposition to the open source software paradigm, turned to embrace the approach in the 2010s. From the 1970s through 2000s under CEOs Bill Gates and Steve Ballmer, Microsoft viewed the community creation and sharing of communal code, later to be known as free and open source software, as a threat to its business, and both executives spoke negatively against it. In the 2010s, as the industry turned towards cloud, embedded, and mobile computing—technologies powered by open source advances—CEO Satya Nadella led Microsoft towards open source adoption although Microsoft's traditional Windows business continued to grow throughout this period generating revenues of 26.8 billion in the third quarter of 2018, while Microsoft's Azure cloud revenues nearly doubled.

<span class="mw-page-title-main">MsQuic</span> Microsoft open source library

MsQuic is a free and open source implementation of the IETF QUIC protocol written in C that is officially supported on the Microsoft Windows, Linux, and Xbox platforms. The project also provides libraries for macOS and Android, which are unsupported. It is designed to be a cross-platform general purpose QUIC library optimized for client and server applications benefitting from maximal throughput and minimal latency. By the end of 2021 the codebase had over 200,000 lines of production code, with 50,000 lines of "core" code, sharable across platforms. The source code is licensed under MIT License and available on GitHub.

io_uring is a Linux kernel system call interface for storage device asynchronous I/O operations addressing performance issues with similar interfaces provided by functions like read /write or aio_read /aio_write etc. for operations on data accessed by file descriptors.

eBPF Safe dynamic programs and tools

eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter (BPF) filtering mechanism in Linux, and is also used in other parts of the Linux kernel as well.

<span class="mw-page-title-main">Cilium (computing)</span>

Cilium is a cloud native technology for networking, observability, and security. It is based on the kernel technology eBPF, originally for better networking performance, and now leverages many additional features for different use cases. The core networking component has evolved from only providing a flat Layer 3 network for containers to including advanced networking features, like BGP and Service mesh, within a Kubernetes cluster, across multiple clusters, and connecting with the world outside Kubernetes. Hubble was created as the network observability component and Tetragon was later added for security observability and runtime enforcement. Cilium runs on Linux and is one of the first eBPF applications being ported to Microsoft Windows through the eBPF on Windows project.

References

  1. 1 2 3 Jawad, Usama (25 May 2022). "Microsoft brings Linux XDP project to Windows". Neowin . Retrieved 26 May 2022.
  2. "[GIT] Networking - David Miller". lore.kernel.org. Retrieved 2019-05-14.
  3. Yasar, Erdem (25 May 2022). "Microsoft introduced open-source XDP for Windows". cloud7 . Retrieved 26 May 2022.
  4. Høiland-Jørgensen, Toke (2019-05-03), Source text and experimental data for our paper describing XDP: tohojo/xdp-paper , retrieved 2019-05-21
  5. "A thorough introduction to eBPF [LWN.net]". lwn.net. Retrieved 2019-05-14.
  6. "net: Generic XDP". www.mail-archive.com. Retrieved 2019-05-14.
  7. "BPF, eBPF, XDP and Bpfilter… What are these things and what do they mean for the enterprise? - Netronome". www.netronome.com. Retrieved 2019-05-14.
  8. "kernel/git/torvalds/linux.git - Linux kernel source tree". git.kernel.org. Retrieved 2019-05-16.
  9. "Questions about AF_PACKET V4 and AF_XDP". Kernel.org.
  10. "Accelerating networking with AF_XDP [LWN.net]". lwn.net. Retrieved 2019-05-16.