Original author(s) | Alexei Starovoitov, Daniel Borkmann [1] [2] |
---|---|
Developer(s) | Open source community, Meta, Google, Isovalent, Microsoft, Netflix [1] |
Initial release | 2014[3] |
Repository | Linux: git Windows: github |
Written in | C |
Operating system | Linux, Windows [4] |
Type | Runtime system |
License | Linux: GPL Windows: MIT License |
Website | ebpf.io |
eBPF is a technology that can run programs in a privileged context such as the operating system kernel. [5] It is the successor to the Berkeley Packet Filter (BPF, with the "e" originally meaning "extended") filtering mechanism in Linux and is also used in non-networking parts of the Linux kernel.
It is used to safely and efficiently extend the capabilities of the kernel at runtime without requiring changes to kernel source code or loading kernel modules. [6] Safety is provided through an in-kernel verifier which performs static code analysis and rejects programs that could crash, hang or otherwise interfere with the kernel negatively. [7] [8]
This validation model differs from sandboxed environments, where the execution environment is restricted and the runtime has no insight into the program. [9] Examples of programs that are automatically rejected are programs without strong exit guarantees (e.g. for/while loops without exit conditions) and programs dereferencing pointers without safety checks. [10]
Loaded programs that pass the verifier are either interpreted or in-kernel just-in-time compiled (JIT compiled) for native execution performance. The execution model is event-driven and, with few exceptions, run-to-completion, [2] meaning that programs can be attached to various hook points in the operating system kernel and are run when an event is triggered. eBPF use cases include (but are not limited to) networking such as XDP, tracing and security subsystems. [5] Because eBPF's efficiency and flexibility opened up new possibilities for solving production issues, Brendan Gregg famously dubbed eBPF "superpowers for Linux". [11] Linus Torvalds said, "BPF has actually been really useful, and the real power of it is how it allows people to do specialized code that isn't enabled until asked for". [12] Due to its success in Linux, the eBPF runtime has been ported to other operating systems such as Windows. [4]
eBPF evolved from the classic Berkeley Packet Filter (cBPF, a retroactively-applied name). At the most basic level, it introduced the use of ten 64-bit registers (instead of two 32-bit long registers for cBPF), different jump semantics, a call instruction and corresponding register passing convention, new instructions, and a different encoding for these instructions. [13]
Date | Event |
---|---|
April 2011 | The first in-kernel Linux just-in-time compiler (JIT compiler) for the classic Berkeley Packet Filter was merged. [14] |
January 2012 | The first non-networking use case of the classic Berkeley Packet Filter, seccomp-bpf, [15] appeared; it allows filtering of system calls using a configurable policy implemented through BPF instructions. |
March 2014 | David S. Miller, primary maintainer of the Linux networking stack, accepted the rework of the old in-kernel BPF interpreter. It was replaced by an eBPF interpreter and the Linux kernel internally translates classic BPF (cBPF) into eBPF instructions. [16] It was released in version 3.18 of the Linux kernel. [17] |
March 2015 | The ability to attach eBPF to kprobes, the first tracing use case, was merged. [19] In the same month, initial infrastructure work was accepted to attach eBPF to the networking traffic control (tc) layer, allowing eBPF programs to be attached to the core ingress and, later, egress paths of the network stack. This was later heavily used by projects such as Cilium. [20] [21] [22] |
August 2015 | The eBPF compiler backend was merged into the LLVM 3.7.0 release. [23] |
September 2015 | Brendan Gregg announced a collection of new eBPF-based tracing tools as the bcc project, providing a front-end for eBPF to make it easier to write programs. [24] |
July 2016 | eBPF gained the ability to be attached to the core receive path of network drivers. This layer is known today as eXpress DataPath (XDP) and was added as a response to DPDK to create a fast data path which works in combination with the Linux kernel rather than bypassing it. [25] [26] [27] |
August 2016 | Cilium was initially announced during LinuxCon as a project providing fast IPv6 container networking with eBPF and XDP. Today, Cilium has been adopted by major cloud providers' Kubernetes offerings and is one of the most widely used CNIs. [28] [22] [29] |
November 2016 | Netronome added offload of eBPF programs for XDP and tc BPF layer to their NIC. [30] |
May 2017 | Meta's layer 4 load-balancer, Katran, went live. Every packet towards facebook.com since then has been processed by eBPF & XDP. [31] |
September 2017 | Bpftool was added to the Linux kernel as a user space utility to introspect the eBPF subsystem. [33] |
November 2017 | eBPF became its own kernel subsystem to ease the continuously growing kernel patch management. The first pull request by eBPF maintainers was submitted. [32] |
January 2018 | A new socket family called AF_XDP was published, allowing for high performance packet processing with zero-copy semantics at the XDP layer. [34] Today, DPDK has official AF_XDP poll-mode driver support. [35] |
February 2018 | The bpfilter prototype was published, allowing translation of a subset of iptables rulesets into eBPF via a newly developed user mode driver. The work caused controversy due to the ongoing nftables development effort and has not been merged into mainline. [36] [37] |
October 2018 | The new bpftrace tool was announced by Brendan Gregg as DTrace 2.0 for Linux. [38] |
November 2018 | eBPF introspection was added for kTLS to support in-kernel TLS policy enforcement. [39] |
November 2018 | BTF (BPF Type Format) was added to the Linux kernel as an efficient metadata format that is approximately 100x smaller than DWARF. [40] |
December 2019 | The first book on BPF, 880 pages long and written by Brendan Gregg, was released. [41] |
March 2020 | Google upstreamed BPF LSM support into the Linux kernel, enabling programmable Linux Security Modules (LSMs) through eBPF. [42] |
September 2020 | The eBPF compiler backend for GNU Compiler Collection (GCC) was merged. [43] |
July 2022 | Microsoft released eBPF for Windows, which runs code in the NT kernel. [4] |
October 2024 | The eBPF instruction set architecture (ISA) was published as RFC 9669. |
eBPF maps are efficient key/value stores that reside in kernel space and can be used to share data among multiple eBPF programs or to communicate between a user space application and eBPF code running in the kernel. eBPF programs can leverage eBPF maps to store and retrieve data in a wide set of data structures. Map implementations are provided by the core kernel. There are various types, [44] including hash maps, arrays, and ring buffers.
In practice, eBPF maps are typically used for scenarios such as a user space program writing configuration information to be retrieved by an eBPF program, an eBPF program storing state for later retrieval by another eBPF program (or a future run of the same program), or an eBPF program writing results or metrics into a map for retrieval by a user space program that will present results. [45]
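The usage pattern described above can be illustrated with a small user-space model. This is only a sketch in Python, not kernel code: `ToyHashMap` and `count_packet` are hypothetical names, the size check is a simplification, and only the update flags and the error codes are meant to mirror how the real hash map behaves.

```python
import errno

# Update flags; the numeric values match the Linux UAPI constants.
BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2


class ToyHashMap:
    """User-space model of an eBPF hash map (illustrative only)."""

    def __init__(self, key_size, value_size, max_entries):
        self.key_size = key_size
        self.value_size = value_size
        self.max_entries = max_entries
        self._data = {}

    def update(self, key, value, flags=BPF_ANY):
        # Keys and values are fixed-size byte blobs declared at creation.
        # Rejecting mismatches here is a simplification made by this toy.
        if len(key) != self.key_size or len(value) != self.value_size:
            raise OSError(errno.EINVAL, "wrong key or value size")
        exists = key in self._data
        if flags == BPF_NOEXIST and exists:
            raise OSError(errno.EEXIST, "key already exists")
        if flags == BPF_EXIST and not exists:
            raise OSError(errno.ENOENT, "key does not exist")
        if not exists and len(self._data) >= self.max_entries:
            raise OSError(errno.E2BIG, "map is full")
        self._data[key] = value

    def lookup(self, key):
        return self._data.get(key)  # None models a NULL lookup result


CONFIG_KEY = bytes(4)                 # hypothetical single config slot
ENABLED = (1).to_bytes(4, "little")


def count_packet(config, counters, port):
    """Stand-in for an eBPF program that runs once per event."""
    if config.lookup(CONFIG_KEY) != ENABLED:
        return                        # counting disabled by user space
    key = port.to_bytes(2, "little")
    old = counters.lookup(key) or bytes(8)
    new = int.from_bytes(old, "little") + 1
    counters.update(key, new.to_bytes(8, "little"))


config = ToyHashMap(key_size=4, value_size=4, max_entries=1)
counters = ToyHashMap(key_size=2, value_size=8, max_entries=2)

config.update(CONFIG_KEY, ENABLED)    # user space writes configuration
for port in (80, 80, 443):            # three "events" on the kernel side
    count_packet(config, counters, port)

# User space reads the metrics back.
hits_80 = int.from_bytes(counters.lookup((80).to_bytes(2, "little")), "little")
```

Here one map carries configuration from user space to the program and a second carries counters in the other direction, which are two of the three scenarios listed above.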
The eBPF virtual machine runs within the kernel and takes in a program in the form of eBPF bytecode instructions, which are converted to native machine instructions that run on the CPU. Early implementations of eBPF interpreted the bytecode, but interpretation has now largely been replaced by just-in-time (JIT) compilation for performance and security-related reasons. [45]
The eBPF virtual machine consists of eleven 64-bit registers with 32-bit subregisters, a program counter and a 512-byte BPF stack space. Ten of the registers (r0 to r9) are general purpose and keep track of state while eBPF programs execute; the eleventh (r10) is a read-only frame pointer. [46]
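To make the register model concrete, the following Python sketch packs two basic instructions into the 8-byte form in which each one is stored: an 8-bit opcode, two 4-bit register numbers, a 16-bit offset and a 32-bit immediate. It assumes a little-endian host, and the helper name `encode` is hypothetical. The two opcodes are the standard encodings of a 64-bit move-immediate and of exit.

```python
import struct


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one basic eBPF instruction into its 8-byte little-endian form."""
    # dst_reg sits in the low nibble and src_reg in the high nibble.
    regs = (src << 4) | dst
    return struct.pack("<BBhi", opcode, regs, off, imm)


MOV64_IMM = 0xB7  # BPF_ALU64 | BPF_MOV | BPF_K: dst = imm
EXIT = 0x95       # BPF_JMP | BPF_EXIT: return to the caller, result in r0

# "r0 = 2; exit": the smallest useful program, returning the constant 2.
program = encode(MOV64_IMM, dst=0, imm=2) + encode(EXIT)
hex_dump = program.hex()
```

Because only four bits are available per register field, at most sixteen registers can be named. The eleven that eBPF defines fit comfortably.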
Tail calls can call and execute another eBPF program and replace the execution context, similar to how the execve() system call operates for regular processes. Tail calls are implemented as a long jump, reusing the same stack frame, which makes them particularly useful in eBPF, where the stack is limited to 512 bytes. During runtime, functionality can be added or replaced atomically, thus altering the BPF program’s execution behavior. [46] A popular use case for tail calls is to spread the complexity of eBPF programs over several programs. Another is to replace or extend logic by replacing the contents of the program array while it is in use, for example to update a program version without downtime or to enable or disable logic. [47]
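The control flow of a tail call can be modeled in a few lines of Python. This is a user-space sketch under stated assumptions, not the kernel mechanism: programs are plain functions, the program array is a dictionary, and `run`, `parse`, `policy_v1` and `policy_v2` are hypothetical names. The cap of 33 chained calls is an assumption about current kernels and may differ on yours.

```python
MAX_TAIL_CALLS = 33  # assumed cap on chained tail calls


def run(prog_array, entry, ctx):
    """Run `entry` and follow tail calls until a program returns a value."""
    prog, hops = entry, 0
    while True:
        result = prog(ctx)
        if not (isinstance(result, tuple) and result[0] == "tail_call"):
            return result            # ordinary return: execution is finished
        _, index, fallback = result
        target = prog_array.get(index)
        if target is None or hops >= MAX_TAIL_CALLS:
            # A failed tail call lets the caller continue; the toy models
            # that by returning the caller's fallback value.
            return fallback
        # The target replaces the current program; nothing returns to the caller.
        prog, hops = target, hops + 1


def parse(ctx):
    ctx["parsed"] = True
    return ("tail_call", 1, "pass")  # jump to slot 1, else fall back to "pass"


def policy_v1(ctx):
    return "drop" if ctx["port"] == 23 else "pass"


def policy_v2(ctx):
    return "drop" if ctx["port"] in (23, 2323) else "pass"


prog_array = {1: policy_v1}
before = run(prog_array, parse, {"port": 2323})
prog_array[1] = policy_v2            # swap the logic while the array is in use
after = run(prog_array, parse, {"port": 2323})
```

Replacing slot 1 changes the behavior of `parse` without reloading it, which corresponds to the no-downtime update use case.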
It is generally considered good practice in software development to group common code into a function, encapsulating logic for reusability. Prior to Linux kernel 4.16 and LLVM 6.0, a typical eBPF C program had to explicitly direct the compiler to inline every function, resulting in a BPF object file that contained duplicated code. This restriction was lifted, and mainstream eBPF compilers now support writing functions naturally in eBPF programs. This reduces the generated eBPF code size, making it friendlier to the CPU instruction cache. [45] [46]
The verifier is a core component of eBPF, and its main responsibility is to ensure that an eBPF program is safe to execute. It performs a static analysis of the eBPF bytecode, assessing all possible execution paths. Verification starts with a depth-first search through all possible paths of the program. The verifier simulates the execution of each instruction in order, tracking the state of the registers and the stack; if any instruction could lead to an unsafe state, verification fails. This process continues until all paths have been analyzed or a violation is found.
Depending on the type of program, the verifier checks for violations of specific rules. These rules can include the following: an eBPF program must always terminate within a reasonable amount of time (no infinite loops or infinite recursion); an eBPF program must not read arbitrary memory, because doing so could allow it to leak sensitive information; network programs must not access memory outside packet bounds, because adjacent memory could contain sensitive information; programs must not deadlock, so any held spinlock must be released, and only one lock can be held at a time to avoid deadlocks across multiple programs; and programs must not read uninitialized memory. This is not an exhaustive list of the checks the verifier performs, and there are exceptions to these rules. For example, tracing programs have access to helpers that allow them to read memory in a controlled way, but these program types require root privileges and thus do not pose a security risk. [47] [45]
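The path exploration described above can be sketched with a toy verifier in Python. It is a heavy simplification under stated assumptions: the tuple-based instruction set is hypothetical, only register initialization is tracked, and every backward jump is rejected outright. The real verifier tracks far richer state and accepts bounded loops.

```python
def verify(prog):
    """Return (True, "ok") or (False, reason) for a toy instruction list."""
    # Each pending state is (program counter, set of initialised registers).
    # On entry only r1 (context) and r10 (frame pointer) hold valid values.
    pending = [(0, frozenset({1, 10}))]
    while pending:
        pc, init = pending.pop()           # LIFO order gives depth-first search
        if not 0 <= pc < len(prog):
            return False, f"pc {pc}: execution falls off the program"
        op, *args = prog[pc]
        if op == "mov":                     # ("mov", dst, imm)
            pending.append((pc + 1, init | {args[0]}))
        elif op == "add":                   # ("add", dst, src)
            for reg in args:
                if reg not in init:
                    return False, f"pc {pc}: r{reg} read before write"
            pending.append((pc + 1, init))
        elif op == "jeq":                   # ("jeq", reg, imm, offset)
            reg, _imm, off = args
            if reg not in init:
                return False, f"pc {pc}: r{reg} read before write"
            if off < 0:
                return False, f"pc {pc}: back-edge (possible loop)"
            pending.append((pc + 1 + off, init))  # branch taken
            pending.append((pc + 1, init))        # fall through
        elif op == "exit":                  # ("exit",)
            if 0 not in init:
                return False, f"pc {pc}: r0 not set at exit"
        else:
            return False, f"pc {pc}: unknown opcode {op}"
    return True, "ok"


# r0 is written on every path before exit.
good = [("mov", 0, 0), ("jeq", 1, 0, 1), ("mov", 0, 1), ("exit",)]
# r0 is written only when the branch is not taken.
bad_uninit = [("jeq", 1, 0, 1), ("mov", 0, 1), ("exit",)]
# A backward jump that could loop forever.
bad_loop = [("mov", 0, 0), ("jeq", 0, 0, -2), ("exit",)]

results = [verify(p) for p in (good, bad_uninit, bad_loop)]
```

`bad_uninit` is rejected even though one of its two paths is safe, which shows why the analysis has to cover every path and not only the likely one.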
Over time the eBPF verifier has evolved to include newer features and optimizations, such as support for bounded loops, dead-code elimination, function-by-function verification, and callbacks.
eBPF programs use memory and data structures from the kernel. Because the Linux kernel is continuously developed, these structures can change between kernel versions, altering their memory layout, and there is no guarantee that internal data structures remain the same across versions. CO-RE (Compile Once, Run Everywhere) is a fundamental concept in modern eBPF development that allows eBPF programs to be portable across different kernel versions and configurations. It addresses the challenge of kernel structure variations between different Linux distributions and versions.
CO-RE comprises several elements. BTF (BPF Type Format) is a metadata format that describes the types used in the kernel and in eBPF programs, and provides detailed information about struct layouts, field offsets, and data types. It enables runtime accessibility of kernel types, which is crucial for BPF program development and verification. BTF is included in the kernel image of BTF-enabled kernels. The compiler (e.g., LLVM) emits special relocations that capture high-level descriptions of what information the eBPF program intends to access. The libbpf library adapts eBPF programs to work with the data structure layout on the target kernel where they run, even if this layout is different from the kernel where the code was compiled. To do this, libbpf needs the BPF CO-RE relocation information generated by Clang as part of the compilation process. [45] The compiled eBPF program is stored in an ELF (Executable and Linkable Format) object file, which contains the BTF type information and the Clang-generated relocations. The ELF format allows the eBPF loader (e.g., libbpf) to process and adjust the BPF program dynamically for the target kernel. [48]
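The relocation step can be illustrated with a toy loader in Python. Everything in this sketch is hypothetical: the layout tables stand in for BTF, the byte offsets are invented, and `compile_prog` and `load_prog` only mimic the division of labor between the compiler and a loader such as libbpf.

```python
# Hypothetical BTF-like tables: struct name -> field name -> byte offset.
# The offsets are invented for illustration only.
BUILD_KERNEL = {"task_struct": {"pid": 1256, "comm": 1712}}
TARGET_KERNEL = {"task_struct": {"pid": 1304, "comm": 1768}}


def compile_prog(layout):
    """Emit a toy load instruction plus a relocation record describing it."""
    insns = [{"op": "load", "dst": 0, "src": 1,
              "off": layout["task_struct"]["pid"]}]
    # The relocation names the field symbolically, not by its offset.
    relocs = [{"insn": 0, "struct": "task_struct", "field": "pid"}]
    return insns, relocs


def load_prog(insns, relocs, target_layout):
    """Patch field offsets to match the kernel the program is loaded on."""
    patched = [dict(insn) for insn in insns]   # leave the object file intact
    for reloc in relocs:
        fields = target_layout.get(reloc["struct"], {})
        if reloc["field"] not in fields:
            raise LookupError(
                f"{reloc['struct']}.{reloc['field']} missing on target kernel")
        patched[reloc["insn"]]["off"] = fields[reloc["field"]]
    return patched


insns, relocs = compile_prog(BUILD_KERNEL)
patched = load_prog(insns, relocs, TARGET_KERNEL)
compiled_off, loaded_off = insns[0]["off"], patched[0]["off"]
```

The compiled instruction carries the offset of the build kernel, while the loaded copy carries the offset of the target kernel. The program is compiled once and adjusted at load time.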
The alias eBPF is often used interchangeably with BPF, [2] [49] for example by the Linux kernel community. eBPF and BPF are referred to as a technology name, like LLVM. [2] eBPF evolved from the machine language for the filtering virtual machine in the Berkeley Packet Filter as an extended version, but as its use cases outgrew networking, "eBPF" is today preferentially interpreted as a pseudo-acronym. [2]
The bee is the official logo for eBPF. At the first eBPF Summit, a vote was taken and the bee mascot was named "eBee". [50] [51] The logo was originally created by Vadim Shchekoldin. [51] Earlier unofficial eBPF mascots existed, [52] but did not see widespread adoption.
The eBPF Foundation was created in August 2021 with the goal of expanding the contributions being made to extend the powerful capabilities of eBPF and grow beyond Linux. [1] Founding members include Meta, Google, Isovalent, Microsoft and Netflix. Its purpose is to raise, budget and spend funds in support of various open source, open data and/or open standards projects relating to eBPF technologies, [53] to further drive the growth and adoption of the eBPF ecosystem. Since its inception, Red Hat, Huawei, Crowdstrike, Tigera, DaoCloud, Datoms and FutureWei have also joined. [54]
eBPF has been adopted by a number of large-scale production users, for example:
Due to its ease of programmability, eBPF has been used as a tool for implementing microarchitectural timing side-channel attacks such as Spectre against vulnerable microprocessors. [99] Although mitigations against transient execution attacks were implemented for unprivileged eBPF, [100] the kernel community ultimately disabled unprivileged use by default to protect against its use in exploiting future hardware vulnerabilities. [101]