Io uring

Last updated

io_uring [a] (previously known as aioring) is a Linux kernel system call interface for storage device asynchronous I/O operations addressing performance issues with similar interfaces provided by functions like read()/write() or aio_read()/aio_write() etc. for operations on data accessed by file descriptors. [2] [3] :2

Contents

Development is ongoing, worked on primarily by Jens Axboe at Meta. [2]

Interface

It works by creating two circular buffers, called "queue rings", for storage of submission and completion of I/O requests, respectively. For storage devices, these are called the submission queue (SQ) and completion queue (CQ). [4] Keeping these buffers shared between the kernel and application helps to boost the I/O performance by eliminating the need to issue extra and expensive system calls to copy these buffers between the two. [2] [4] [5] According to the io_uring design paper, the SQ buffer is writable only by consumer applications, and the CQ buffer is writable only by the kernel. [2] :3

eBPF can be combined with io_uring. [6]

History

The Linux kernel has supported asynchronous I/O since version 2.5, but it was seen as difficult to use and inefficient. [7] This older API only supported certain niche use cases, [8] notably it only enables asynchronous operation when using the O_DIRECT flag and while accessing already allocated files. This prevents utilizing the page cache, while also exposing the application to complex O_DIRECT semantics. Linux AIO also does not support sockets, so it cannot be used to multiplex network and disk I/O. [9]

The io_uring kernel interface was adopted in Linux kernel version 5.1 to resolve the deficiencies of Linux AIO. [2] [5] [10] The liburing library provides an API to interact with the kernel interface easily from userspace. [2] :12

Security

io_uring has been noted for exposing a significant attack surface and structural difficulties integrating it with the Linux security subsystem. [11]

In June 2023, Google's security team reported that 60% of the exploits submitted to their bug bounty program in 2022 were exploits of the Linux kernel's io_uring vulnerabilities. As a result, io_uring was disabled for apps in Android, and disabled entirely in ChromeOS as well as Google servers. [12] Docker also consequently disabled io_uring from their default seccomp profile. [13]

Notes

  1. Input/output user ring [1]

Related Research Articles

XFS is a high-performance 64-bit journaling file system created by Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS was ported to the Linux kernel in 2001; as of June 2014, XFS is supported by most Linux distributions; Red Hat Enterprise Linux uses it as its default file system.

<span class="mw-page-title-main">System call</span> Way for programs to access kernel services

In computing, a system call is the programmatic way in which a computer program requests a service from the operating system on which it is executed. This may include hardware-related services, creation and execution of new processes, and communication with integral kernel services such as process scheduling. System calls provide an essential interface between a process and the operating system.

Completely Fair Queuing (CFQ) is an I/O scheduler for the Linux kernel which was written in 2003 by Jens Axboe.

The Direct Rendering Manager (DRM) is a subsystem of the Linux kernel responsible for interfacing with GPUs of modern video cards. DRM exposes an API that user-space programs can use to send commands and data to the GPU and perform operations such as configuring the mode setting of the display. DRM was first developed as the kernel-space component of the X Server Direct Rendering Infrastructure, but since then it has been used by other graphic stack alternatives such as Wayland and standalone applications and libraries such as SDL2 and Kodi.

In computer science, asynchronous I/O is a form of input/output processing that permits other processing to continue before the I/O operation has finished. A name used for asynchronous I/O in the Windows API is overlapped I/O.

<span class="mw-page-title-main">Linux kernel interfaces</span> An overview and comparison of the Linux kernel API and ABI.

The Linux kernel provides multiple interfaces to user-space and kernel-mode code that are used for varying purposes and that have varying properties by design. There are two types of application programming interface (API) in the Linux kernel:

  1. the "kernel–user space" API; and
  2. the "kernel internal" API.

seccomp is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit , sigreturn , read and write to already-open file descriptors. Should it attempt any other system calls, the kernel will either just log the event or terminate the process with SIGKILL or SIGSYS. In this sense, it does not virtualize the system's resources but isolates the process from them entirely.

In computing, ioctl is a system call for device-specific input/output operations and other operations which cannot be expressed by regular file semantics. It takes a parameter specifying a request code; the effect of a call depends completely on the request code. Request codes are often device-specific. For instance, a CD-ROM device driver which can instruct a physical device to eject a disc would provide an ioctl request code to do so. Device-independent request codes are sometimes used to give userspace access to kernel functions which are only used by core system software or still under development.

sync is a standard system call in the Unix operating system, which commits all data from the kernel filesystem buffers to non-volatile storage, i.e., data which has been scheduled for writing via low-level I/O system calls. Higher-level I/O layers such as stdio may maintain separate buffers of their own.

<span class="mw-page-title-main">Disk buffer</span>

In computer storage, a disk buffer is the embedded memory in a hard disk drive (HDD) or solid-state drive (SSD) acting as a buffer between the rest of the computer and the physical hard disk platter or flash memory that is used for storage. Modern hard disk drives come with 8 to 256 MiB of such memory, and solid-state drives come with up to 4 GB of cache memory.

Jens Axboe is a Linux kernel hacker.

The Berkeley Packet Filter is a network tap and packet filter which permits computer network packets to be captured and filtered at the operating system level. It provides a raw interface to data link layers, permitting raw link-layer packets to be sent and received, and allows a userspace process to supply a filter program that specifies which packets it wants to receive. For example, a tcpdump process may want to receive only packets that initiate a TCP connection. BPF returns only packets that pass the filter that the process supplies. This avoids copying unwanted packets from the operating system kernel to the process, greatly improving performance. The filter program is in the form of instructions for a virtual machine, which are interpreted, or compiled into machine code by a just-in-time (JIT) mechanism and executed, in the kernel.

<span class="mw-page-title-main">Linux kernel</span> Free Unix-like operating system kernel

The Linux kernel is a free and open source, UNIX-like kernel that is used in many computer systems worldwide. The kernel was created by Linus Torvalds in 1991 and was soon adopted as the kernel for the GNU operating system (OS) which was created to be a free replacement for Unix. Since the late 1990s, it has been included in many operating system distributions, many of which are called Linux. One such Linux kernel operating system is Android which is used in many mobile and embedded devices.

cgroups is a Linux kernel feature that limits, accounts for, and isolates the resource usage of a collection of processes.

<span class="mw-page-title-main">Network scheduler</span> Arbiter on a node in packet switching communication network

A network scheduler, also called packet scheduler, queueing discipline (qdisc) or queueing algorithm, is an arbiter on a node in a packet switching communication network. It manages the sequence of network packets in the transmit and receive queues of the protocol stack and network interface controller. There are several network schedulers available for the different operating systems, that implement many of the existing network scheduling algorithms.

bcache is a cache mechanism in the Linux kernel's block layer, which is used for accessing secondary storage devices. It allows one or more fast storage devices, such as flash-based solid-state drives (SSDs), to act as a cache for one or more slower storage devices, such as hard disk drives (HDDs); this effectively creates hybrid volumes and provides performance improvements.

dm-cache is a component of the Linux kernel's device mapper, which is a framework for mapping block devices onto higher-level virtual block devices. It allows one or more fast storage devices, such as flash-based solid-state drives (SSDs), to act as a cache for one or more slower storage devices such as hard disk drives (HDDs); this effectively creates hybrid volumes and provides secondary storage performance improvements.

XDP is an eBPF-based high-performance data path used to send and receive network packets at high rates by bypassing most of the operating system networking stack. It is merged in the Linux kernel since version 4.8. This implementation is licensed under GPL. Large technology firms including Amazon, Google and Intel support its development. Microsoft released their free and open source implementation XDP for Windows in May 2022. It is licensed under MIT License.

<span class="mw-page-title-main">Tokio (software)</span> Library for Rust programming language

Tokio is a software library for the Rust programming language. It provides a runtime and functions that enable the use of asynchronous I/O, allowing for concurrency in regards to task completion.

eBPF Safe dynamic programs and tools

eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter filtering mechanism in Linux and is also used in non-networking parts of the Linux kernel as well.

References

  1. Axboe, Jens. "@axboe@fosstodon.org".
  2. 1 2 3 4 5 6 "Linux Kernel Getting io_uring To Deliver Fast & Efficient I/O". Phoronix . 2019-02-14. Retrieved 2021-03-14.
  3. Axboe, Jens (October 15, 2019). "Efficient IO with io_uring" (PDF).
  4. 1 2 "Getting Hands-on with io_uring using Go". developers.mattermost.com. Retrieved 2021-11-20.
  5. 1 2 "The rapid growth of io_uring [LWN.net]". lwn.net. Retrieved 2021-11-20.
  6. "BPF meets io_uring [LWN.net]". LWN.net . Retrieved 2023-04-17.
  7. Corbet, Jonathan. "Ringing in a new asynchronous I/O API". LWN.net . Retrieved 2021-03-14.
  8. "What's new with io_uring" (PDF). Retrieved 2022-06-01.
  9. "Linux Asynchronous I/O". 2014-04-21. Archived from the original on 2015-04-06. Retrieved 2023-06-16. Blocking during io_submit on ext4, on buffered operations, network access, pipes, etc. Some operations are not well-represented by the AIO interface. With completely unsupported operations like buffered reads, operations on a socket or pipes, the entire operation will be performed during the io_submit syscall, with the completion available immediately for access with io_getevents. AIO access to a file on a filesystem like ext4 is partially supported: if a metadata read is required to look up the data block (ie if the metadata is not already in memory), then the io_submit call will block on the metadata read. Certain types of file-enlarging writes are completely unsupported and block for the entire duration of the operation.
  10. "Faster IO through io_uring". Kernel Recipes 2019. Retrieved 2021-03-14.
  11. Corbet, Jonathan (2022-07-28). "Security requirements for new kernel features". LWN.net . Retrieved 2023-06-16.
  12. Koczka, Tamás. "Learnings from kCTF VRP's 42 Linux kernel exploits submissions". Google Online Security Blog. Google. Archived from the original on 2024-09-22. Retrieved 14 June 2023. 60% of the submissions exploited the io_uring component of the Linux kernel
  13. "Update RuntimeDefault seccomp profile to disallow io_uring related syscalls by vinayakankugoyal · Pull Request #9320 · containerd/containerd". GitHub. 2023-11-02. Archived from the original on 2024-01-06. Retrieved 2024-10-20.