io_uring

io_uring (previously known as aioring) is a Linux kernel system call interface for asynchronous I/O operations on storage devices. It addresses performance problems with similar interfaces, such as read()/write() and aio_read()/aio_write(), that operate on data accessed through file descriptors. [1] [2] :2

Development is ongoing, led primarily by Jens Axboe at Meta. [1]

Interface

io_uring works by creating two circular buffers, called "queue rings", that hold submitted I/O requests and their completions, respectively. For storage devices, these are called the submission queue (SQ) and completion queue (CQ). [3] Because the buffers are shared between the kernel and the application, no extra, expensive system calls are needed to copy them between the two, which improves I/O performance. [1] [3] [4] According to the io_uring design paper, the SQ buffer is writable only by consumer applications, and the CQ buffer is writable only by the kernel. [1] :3

eBPF can be combined with io_uring. [5]

History

The Linux kernel has supported asynchronous I/O since version 2.5, but it was seen as difficult to use and inefficient. [6] This older API supported only certain niche use cases; [7] notably, it enables asynchronous operation only when using the O_DIRECT flag and accessing already allocated files. This prevents use of the page cache and exposes the application to complex O_DIRECT semantics. Linux AIO also does not support sockets, so it cannot be used to multiplex network and disk I/O. [8]

The io_uring kernel interface was adopted in Linux kernel version 5.1 to resolve the deficiencies of Linux AIO. [1] [4] [9] The liburing library provides an API for interacting with the kernel interface from userspace. [1] :12

Security

io_uring has been noted for exposing a significant attack surface and for structural difficulties in integrating it with the Linux security subsystem. [10]

In June 2023, Google's security team reported that 60% of the Linux kernel exploits submitted to its bug bounty program in 2022 targeted io_uring vulnerabilities. As a result, io_uring was disabled for apps in Android, and disabled entirely in ChromeOS and on Google servers. [11] Docker subsequently blocked io_uring in its default seccomp profile. [12]

References

  1. "Linux Kernel Getting io_uring To Deliver Fast & Efficient I/O - Phoronix". Phoronix. Retrieved 2021-03-14.
  2. Axboe, Jens (October 15, 2019). "Efficient IO with io_uring" (PDF).
  3. "Getting Hands-on with io_uring using Go". developers.mattermost.com. Retrieved 2021-11-20.
  4. "The rapid growth of io_uring [LWN.net]". LWN.net. Retrieved 2021-11-20.
  5. "BPF meets io_uring [LWN.net]". LWN.net. Retrieved 2023-04-17.
  6. Corbet, Jonathan. "Ringing in a new asynchronous I/O API". LWN.net. Retrieved 2021-03-14.
  7. "What's new with io_uring" (PDF). Retrieved 2022-06-01.
  8. "Linux Asynchronous I/O". 2014-04-21. Archived from the original on 2015-04-06. Retrieved 2023-06-16. Blocking during io_submit on ext4, on buffered operations, network access, pipes, etc. Some operations are not well-represented by the AIO interface. With completely unsupported operations like buffered reads, operations on a socket or pipes, the entire operation will be performed during the io_submit syscall, with the completion available immediately for access with io_getevents. AIO access to a file on a filesystem like ext4 is partially supported: if a metadata read is required to look up the data block (ie if the metadata is not already in memory), then the io_submit call will block on the metadata read. Certain types of file-enlarging writes are completely unsupported and block for the entire duration of the operation.
  9. "Faster IO through io_uring | Kernel Recipes 2019". Retrieved 2021-03-14.
  10. Corbet, Jonathan (2022-07-28). "Security requirements for new kernel features". LWN.net. Retrieved 2023-06-16.
  11. Koczka, Tamás. "Learnings from kCTF VRP's 42 Linux kernel exploits submissions". Google Online Security Blog. Google. Retrieved 2023-06-14.
  12. "seccomp: block io_uring_* syscalls in default profile by akerouanton · Pull Request #46762 · moby/moby". GitHub. Retrieved 2023-11-02.