Splice (system call)

Last updated January 29, 2025

splice() is a Linux-specific system call that moves data between a file descriptor and a pipe without a round trip to user space. The related system call vmsplice() moves or copies data between a pipe and user space. Ideally, splice and vmsplice work by remapping pages and do not actually copy any data, which may improve I/O performance. As linear addresses do not necessarily correspond to contiguous physical addresses, this may not be possible in all cases and on all hardware combinations.

Workings

With splice(), one can move data from one file descriptor to another without copying it between kernel space and user space. Such copies are normally required to enforce system security and to keep a simple interface for processes that read and write files. splice() works by using a pipe buffer, an in-kernel memory buffer that is opaque to the user-space process. A user process can splice the contents of a source file into this pipe buffer, then splice the pipe buffer into the destination file, all without moving any data through user space.

Origins

Linus Torvalds described splice() in a 2006 email, which was included in a KernelTrap article.^[1]

The Linux splice implementation borrows some ideas from an original proposal by Larry McVoy in 1998.^[2] The splice system calls first appeared in Linux kernel version 2.6.17^[1] and were written by Jens Axboe.

Prototype

    ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

Some constants that are of interest are:

    /* Splice flags (not laid down in stone yet). */
    #ifndef SPLICE_F_MOVE
    #define SPLICE_F_MOVE           0x01
    #endif
    #ifndef SPLICE_F_NONBLOCK
    #define SPLICE_F_NONBLOCK       0x02
    #endif
    #ifndef SPLICE_F_MORE
    #define SPLICE_F_MORE           0x04
    #endif
    #ifndef SPLICE_F_GIFT
    #define SPLICE_F_GIFT           0x08
    #endif
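As a rough illustration of how one of these flags changes behaviour, the sketch below passes SPLICE_F_NONBLOCK so that a splice() from an empty pipe fails with EAGAIN instead of blocking. The helper name try_splice_nonblock is hypothetical and not part of any library; the sketch assumes glibc on Linux, where the SPLICE_F_* constants come from <fcntl.h> once _GNU_SOURCE is defined.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes splice() and the SPLICE_F_* flags in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: make one non-blocking attempt to move up to len
 * bytes from the read end of a pipe to out_fd.
 * Returns the number of bytes moved (0 means end of input), or a negative
 * errno value such as -EAGAIN when the pipe currently holds no data.
 */
ssize_t try_splice_nonblock(int pipe_rd, int out_fd, size_t len)
{
    ssize_t n = splice(pipe_rd, NULL, out_fd, NULL, len,
                       SPLICE_F_NONBLOCK | SPLICE_F_MOVE);
    if (n < 0)
        return -errno;
    return n;
}
```

A caller would typically treat -EAGAIN as "try again later", for example after the descriptor is reported readable by an event loop.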

Example

This is an example of splice in action:

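    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Transfer from disk to a log.
     * struct log_handle is assumed to provide the members fd and fd_offset. */
    int log_blocks(struct log_handle *handle, int fd, loff_t offset, size_t size)
    {
        int filedes[2];
        int err = 0;
        size_t to_write = size;

        if (pipe(filedes) < 0)
            return -errno;

        while (to_write > 0) {
            /* Splice the file into the pipe (data in kernel memory). */
            ssize_t in_pipe = splice(fd, &offset, filedes[1], NULL, to_write,
                                     SPLICE_F_MORE | SPLICE_F_MOVE);
            if (in_pipe < 0) {
                err = -errno;
                break;
            }
            if (in_pipe == 0)
                break;          /* source exhausted before size bytes were read */
            to_write -= in_pipe;

            /* Splice the data in the pipe (in kernel memory) into the file.
             * The pipe is drained before it is refilled; a transfer larger
             * than the pipe capacity would otherwise block. */
            while (in_pipe > 0) {
                ssize_t ret = splice(filedes[0], NULL, handle->fd,
                                     &handle->fd_offset, in_pipe,
                                     SPLICE_F_MORE | SPLICE_F_MOVE);
                if (ret <= 0) {
                    /* 0 should not happen while the pipe still holds data. */
                    err = ret < 0 ? -errno : -EIO;
                    goto out;
                }
                in_pipe -= ret;
            }
        }

    out:
        close(filedes[0]);
        close(filedes[1]);
        return err;
    }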

Complementary system calls

splice() is one of three system calls that make up the splice() architecture. vmsplice() can map an application data area into a pipe (or vice versa), thus allowing transfers between pipes and user memory, whereas splice() transfers between a file descriptor and a pipe. tee() is the last part of the trilogy. It duplicates one pipe to another, enabling forks in the way applications are connected with pipes.
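The two complementary calls can be sketched as follows. This is a minimal illustration, assuming glibc on Linux; the helper names buffer_to_pipe and duplicate_pipe are hypothetical wrappers introduced here, not existing APIs. The first hands a user buffer to a pipe with vmsplice(), the second copies pipe contents to a second pipe with tee() without consuming them from the first.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes vmsplice() and tee() in <fcntl.h> */
#endif
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: queue a user-space buffer on the write end of a
 * pipe using vmsplice(). Returns the number of bytes queued, or -1 with
 * errno set.
 */
ssize_t buffer_to_pipe(int pipe_wr, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    return vmsplice(pipe_wr, &iov, 1, 0);
}

/*
 * Hypothetical helper: duplicate up to len bytes from one pipe into
 * another using tee(). The data stays readable from the source pipe.
 * Returns the number of bytes duplicated, or -1 with errno set.
 */
ssize_t duplicate_pipe(int src_rd, int dst_wr, size_t len)
{
    return tee(src_rd, dst_wr, len, 0);
}
```

Because the buffer is passed without SPLICE_F_GIFT, the caller should leave it unmodified until the data has been read from the pipe.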

Requirements

When using splice() with sockets, the network controller (NIC) should support DMA; otherwise splice() will not deliver a large performance improvement. The reason is that each page in the pipe is filled only up to the frame size (1460 bytes of the available 4096 bytes per page).

Not all filesystem types support splice().
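One way a program might cope with this is to try splice() first and fall back to an ordinary read()/write() loop when the kernel rejects the call. The sketch below assumes that an unsupported descriptor is reported as EINVAL, which is how the splice(2) manual page describes it; the names copy_fd, drain_pipe and copy_rw are hypothetical helpers written for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes splice() in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy one chunk (at most 4096 bytes) with read()/write().
 * Returns bytes copied, 0 at end of input, -1 on error. */
static ssize_t copy_rw(int in_fd, int out_fd, size_t len)
{
    char buf[4096];
    if (len > sizeof buf)
        len = sizeof buf;
    ssize_t n = read(in_fd, buf, len);
    for (ssize_t off = 0; off < n; ) {
        ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
        if (w < 0)
            return -1;
        off += w;
    }
    return n;
}

/* Move exactly len bytes out of the pipe, preferring splice(). */
static int drain_pipe(int pipe_rd, int out_fd, size_t len)
{
    int use_splice = 1;
    while (len > 0) {
        ssize_t n;
        if (use_splice) {
            n = splice(pipe_rd, NULL, out_fd, NULL, len, SPLICE_F_MOVE);
            if (n < 0 && errno == EINVAL) {
                use_splice = 0;     /* destination cannot be spliced to */
                continue;
            }
        } else {
            n = copy_rw(pipe_rd, out_fd, len);
        }
        if (n <= 0)
            return -1;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy up to len bytes from in_fd to out_fd at their current offsets.
 * Returns bytes copied (short at end of input), or -1 with errno set. */
ssize_t copy_fd(int in_fd, int out_fd, size_t len)
{
    int pfd[2];
    int use_splice = 1;
    ssize_t total = 0;

    if (pipe(pfd) < 0)
        return -1;

    while ((size_t)total < len) {
        size_t want = len - (size_t)total;
        ssize_t n;
        if (use_splice) {
            n = splice(in_fd, NULL, pfd[1], NULL, want, SPLICE_F_MOVE);
            if (n < 0 && errno == EINVAL) {
                use_splice = 0;     /* source cannot be spliced from */
                continue;
            }
            if (n > 0 && drain_pipe(pfd[0], out_fd, (size_t)n) < 0)
                n = -1;
        } else {
            n = copy_rw(in_fd, out_fd, want);
        }
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)
            break;                  /* end of input */
        total += n;
    }

    int saved = errno;              /* close() may overwrite errno */
    close(pfd[0]);
    close(pfd[1]);
    errno = saved;
    return total;
}
```

The fallback is decided per descriptor: a source that cannot be spliced is read with read(), and a destination that cannot be spliced to is written with write() from the pipe, so no data already placed in the pipe is lost.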

References

[1] "Linux: Explaining splice() and tee()". kerneltrap.org. 2006-04-21. Archived from the original on 2013-05-21. Retrieved 2014-04-27.
[2] "Archived copy". Archived from the original on 2016-03-04. Retrieved 2016-02-28.

External links

Linux kernel 2.6.17 (kernelnewbies.org)
Two new system calls: splice() and sync_file_range() (LWN.net)
Some new system calls (LWN.net)

