Splice (system call)

Last updated

splice() is a Linux-specific system call that moves data between a file descriptor and a pipe without a round trip to user space. The related system call vmsplice() moves or copies data between a pipe and user space. Ideally, splice and vmsplice work by remapping pages and do not actually copy any data, which may improve I/O performance. As linear addresses do not necessarily correspond to contiguous physical addresses, this may not be possible in all cases and on all hardware combinations.

Contents

Workings

With splice(), one can move data from one file descriptor to another without incurring any copies from user space into kernel space, which is usually required to enforce system security and also to keep a simple interface for processes to read and write to files. splice() works by using the pipe buffer. A pipe buffer is an in-kernel memory buffer that is opaque to the userspace process. A user process can splice the contents of a source file into this pipe buffer, then splice the pipe buffer into the destination file, all without moving any data through userspace.

Origins

Linus Torvalds described splice() in a 2006 email, which was included in a KernelTrap article. [1]

The Linux splice implementation borrows some ideas from an original proposal by Larry McVoy in 1998. [2] The splice system calls first appeared in Linux kernel version 2.6.17 [1] and were written by Jens Axboe.

Prototype

ssize_tsplice(intfd_in,loff_t*off_in,intfd_out,loff_t*off_out,size_tlen,unsignedintflags);

Some constants that are of interest are:

/* Splice flags (not laid down in stone yet). */#ifndef SPLICE_F_MOVE#define SPLICE_F_MOVE           0x01#endif#ifndef SPLICE_F_NONBLOCK#define SPLICE_F_NONBLOCK       0x02#endif#ifndef SPLICE_F_MORE#define SPLICE_F_MORE           0x04#endif#ifndef SPLICE_F_GIFT#define SPLICE_F_GIFT           0x08#endif

Example

This is an example of splice in action:

/* Transfer from disk to a log. */intlog_blocks(structlog_handle*handle,intfd,loff_toffset,size_tsize){intfiledes[2];intret;size_tto_write=size;ret=pipe(filedes);if(ret<0)gotoout;/* splice the file into the pipe (data in kernel memory). */while(to_write>0){ret=splice(fd,&offset,filedes[1],NULL,to_write,SPLICE_F_MORE|SPLICE_F_MOVE);if(ret<0)gotopipe;elseto_write-=ret;}to_write=size;/* splice the data in the pipe (in kernel memory) into the file. */while(to_write>0){ret=splice(filedes[0],NULL,handle->fd,&(handle->fd_offset),to_write,SPLICE_F_MORE|SPLICE_F_MOVE);if(ret<0)gotopipe;elseto_write-=ret;}pipe:close(filedes[0]);close(filedes[1]);out:if(ret<0)return-errno;return0;}

Complementary system calls

splice() is one of three system calls that complete the splice() architecture. vmsplice() can map an application data area into a pipe (or vice versa), thus allowing transfers between pipes and user memory where sys_splice() transfers between a file descriptor and a pipe. tee() is the last part of the trilogy. It duplicates one pipe to another, enabling forks in the way applications are connected with pipes.

Requirements

When using splice() with sockets, the network controller (NIC) should support DMA, otherwise splice() will not deliver a large performance improvement. The reason for this is that each page of the pipe will just fill up to frame size (1460 bytes of the available 4096 bytes per page). splice() does not support UDP sockets.

Not all filesystem types support splice(). Also, AF_UNIX sockets do not support splice().

See also

Related Research Articles

ext2, or second extended file system, is a file system for the Linux kernel. It was initially designed by French software developer Rémy Card as a replacement for the extended file system (ext). Having been designed according to the same principles as the Berkeley Fast File System from BSD, it was the first commercial-grade filesystem for Linux.

Berkeley sockets is an application programming interface (API) for Internet sockets and Unix domain sockets, used for inter-process communication (IPC). It is commonly implemented as a library of linkable modules. It originated with the 4.2BSD Unix operating system, which was released in 1983.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

In Unix and Unix-like computer operating systems, a file descriptor is a process-unique identifier (handle) for a file or other input/output resource, such as a pipe or network socket.

In computing, a named pipe is an extension to the traditional pipe concept on Unix and Unix-like systems, and is one of the methods of inter-process communication (IPC). The concept is also found in OS/2 and Microsoft Windows, although the semantics differ substantially. A traditional pipe is "unnamed" and lasts only as long as the process. A named pipe, however, can last as long as the system is up, beyond the life of the process. It can be deleted if no longer used. Usually a named pipe appears as a file, and generally processes attach to it for IPC.

The GNU coding standards are a set of rules and guidelines for writing programs that work consistently within the GNU system. The GNU Coding Standards were written by Richard Stallman and other GNU Project volunteers. The standards document is part of the GNU Project and is available from the GNU website. Though it focuses on writing free software for GNU in C, much of it can be applied more generally. In particular, the GNU Project encourages its contributors to always try to follow the standards—whether or not their programs are implemented in C.

"Zero-copy" describes computer operations in which the CPU does not perform the task of copying data from one memory area to another or in which unnecessary data copies are avoided. This is frequently used to save CPU cycles and memory bandwidth in many time consuming tasks, such as when transmitting a file at high speed over a network, etc., thus improving the performance of programs (processes) executed by a computer.

In computing, ioctl is a system call for device-specific input/output operations and other operations which cannot be expressed by regular file semantics. It takes a parameter specifying a request code; the effect of a call depends completely on the request code. Request codes are often device-specific. For instance, a CD-ROM device driver which can instruct a physical device to eject a disc would provide an ioctl request code to do so. Device-independent request codes are sometimes used to give userspace access to kernel functions which are only used by core system software or still under development.

For most file systems, a program initializes access to a file in a file system using the open system call. This allocates resources associated to the file, and returns a handle that the process will use to refer to that file. In some cases the open is performed by the first access.

Netlink is a socket family used for inter-process communication (IPC) between both the kernel and userspace processes, and between different userspace processes, in a way similar to the Unix domain sockets available on certain Unix-like operating systems, including its original incarnation as a Linux kernel interface, as well as in the form of a later implementation on FreeBSD. Similarly to the Unix domain sockets, and unlike INET sockets, Netlink communication cannot traverse host boundaries. However, while the Unix domain sockets use the file system namespace, Netlink sockets are usually addressed by process identifiers (PIDs).

In computer science, the event loop is a programming construct or design pattern that waits for and dispatches events or messages in a program. The event loop works by making a request to some internal or external "event provider", then calls the relevant event handler. The event loop is also sometimes referred to as the message dispatcher, message loop, message pump, or run loop.

The Berkeley Packet Filter is a network tap and packet filter which permits computer network packets to be captured and filtered at the operating system level. It provides a raw interface to data link layers, permitting raw link-layer packets to be sent and received, and allows a userspace process to supply a filter program that specifies which packets it wants to receive. For example, a tcpdump process may want to receive only packets that initiate a TCP connection. BPF returns only packets that pass the filter that the process supplies. This avoids copying unwanted packets from the operating system kernel to the process, greatly improving performance. The filter program is in the form of instructions for a virtual machine, which are interpreted, or compiled into machine code by a just-in-time (JIT) mechanism and executed, in the kernel.

select is a system call and application programming interface (API) in Unix-like and POSIX-compliant operating systems for examining the status of file descriptors of open input/output channels. The select system call is similar to the poll facility introduced in UNIX System V and later operating systems. However, with the c10k problem, both select and poll have been superseded by the likes of kqueue, epoll, /dev/poll and I/O completion ports.

In Unix-like operating systems, a device file or device node or special file is an interface to a device driver that appears in a file system as if it were an ordinary file. There are also special files in DOS, OS/2, and Windows. These special files allow an application program to interact with a device by using its device driver via standard input/output system calls. Using standard system calls simplifies many programming tasks, and leads to consistent user-space I/O mechanisms regardless of device features and functions.

epoll is a Linux kernel system call for a scalable I/O event notification mechanism, first introduced in version 2.5.45 of the Linux kernel. Its function is to monitor multiple file descriptors to see whether I/O is possible on any of them. It is meant to replace the older POSIX select(2) and poll(2) system calls, to achieve better performance in more demanding applications, where the number of watched file descriptors is large (unlike the older system calls, which operate in O(n) time, epoll operates in O(1) time).

In modern POSIX compliant operating systems, a program that needs to access data from a file stored in a file system uses the read system call. The file is identified by a file descriptor that is normally obtained from a previous call to open. This system call reads in data in bytes, the number of which is specified by the caller, from the file and stores then into a buffer supplied by the calling process.

The write is one of the most basic routines provided by a Unix-like operating system kernel. It writes data from a buffer declared by the user to a given device, such as a file. This is the primary way to output data from a program by directly using a system call. The destination is identified by a numeric code. The data to be written, for instance a piece of text, is defined by a pointer and a size, given in number of bytes.

In Unix-like operating systems, dup and dup2 system calls create a copy of a given file descriptor. This new descriptor actually does not behave like a copy, but like an alias of the old one.

Checkpoint/Restore In Userspace (CRIU), is a software tool for the Linux operating system. Using this tool, it is possible to freeze a running application and checkpoint it to persistent storage as a collection of files. One can then use the files to restore and run the application from the point it was frozen at. The distinctive feature of the CRIU project is that it is mainly implemented in user space, rather than in the kernel.

CIFSD is an open-source in-kernel CIFS/SMB server created by Namjae Jeon for the Linux kernel. Initially the goal is to provide improved file I/O performance, but the bigger goal is to have some new features which are much easier to develop and maintain inside the kernel and expose the layers fully. Directions can be attributed to sections where Samba is moving to a few modules inside the kernel to have features like Remote direct memory access (RDMA) to work with actual performance gain.

References

  1. 1 2 "Linux: Explaining splice() and tee()". kerneltrap.org. 2006-04-21. Archived from the original on 2013-05-21. Retrieved 2014-04-27.
  2. "Archived copy". Archived from the original on 2016-03-04. Retrieved 2016-02-28.{{cite web}}: CS1 maint: archived copy as title (link)