Epoll

epoll is a Linux kernel API for scalable I/O event notification, first introduced in version 2.5.44 of the Linux kernel. [1] It monitors multiple file descriptors to see whether I/O is possible on any of them. It is meant to replace the older POSIX select(2) and poll(2) system calls and to perform better in demanding applications that watch a large number of file descriptors: the older calls operate in O(n) time in the number of watched descriptors, whereas epoll is described as operating in O(1) time. [2]

epoll is similar to FreeBSD's kqueue, in that it consists of a set of user-space functions, each taking a file descriptor argument denoting the configurable kernel object, against which they cooperatively operate. epoll uses a red–black tree (RB-tree) data structure to keep track of all file descriptors that are currently being monitored. [3]

API

int epoll_create1(int flags);

Creates an epoll object and returns its file descriptor. The flags parameter allows epoll behavior to be modified. It has only one valid value, EPOLL_CLOEXEC. epoll_create() is an older variant of epoll_create1() and is deprecated as of Linux kernel version 2.6.27 and glibc version 2.9. [4]
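
As a minimal sketch of the call described above, the following creates one epoll object with EPOLL_CLOEXEC and one without, then checks the close-on-exec flag on each. The helper names make_epoll, has_cloexec and demo_create are hypothetical and exist only for this illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll object with close-on-exec set.
 * Returns the new file descriptor, or -1 on failure. */
int make_epoll(void)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        perror("epoll_create1");
    return epfd;
}

/* Hypothetical helper: report whether FD_CLOEXEC is set on fd
 * (1 = set, 0 = not set, -1 = fcntl failed). */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Demo: returns 1 if the flag is set only on the descriptor that was
 * created with EPOLL_CLOEXEC, 0 otherwise. */
int demo_create(void)
{
    int with = make_epoll();         /* flags = EPOLL_CLOEXEC */
    int without = epoll_create1(0);  /* flags = 0 */
    int ok = with != -1 && without != -1 &&
             has_cloexec(with) == 1 && has_cloexec(without) == 0;

    if (with != -1)
        close(with);
    if (without != -1)
        close(without);
    return ok;
}
```

The epoll descriptor is an ordinary file descriptor, so it should be released with close() once it is no longer needed.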

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Controls (configures) which file descriptors are watched by this object, and for which events. op can be EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL, which respectively add, modify or delete the registration of fd.
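
The three operations might be wrapped as follows. This is a sketch rather than a recommended interface: watch_fd, rewatch_fd, unwatch_fd and demo_ctl are hypothetical names, and a pipe is assumed as the watched descriptor purely for convenience.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical wrapper: start watching fd for the given events. */
int watch_fd(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = {0};
    ev.events = events;
    ev.data.fd = fd;  /* handed back to the caller by epoll_wait */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical wrapper: replace the event mask of a watched fd. */
int rewatch_fd(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = {0};
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Hypothetical wrapper: stop watching fd. The event argument is
 * ignored for EPOLL_CTL_DEL and may be NULL since Linux 2.6.9. */
int unwatch_fd(int epfd, int fd)
{
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
}

/* Demo: returns 0 if every step behaves as expected, otherwise the
 * number of the first step that did not. */
int demo_ctl(void)
{
    int p[2], failed = 0;
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return 1;
    if (pipe(p) == -1) {
        close(epfd);
        return 1;
    }

    if (watch_fd(epfd, p[0], EPOLLIN) != 0)
        failed = 2;
    /* Adding the same descriptor twice fails with EEXIST. */
    else if (watch_fd(epfd, p[0], EPOLLIN) != -1 || errno != EEXIST)
        failed = 3;
    else if (rewatch_fd(epfd, p[0], EPOLLIN | EPOLLET) != 0)
        failed = 4;
    else if (unwatch_fd(epfd, p[0]) != 0)
        failed = 5;
    /* Deleting a descriptor that is not registered fails with ENOENT. */
    else if (unwatch_fd(epfd, p[0]) != -1 || errno != ENOENT)
        failed = 6;

    close(p[0]);
    close(p[1]);
    close(epfd);
    return failed;
}
```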

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Waits for any of the events registered with epoll_ctl until at least one occurs or the timeout elapses. The timeout is given in milliseconds; a value of -1 blocks indefinitely. Returns the events that occurred in events, up to maxevents at once.
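
A small sketch of waiting on a single descriptor is shown below. The names wait_ready, demo_wait and MAX_EVENTS are hypothetical, the watched descriptor is assumed to be the read end of a pipe, and the retry on EINTR simply restarts the full timeout, which a real program may want to refine.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 8  /* illustrative batch size */

/* Hypothetical helper: wait up to timeout_ms milliseconds and return
 * how many descriptors are ready (0 on timeout, -1 on error). When
 * first_fd is non-NULL and something is ready, it receives the
 * descriptor stored in the first returned event. */
int wait_ready(int epfd, int timeout_ms, int *first_fd)
{
    struct epoll_event events[MAX_EVENTS];
    int n;

    do {
        n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    } while (n == -1 && errno == EINTR);  /* interrupted by a signal */

    if (n > 0 && first_fd != NULL)
        *first_fd = events[0].data.fd;
    return n;
}

/* Demo: returns 1 if an empty pipe times out and a written pipe is
 * reported as readable, 0 otherwise. */
int demo_wait(void)
{
    int p[2], ready_fd = -1, ok = 0;
    struct epoll_event ev = {0};
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return 0;
    if (pipe(p) == -1) {
        close(epfd);
        return 0;
    }

    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        int idle = wait_ready(epfd, 0, NULL);  /* nothing written yet */
        if (write(p[1], "x", 1) == 1) {
            int busy = wait_ready(epfd, 1000, &ready_fd);
            ok = (idle == 0 && busy == 1 && ready_fd == p[0]);
        }
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return ok;
}
```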

Triggering modes

epoll provides both edge-triggered and level-triggered modes. In edge-triggered mode, a call to epoll_wait will return only when a new event is enqueued with the epoll object, while in level-triggered mode, epoll_wait will return as long as the condition holds.

For instance, if a pipe registered with epoll has received data, a call to epoll_wait will return, signaling the presence of data to be read. Suppose the reader consumes only part of the data in the buffer. In level-triggered mode, further calls to epoll_wait will return immediately, as long as the pipe's buffer contains data to be read. In edge-triggered mode, however, epoll_wait will return only once new data is written to the pipe. [1]
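
The pipe scenario can be reproduced with a short sketch. The function name ready_after_partial_read is hypothetical; it writes four bytes, waits once, reads two bytes, and then asks epoll_wait again with a zero timeout so the call never blocks.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: mode_flags is 0 for level-triggered or EPOLLET
 * for edge-triggered. Returns the number of events reported by the
 * second epoll_wait call, or -1 if any setup step failed. */
int ready_after_partial_read(uint32_t mode_flags)
{
    int p[2], result = -1;
    char buf[2];
    struct epoll_event ev = {0}, out;
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return -1;
    if (pipe(p) == -1) {
        close(epfd);
        return -1;
    }

    ev.events = EPOLLIN | mode_flags;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "abcd", 4) == 4 &&           /* four bytes arrive */
        epoll_wait(epfd, &out, 1, 1000) == 1 &&  /* first notification */
        read(p[0], buf, sizeof buf) == 2) {      /* consume only two */
        /* Zero timeout: report what is ready right now, never block. */
        result = epoll_wait(epfd, &out, 1, 0);
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return result;
}
```

Calling ready_after_partial_read(0) yields 1, because two unread bytes keep the level-triggered condition true, while ready_after_partial_read(EPOLLET) yields 0, because no new data has been written since the first notification.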

Criticism

Bryan Cantrill pointed out that epoll had mistakes that could have been avoided, had it learned from its predecessors: input/output completion ports, event ports (Solaris) and kqueue. [5] However, a large part of his criticism was addressed by epoll's EPOLLONESHOT and EPOLLEXCLUSIVE options. EPOLLONESHOT was added in version 2.6.2 of the Linux kernel mainline, released in February 2004. EPOLLEXCLUSIVE was added in version 4.5, released in March 2016. [6]
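
To illustrate what EPOLLONESHOT changes, the sketch below registers a pipe with that flag, lets one event be delivered, and then re-arms the descriptor with EPOLL_CTL_MOD. The function name oneshot_sequence and its results array are hypothetical and serve only this example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: fills results with the return values of three
 * epoll_wait calls (first delivery, after delivery, after re-arming).
 * Returns 0 if the whole sequence ran, -1 on a setup failure. */
int oneshot_sequence(int results[3])
{
    int p[2], rc = -1;
    struct epoll_event ev = {0}, out;
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return -1;
    if (pipe(p) == -1) {
        close(epfd);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        /* The event is delivered once ... */
        results[0] = epoll_wait(epfd, &out, 1, 1000);
        /* ... then the descriptor is disabled, although the byte is
         * still unread. */
        results[1] = epoll_wait(epfd, &out, 1, 0);
        /* Re-arm with the same mask; ev still holds
         * EPOLLIN | EPOLLONESHOT. */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) == 0) {
            results[2] = epoll_wait(epfd, &out, 1, 0);
            rc = 0;
        }
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return rc;
}
```

The three values come out as 1, 0 and 1: the descriptor stays silent between the first delivery and the explicit re-arm, so that only the thread that received the event handles it until it chooses to re-enable notifications.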

References

  1. "epoll(7) - Linux manual page". Man7.org. 2012-04-17. Retrieved 2014-03-01.
  2. Oleksiy Kovyrin (2006-04-13). "Using epoll() For Asynchronous Network Programming". Kovyrin.net. Retrieved 2014-03-01.
  3. "The Implementation of epoll (1)". idndx.com. September 2014.
  4. Love, Robert (2013). Linux System Programming (Second ed.). O'Reilly. pp. 97, 98. ISBN 978-1-449-33953-1.
  5. "Ubuntu Slaughters Kittens | BSD Now 103". YouTube. Archived at Ghostarchive and the Wayback Machine.
  6. "Epoll is fundamentally broken 1/2". idea.popcount.org. 2017-02-20. Retrieved 2017-10-06.