Everything is a file

"Everything is a file" is an approach to interface design in Unix derivatives. While this turn of phrase does not as such figure as a Unix design principle or philosophy, it is a common way to analyse designs, and informs the design of new interfaces in a way that prefers, in rough order of import:

  1. representing objects as file descriptors in favour of alternatives like abstract handles or names,
  2. operating on the objects with standard input/output operations returning byte streams to be interpreted by applications (rather than explicitly structured data), and
  3. allowing the usage or creation of objects by opening or creating files in the global filesystem name space.

The lines between the common interpretations of "file" and "file descriptor" are often blurred when analysing Unix, and nameability of files is the least important part of this principle; thus, it is sometimes described as "Everything is a file descriptor". [1] [2] [3]

The interpretation of this approach varies with time, with the philosophy of each system, and with the domain to which it is applied. The rest of this article demonstrates notable examples of some of those interpretations, and their repercussions.

Objects as file descriptors

Under Unix, a directory can be opened like a regular file; it contains fixed-size records of (i-node, filename). Directories cannot be written to directly, however: they are modified by the kernel as a side effect of creating and removing files within them. [4]

Some interfaces follow only a subset of these guidelines. For example, pipes do not exist on the filesystem: pipe() creates a pair of unnameable file descriptors. [5] Named pipes (FIFOs), added later and since standardised by POSIX, fill this gap.
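As a minimal sketch of the point above, the following Python uses os.pipe(), which wraps pipe(). The function name pipe_roundtrip is hypothetical, and the payload is assumed to be small enough to fit in the kernel's pipe buffer, since a larger write would block with no concurrent reader.

```python
import os
import stat


def pipe_roundtrip(payload: bytes) -> bytes:
    """Send a small payload through an anonymous pipe and read it back."""
    read_fd, write_fd = os.pipe()  # two descriptors, no name in the filesystem
    try:
        # The object behind the descriptor is FIFO-typed although no path exists.
        assert stat.S_ISFIFO(os.fstat(read_fd).st_mode)
        os.write(write_fd, payload)  # assumes payload fits in the pipe buffer
        os.close(write_fd)           # closing the writer lets the reader see EOF
        write_fd = -1
        chunks = []
        while chunk := os.read(read_fd, 4096):
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(read_fd)
        if write_fd != -1:
            os.close(write_fd)
```

The only way to hand such a pipe to another process is to pass the descriptors themselves (for instance by inheritance), because there is no path to open.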

This does not mean that the only operations on an object are reading and writing. ioctl() and similar interfaces allow for object-specific operations (like controlling tty characteristics). Directory file descriptors can be used to alter path look-ups (with a growing number of *at() system call variants like openat() [6]) or to change the working directory to the one represented by the file descriptor; [7] both uses prevent race conditions and are faster than the alternative of looking up the entire path. [8]
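The directory-descriptor operations described above can be sketched in Python, whose os.open() accepts a dir_fd argument (mapping to openat()) and which exposes os.fchdir(). The helper names are hypothetical, and the sketch assumes a platform where os.O_DIRECTORY and dir_fd are supported, such as Linux.

```python
import os


def write_then_read_relative(directory: str, name: str, data: bytes) -> bytes:
    """Create and re-read a file by name relative to a directory descriptor."""
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Both look-ups start at dir_fd, so renaming `directory` in the
        # meantime cannot redirect them to a different location.
        fd = os.open(name, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600, dir_fd=dir_fd)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
        with os.fdopen(fd, "rb") as f:
            return f.read()
    finally:
        os.close(dir_fd)


def cwd_after_fchdir(directory: str) -> str:
    """Change into a directory through its descriptor, then restore the old one."""
    saved = os.open(".", os.O_RDONLY | os.O_DIRECTORY)
    target = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fchdir(target)
        return os.getcwd()
    finally:
        os.fchdir(saved)  # go back without re-resolving the old path
        os.close(target)
        os.close(saved)
```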

Socket file descriptors require configuration (setting the remote address and connecting) after creation before they can be used for I/O. A server socket may not be used for I/O directly at all: in connection-based protocols, bind() assigns a local address to a socket, listen() marks that socket as accepting connections, and accept() waits until a remote process connects, then returns a new socket file descriptor representing that direct bidirectional connection.
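A minimal sketch of the server-side sequence follows, using Python's socket module. It assumes a Unix-domain stream socket in a temporary directory so that it runs without network access; the function name echo_once and the path demo.sock are hypothetical. In the full call sequence, accept() is the call that hands back the per-connection descriptor.

```python
import os
import socket
import tempfile


def echo_once(message: bytes) -> bytes:
    """Accept one connection and echo a short message back to the client."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)   # assign the local address
            server.listen(1)    # passive socket: never used for data itself
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)       # waits in the backlog until accepted
                conn, _ = server.accept()  # new descriptor for this connection
                with conn:
                    assert conn.fileno() != server.fileno()
                    client.sendall(message)
                    # A short message is assumed to arrive in one recv() call.
                    conn.sendall(conn.recv(len(message)))
                    return client.recv(len(message))
```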

This approach allows management of objects used by a program in a standardised manner, just like any other file: after binding to an address, privileges may be dropped; the server socket may be distributed among many processes by fork()ing (or closed in subprocesses that should not have access); or the individual connections' sockets may be given as standard input/output to specialised handlers for those connections, as in the super-server/CGI/inetd paradigms.
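The inetd-style hand-off can be illustrated with a short sketch. It assumes a socketpair() stands in for an accepted connection, and uses a hypothetical one-line handler that only knows about standard input and output; run_handler is likewise a name chosen for illustration.

```python
import socket
import subprocess
import sys

# Hypothetical handler: reads one line from stdin, writes it upper-cased to stdout.
HANDLER = "import sys; sys.stdout.write(sys.stdin.readline().upper())"


def run_handler(request: bytes) -> bytes:
    """Give one end of a connected socket to a child as its stdin and stdout."""
    parent_end, child_end = socket.socketpair()
    with parent_end, child_end:
        proc = subprocess.Popen(
            [sys.executable, "-c", HANDLER],
            stdin=child_end,   # the handler never learns it talks to a socket
            stdout=child_end,
        )
        parent_end.sendall(request)   # request is assumed to end with a newline
        proc.wait(timeout=30)         # handler output is flushed when it exits
        return parent_end.recv(4096)
```

The handler contains no networking code at all; the descriptor it inherits is what connects it to the peer.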

Many interfaces present in early Unixes that do not use file descriptors were duplicated in later designs. The alarm()/setitimer() system calls schedule the delivery of a signal after the specified time elapses; this timer persists across exec(), though under POSIX it is cleared in children created by fork(). The POSIX timer_create() API serves a similar function, but its timers are destroyed both in child processes and on exec(); these timers are identified by opaque handles. Both interfaces always deliver their completions asynchronously, and cannot be poll()ed/select()ed, making their integration into a complex event loop more difficult.

The timerfd design (originally found in Linux) turns each timer object into a file descriptor, which can be individually observed with poll() and similar calls, and whose inheritance by child processes can be controlled with the standard close()/CLOEXEC/CLOFORK controls.

Whereas the POSIX API needs a dedicated call, timer_getoverrun(), to report how many times the timer elapsed, a timerfd returns this count as the result of read(). This operation blocks, so waiting until a timerfd elapses is as easy as reading from it. There is no way to atomically do this with classic Unix or POSIX timers. The timer can also be inspected without blocking by performing a non-blocking read (a standard I/O operation).
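The read-based interface can be sketched as follows. This is an illustration under several assumptions: a 64-bit Linux system whose C library exports timerfd_create() and timerfd_settime(), the Linux value 1 for CLOCK_MONOTONIC, and struct layouts written out by hand through ctypes. The function name expirations_after is hypothetical.

```python
import ctypes
import os
import sys
import time

CLOCK_MONOTONIC = 1  # assumed Linux value from <time.h>


class Timespec(ctypes.Structure):
    # Assumes a 64-bit Linux ABI where time_t and long are both C long.
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


class Itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", Timespec), ("it_value", Timespec)]


libc = ctypes.CDLL(None, use_errno=True)


def expirations_after(interval_ns: int, wait_s: float) -> int:
    """Arm a periodic timerfd, sleep, then read how often it expired."""
    fd = libc.timerfd_create(CLOCK_MONOTONIC, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "timerfd_create failed")
    try:
        # interval_ns must be below one second in this simplified sketch.
        spec = Itimerspec(Timespec(0, interval_ns), Timespec(0, interval_ns))
        if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) != 0:
            raise OSError(ctypes.get_errno(), "timerfd_settime failed")
        time.sleep(wait_s)
        # read() yields an unsigned 64-bit count of expirations since the last
        # read; it would block here if the timer had not fired yet.
        return int.from_bytes(os.read(fd, 8), sys.byteorder)
    finally:
        os.close(fd)  # the timer is released like any other descriptor
```

Because the timer is an ordinary descriptor, the same fd could equally be registered with poll() or a similar call instead of being read directly.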

Objects in the filesystem namespace

Special file types

Device special files are a defining characteristic of Unix: initially, opening a regular file with i-node number ≤40 (traditionally stored under /dev) instead returned a file descriptor that corresponded to a device and was handled by the device driver. The magic i-node number scheme was later codified into files with type S_IFBLK/S_IFCHR.

Opening special files is beholden to the same file-system permissions checks as opening regular files, allowing common access control — chown dmr /usr/dmr /dev/rk0; chmod o= /usr/dmr /dev/rk0 changes the ownership and file access mode of both the directory /usr/dmr and device /dev/rk0.
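To illustrate that a device node carries the same metadata as any other file, the sketch below inspects one with stat(). The function name describe_device is hypothetical, and /dev/null is used in the usage note only as a device that is assumed to exist on the system.

```python
import os
import stat


def describe_device(path: str):
    """Return (kind, mode string, major, minor) for a filesystem path."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "character"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        kind = "not a device"
    # The permission bits and owner are stored exactly as for regular files;
    # st_rdev additionally identifies the driver (major) and unit (minor).
    return kind, stat.filemode(st.st_mode), os.major(st.st_rdev), os.minor(st.st_rdev)
```

Calling describe_device("/dev/null") should report a character device whose mode string begins with "c", followed by ordinary rwx permission bits that chmod can change.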

For block devices (hard disks and tape drives), due to their size, this meant unique semantics: they were block-addressed (see [9]), and programs needed to be written specifically to work correctly with them. This is described as "extremely unfortunate", and later interfaces alleviate this. [a]

In many cases, magnetic tapes continue to have unique semantics: some tapes can be partitioned into "files", and the driver signals an end-of-file condition after the end of a partition is reached, so cp /dev/nrst0 file1; cp /dev/nrst0 file2 will create file1 and file2 consisting of two consecutive partitions of the tape. The driver provides an abstraction layer that presents a tape file descriptor as if it were a regular file, to fit into the "Everything is a file" paradigm. Specialised programs like mt are used to move between partitions on a tape like this.

Named pipes (FIFOs) appear as S_IFIFO-type files in the filesystem, can be renamed, and may be opened like regular files.

Under Unix derivatives, Unix-domain sockets appear as S_IFSOCK-type files in the filesystem, can be renamed, but cannot be open()ed — one must create the correct type of socket file descriptor and connect() explicitly. Under Plan 9, sockets in the filesystem may be opened like regular files.
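The difference between the two file types can be sketched as follows. The function name special_file_behaviour and the file names are hypothetical; the exact error reported when opening a socket file varies between systems, so the sketch only records whether open() succeeded.

```python
import os
import socket
import stat
import tempfile


def special_file_behaviour() -> dict:
    """Create a FIFO and a Unix-domain socket, then try to open() each."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "fifo")
        sock_path = os.path.join(tmp, "sock")
        os.mkfifo(fifo_path)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(sock_path)  # binding creates the S_IFSOCK file
            result = {
                "fifo_type": stat.S_ISFIFO(os.stat(fifo_path).st_mode),
                "sock_type": stat.S_ISSOCK(os.stat(sock_path).st_mode),
            }
            # O_NONBLOCK keeps the read-side open from waiting for a writer.
            os.close(os.open(fifo_path, os.O_RDONLY | os.O_NONBLOCK))
            result["fifo_opened"] = True
            try:
                os.close(os.open(sock_path, os.O_RDONLY))
                result["sock_opened"] = True
            except OSError:
                result["sock_opened"] = False  # must connect() instead
            return result
```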

As a replacement for dedicated system calls

Modern systems contain high-performance I/O event notification facilities: kqueue (BSD derivatives), epoll (Linux), IOCP (Windows NT, Solaris), and /dev/poll (Solaris). The control object is generally created (kqueue(), epoll_create()) and configured (kevent(), epoll_ctl()) with dedicated system calls. A /dev/poll instance, by contrast, is created by opening the file "/dev/poll" directly; objects to observe are configured by writing to it, and additional configuration is done with ioctl()s.
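For the dedicated-system-call style, a minimal sketch using Python's select.epoll wrapper is shown below. It assumes Linux, since epoll is not available elsewhere, and the function name pipe_becomes_readable is hypothetical.

```python
import os
import select


def pipe_becomes_readable(timeout: float = 1.0):
    """Report whether a pipe is readable before and after data is written."""
    read_fd, write_fd = os.pipe()
    ep = select.epoll()  # epoll_create(): the control object is itself an fd
    try:
        ep.register(read_fd, select.EPOLLIN)  # epoll_ctl(EPOLL_CTL_ADD, ...)
        ready_before = bool(ep.poll(0))       # nothing has been written yet
        os.write(write_fd, b"x")
        events = ep.poll(timeout)
        ready_after = any(
            fd == read_fd and mask & select.EPOLLIN for fd, mask in events
        )
        return ready_before, ready_after
    finally:
        ep.close()
        os.close(read_fd)
        os.close(write_fd)
```

Although creation and configuration go through dedicated calls here, the resulting epoll instance is still a file descriptor and is closed like one.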

Memory may be allocated by requesting an anonymous memory mapping, one that doesn't correspond to any file. On modern systems this can be done by specifying no file and MAP_ANONYMOUS; in UNIX System V Release 4, this was done by opening /dev/zero and mmap()ping it.
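Both ways of obtaining zero-filled memory can be sketched with Python's mmap module, where passing -1 as the file descriptor requests an anonymous mapping. The sketch assumes a system such as Linux on which /dev/zero can still be mapped; the function name zero_filled_mappings is hypothetical.

```python
import mmap
import os


def zero_filled_mappings(length: int = mmap.PAGESIZE):
    """Map memory anonymously and via /dev/zero; check and write both."""
    anon = mmap.mmap(-1, length)  # no file: an anonymous mapping
    fd = os.open("/dev/zero", os.O_RDWR)
    try:
        dev_zero = mmap.mmap(fd, length, flags=mmap.MAP_PRIVATE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    try:
        results = []
        for mapping in (anon, dev_zero):
            initially_zero = mapping[:] == bytes(length)
            mapping[:5] = b"hello"  # private writable memory in both cases
            results.append((initially_zero, mapping[:5]))
        return results
    finally:
        anon.close()
        dev_zero.close()
```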

API filesystems

Operating system APIs can be implemented as regular system calls, or as synthetic file-systems. In the former case, system state can only be inspected by specially-written programs shipped with the system, and any additional processing desired by the user needs to either filter and parse the output of those programs, execute them to write the desired state, or must be implemented in the native system programming language.

In the latter case, system state is presented as if it were regular files and directories. [12] On systems with a procfs, information about running processes can be obtained by looking at, canonically, /proc, which contains directories named after the PIDs running on the system. Each contains files like stat (status) with process metadata; cwd, exe, and root symbolic links to the process' working directory, executable image, and root directory; and directories like fd, which contains symbolic links to the files the process has opened, named after the file descriptors.

Because these attributes are presented as files and symbolic links, standard utilities work on them, and one can, say, inspect the identity of the process with grep Uid /proc/1392400/status, go to the same directory as a process is in with cd /proc/1392400/cwd, see what files a process has open with ls -l /proc/1392400/fd, then open a file that process has open with less /proc/1392400/fd/8. This improves ergonomics over parsing this data from the output of a utility. [13] [14]
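The same inspection needs nothing beyond ordinary file operations from a program. The sketch below assumes a Linux-style procfs mounted at /proc with a status file containing a "Uid:" line; the function name inspect_process is hypothetical, and "self" refers to the calling process.

```python
import os


def inspect_process(pid="self") -> dict:
    """Collect the real UID, working directory, and open files of a process."""
    base = f"/proc/{pid}"
    uid = None
    with open(f"{base}/status") as status:
        for line in status:
            if line.startswith("Uid:"):
                uid = int(line.split()[1])  # first number is the real UID
                break
    fds = {}
    for name in os.listdir(f"{base}/fd"):
        try:
            fds[int(name)] = os.readlink(f"{base}/fd/{name}")
        except FileNotFoundError:
            # The descriptor listdir() itself used is already closed again.
            pass
    return {"uid": uid, "cwd": os.readlink(f"{base}/cwd"), "fds": fds}
```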

Under Linux, symbolic links under procfs are "magic": they can actually behave like cross-filesystem hard links to the files they point to. This behaviour allows recovery of files removed from the filesystem but still open by a process, and permanently persisting files created by O_TMPFILE in the filesystem (which otherwise cannot be named).

4.4BSD-derived sysctls are key/value mappings managed by the sysctl program, which lists all variables with sysctl -a, shows the value of one variable with sysctl net.inet.ip.forwarding, and sets it with sysctl -w net.inet.ip.forwarding=1. Under Linux, the equivalent mechanism is provided by procfs under the /proc/sys tree: the respective operations can be done with find /proc/sys or grep -r ^ /proc/sys, cat /proc/sys/net/ipv4/ip_forward, and echo 1 > /proc/sys/net/ipv4/ip_forward.

For convenience or standards conformance, dedicated inspection tools (like ps and sysctl) may still be provided, using these filesystems as data sources/sinks.

sysfs [15] and debugfs [16] are similar Linux interfaces for further configuring the kernel: writing mem to /sys/power/state will trigger a suspend-to-RAM procedure, [17] and writing 2 to /sys/module/iwlwifi/parameters/led_mode will start blinking the Wi-Fi LED on activity.

These are synthetic file-systems because the contents of each file are not stored anywhere verbatim: when the file is read, the appropriate kernel data structures are serialised into the reading process' input buffer, and when the file is written to, the output buffer is parsed. [15] This means that the file abstraction is broken, since the file metadata isn't valid: depending on the filesystem, each file reports a size of 0 or PAGE_SIZE, even though reading the data will yield a different number of bytes.
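The mismatch between metadata and content is easy to observe. The sketch below compares the size reported by stat() with the number of bytes actually read; the function name reported_vs_actual is hypothetical, and /proc/self/status in the usage note assumes a Linux procfs.

```python
import os


def reported_vs_actual(path: str):
    """Return (size according to stat, number of bytes actually read)."""
    reported = os.stat(path).st_size
    with open(path, "rb") as f:
        actual = len(f.read())
    return reported, actual
```

For a regular file the two numbers agree. For a synthetic file such as /proc/self/status they are expected to differ, so programs that size a buffer from st_size before reading will not work on such files.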

Notes

  a. First in Version 4 Unix, by adding special seek() modes that multiply the offset by 512 in the kernel, [10] and finally in Version 7 Unix, by providing lseek() with a 32-bit argument. [11]

References

  1. "Linus Torvalds - 'everything is a file descriptor or a process'". Yarchive.net. Retrieved 2015-08-28.
  2. "Ghosts of Unix Past". Lwn.net. Retrieved 2015-08-28.
  3. Kernighan, Brian (October 18, 2019). UNIX - A History and a Memoir. Independently published. p. 76ff. ISBN 978-1695978553.
  4. Ken Thompson and Dennis Ritchie (3 November 1971). "DIRECTORY (V)" (PDF). UNIX Programmer's Manual. Bell Laboratories.
  5. Ken Thompson and Dennis Ritchie (February 1973). "PIPE (II)". UNIX Programmer's Manual (Third ed.). Bell Laboratories. ./man2/pipe.2
  6. "open, openat — open file". IEEE Std 1003.1-2024, The Open Group Base Specifications Issue 8. The IEEE and The Open Group. 2024.
  7. "fchdir — change working directory". IEEE Std 1003.1-2024, The Open Group Base Specifications Issue 8. The IEEE and The Open Group. 2024.
  8. "D. Portability Considerations (Informative), D.2 Portability Capabilities, D.2.3 Access to Data". IEEE Std 1003.1-2024, The Open Group Base Specifications Issue 8. The IEEE and The Open Group. 2024.
  9. Ken Thompson and Dennis Ritchie (3 November 1971). "/DEV/RF0 (IV)" (PDF). UNIX Programmer's Manual. Bell Laboratories.
  10. Ken Thompson and Dennis Ritchie (November 1973). "PIPE (II)". UNIX Programmer's Manual (Fourth ed.). Bell Laboratories. ./man2/pipe.2, and the "Addressing on the tape files, like that on the RK and RF disks, is block-oriented." stanza is gone.
  11. "LSEEK(2)". UNIX Programmer's Manual (Seventh ed.). Bell Laboratories. January 1979. usr/man/man2/lseek.2
  12. Benvenuti, Christian (2006). "3. User-Space-to-Kernel Interface". Understanding Linux network internals (Nachdr. ed.). Beijing Köln: O'Reilly. p. 58. ISBN 9780596002558.
  13. Xiao, Yang; Li, Frank Haizhon; Chen, Hui (2011). Handbook of security and networks. Hackensack (NJ): World scientific. p. 160. ISBN 9789814273039.
  14. "27. Upgrading and customizing the kernel". Red Hat Linux Networking and System Administration. John Wiley & Sons. 2007. p. 662. ISBN 9780471777311.
  15. Mochel, Patrick; Murphy, Mike (16 August 2011). "sysfs - The filesystem for exporting kernel objects — The Linux Kernel documentation". kernel.org. Archived from the original on 13 March 2024. Retrieved 15 June 2024.
  16. "sysfs, procfs, sysctl, debugfs and other similar kernel interfaces". John's Blog. 2013-11-20. Retrieved 2024-06-15.
  17. Wysocki, Rafael J. "System Power Management Sleep States". kernel.org. Retrieved 15 June 2024.