Process substitution

Last updated May 21, 2024

In computing, process substitution is a form of inter-process communication that allows the input or output of a command to appear as a file. The command is substituted in-line, where a file name would normally occur, by the command shell. This allows programs that normally only accept files to directly read from or write to another program.

History

Process substitution was available as a compile-time option for ksh88, the 1988 version of the KornShell from Bell Labs.^[1] The rc shell provides the feature as "pipeline branching" in Version 10 Unix, released in 1990.^[2] The Bash shell provided process substitution no later than version 1.14, released in 1994.^[3]

Example

The following examples use KornShell syntax.

The Unix diff command normally accepts the names of two files to compare, or one file name and standard input. Process substitution allows one to compare the output of two programs directly:

$ diff <(sort file1) <(sort file2)

The <(command) expression tells the command interpreter to run command and make its output appear as a file. The command can be any arbitrarily complex shell command.
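
As a minimal, self-contained sketch of the comparison above, the following script builds two small files with the same lines in a different order and compares their sorted contents. The scratch directory, file contents and the result variable are illustrative assumptions, not part of the original example.

```shell
# Requires bash, ksh93 or zsh; plain POSIX sh lacks process substitution.
tmpdir=$(mktemp -d)                              # scratch directory for sample data
printf '%s\n' pear apple fig > "$tmpdir/file1"   # hypothetical input files
printf '%s\n' fig pear apple > "$tmpdir/file2"

# diff receives two path names; each one yields the output of a sort process.
if diff <(sort "$tmpdir/file1") <(sort "$tmpdir/file2") > /dev/null; then
  result="same lines"
else
  result="different lines"
fi
echo "$result"

rm -r "$tmpdir"
```

Because both files hold the same three lines, the sorted streams are identical and the script reports "same lines".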

Without process substitution, the alternatives are:

Save the output of the command(s) to a temporary file, then read the temporary file(s).

$ sort file2 > /tmp/file2.sorted
$ sort file1 | diff - /tmp/file2.sorted
$ rm /tmp/file2.sorted

Create a named pipe (also known as a FIFO), start one command writing to the named pipe in the background, then run the other command with the named pipe as input.
```
$ mkfifo /tmp/sort2.fifo
$ sort file2 > /tmp/sort2.fifo &
$ sort file1 | diff - /tmp/sort2.fifo
$ rm /tmp/sort2.fifo
```

Both alternatives are more cumbersome.

Process substitution can also be used to capture output that would normally go to a file, and redirect it to the input of a process. The Bash syntax for writing to a process is >(command). Here is an example using the tee, wc and gzip commands that counts the lines in a file with wc -l and compresses it with gzip in one pass:

$ tee >(wc -l >&2) < bigfile | gzip > bigfile.gz
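
The following sketch runs the same one-pass count and compression on generated data so the result can be checked. The scratch directory, the 1000-line sample file and the variable names are assumptions made for illustration. Since the substituted wc process runs asynchronously, its output is collected through a command substitution, which reads until wc has closed its end of the pipe.

```shell
# Requires bash, ksh93 or zsh; plain POSIX sh lacks process substitution.
tmpdir=$(mktemp -d)
seq 1 1000 > "$tmpdir/bigfile"        # hypothetical 1000-line input

# tee copies its input to the wc process and to the pipe feeding gzip.
# wc prints the count on stderr; the braces send that stderr into the
# command substitution, which waits until wc has finished writing.
count=$( { tee >(wc -l >&2) < "$tmpdir/bigfile" | gzip > "$tmpdir/bigfile.gz"; } 2>&1 )

# Decompress again to check that the archive holds the same number of lines.
restored=$(gzip -dc "$tmpdir/bigfile.gz" | wc -l)

echo "lines counted: $((count)), lines after decompression: $((restored))"
rm -r "$tmpdir"
```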

Advantages

The main advantages of process substitution over its alternatives are:

Simplicity: The commands can be given in-line; there is no need to save temporary files or create named pipes first.
Performance: Reading directly from another process is often faster than having to write a temporary file to disk, then read it back in. This also saves disk space.
Parallelism: The substituted process can be running concurrently with the command reading its output or writing its input, taking advantage of multiprocessing to reduce the total time for the computation.
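
The parallelism point can be illustrated with a rough timing sketch. The two-second sleeps stand in for slow commands, and the variable names are illustrative. Both substituted processes start before cat runs, so the total time should be close to two seconds rather than four, although exact timings depend on the system.

```shell
# Requires bash, ksh93 or zsh; plain POSIX sh lacks process substitution.
start=$SECONDS                         # SECONDS counts whole seconds of shell run time

# Each substituted command takes about two seconds, and they run concurrently.
joined=$(cat <(sleep 2; echo first) <(sleep 2; echo second))

elapsed=$((SECONDS - start))
echo "$joined"
echo "elapsed: about ${elapsed}s"
```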

Mechanism

Under the hood, process substitution has two implementations. On systems that support /dev/fd (most Unix-like systems), the shell calls the pipe() system call, which returns a pair of file descriptors for a new anonymous pipe. It then builds the string /dev/fd/$fd from the descriptor $fd that the main command should open and substitutes that string on the command line. On systems without /dev/fd support, the shell calls mkfifo with a new temporary file name to create a named pipe and substitutes this file name on the command line. To illustrate the steps involved, consider the following simple process substitution on a system with /dev/fd support:

$ diff file1 <(sort file2)

The steps the shell performs are:

Create a new anonymous pipe. This pipe will be accessible with something like /dev/fd/63; you can see it with a command like echo <(true).
Execute the substituted command in the background (sort file2 in this case), piping its output to the anonymous pipe.
Execute the primary command, replacing the substituted command with the path of the anonymous pipe. In this case, the full command might expand to something like diff file1 /dev/fd/63.
When execution is finished, close the anonymous pipe.

For named pipes, the execution differs solely in the creation and deletion of the pipe: it is created with mkfifo (which is given a new temporary file name) and removed with unlink. All other aspects remain the same.
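
The substitution can be observed from inside the command that receives it. In this sketch, inspect is a hypothetical helper function that prints the argument it was given and then reads it like a file. The exact path is system-dependent: something like /dev/fd/63 where /dev/fd is supported, or a temporary FIFO name otherwise.

```shell
# Requires bash, ksh93 or zsh; plain POSIX sh lacks process substitution.

# Hypothetical helper: show the path the shell substituted, then read from it.
inspect() {
  printf 'argument: %s\n' "$1"
  printf 'contents: %s\n' "$(cat "$1")"
}

# The function sees an ordinary path name in place of <(echo hello).
output=$(inspect <(echo hello))
printf '%s\n' "$output"
```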

Limitations

The "files" created are not seekable, which means the process reading or writing to the file cannot perform random access; it must read or write once from start to finish. Programs that explicitly check the type of a file before opening it may refuse to work with process substitution, because the "file" resulting from process substitution is not a regular file. Additionally, before Bash 4.4 (released September 2016), it was not possible to obtain the exit code of a process substitution command from the shell that created the process substitution.^[4]
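
Two of these limitations can be probed from the shell. The sketch below assumes Bash 4.4 or later, where $! is set to the process ID of the most recent process substitution so that wait can report its exit status; the variable names and the exit code 3 are arbitrary illustrations.

```shell
# Assumes Bash 4.4 or later for the wait "$!" part.

# The substituted "file" is a pipe, so a regular-file test fails.
if [ -f <(true) ]; then is_regular=yes; else is_regular=no; fi
if [ -p <(true) ]; then is_pipe=yes; else is_pipe=no; fi
echo "regular file: $is_regular, pipe: $is_pipe"

# The substituted command writes one line, then exits with status 3.
cat <(echo data; exit 3) > /dev/null

# Collect the exit status of the last process substitution.
status=0
wait "$!" || status=$?
echo "exit status of the substituted command: $status"
```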

References

1. Rosenblatt, Bill; Robbins, Arnold (April 2002). "Appendix A.2". Learning the Korn Shell (2nd ed.). O'Reilly & Associates. ISBN 0-596-00195-9.
2. Duff, Tom (1990). Rc — A Shell for Plan 9 and UNIX Systems. CiteSeerX 10.1.1.41.3287.
3. Ramey, Chet (August 18, 1994). Bash 1.14 release notes. Free Software Foundation. Available in the GNU source archive of version 1.14.7 as of 12 February 2016.
4. "ProcessSubstitution". Greg's Wiki. 22 Sep 2016. Retrieved 2021-02-06.