Pipeline (Unix)

A pipeline of three program processes run on a text terminal

In Unix-like computer operating systems, a pipeline is a mechanism for inter-process communication using message passing. A pipeline is a set of processes chained together by their standard streams, so that the output text of each process (stdout) is passed directly as input (stdin) to the next one. The second process is started while the first process is still executing, and they are executed concurrently. The concept of pipelines was championed by Douglas McIlroy at Unix's ancestral home of Bell Labs, during the development of Unix, shaping its toolbox philosophy. [1] [2] It is named by analogy to a physical pipeline. A key feature of these pipelines is their "hiding of internals" (Ritchie & Thompson, 1974). This in turn allows for more clarity and simplicity in the system.

This article is about anonymous pipes, where data written by one process is buffered by the operating system until it is read by the next process, and this uni-directional channel disappears when the processes are completed. This differs from named pipes, where messages are passed to or from a pipe that is named by making it a file, and remains after the processes are completed. The standard shell syntax for anonymous pipes is to list multiple commands, separated by vertical bars ("pipes" in common Unix verbiage):

command1 | command2 | command3

For example, to list files in the current directory (ls), retain only the lines of ls output containing the string "key" (grep), and view the result in a scrolling page (less), a user types the following into the command line of a terminal:

ls -l | grep key | less

The command ls -l is executed as a process, the output (stdout) of which is piped to the input (stdin) of the process for grep key; and likewise for the process for less. Each process takes input from the previous process and produces output for the next process via standard streams. Each | tells the shell to connect the standard output of the command on the left to the standard input of the command on the right by an inter-process communication mechanism called an (anonymous) pipe, implemented in the operating system. Pipes are unidirectional; data flows through the pipeline from left to right.
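
The concurrent execution of the stages can be observed with a producer that never finishes on its own. The following is a minimal sketch, assuming the common yes and head utilities are available; the variable name out is chosen only for illustration.

```shell
# yes prints "y" endlessly. If the stages ran one after another, this
# pipeline would never finish. Because both processes run concurrently,
# head exits after three lines, and yes is stopped the next time it
# writes to the now-closed pipe.
out=$(yes | head -n 3)
echo "$out"
```

The pipeline returns as soon as head has read its three lines, which would not be possible if the shell waited for yes to complete before starting head.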

Example

Below is an example of a pipeline that implements a kind of spell checker for the web resource indicated by a URL. An explanation of what it does follows.

curl "https://en.wikipedia.org/wiki/Pipeline_(Unix)" | sed 's/[^a-zA-Z ]/ /g' | tr 'A-Z ' 'a-z\n' | grep '[a-z]' | sort -u | comm -23 - <(sort /usr/share/dict/words) | less
  1. curl obtains the HTML contents of a web page (could use wget on some systems).
  2. sed replaces all characters (from the web page's content) that are not spaces or letters, with spaces. (Newlines are preserved.)
  3. tr changes all of the uppercase letters into lowercase and converts the spaces in the lines of text to newlines (each 'word' is now on a separate line).
  4. grep includes only lines that contain at least one lowercase alphabetical character (removing any blank lines).
  5. sort sorts the list of 'words' into alphabetical order, and the -u switch removes duplicates.
  6. comm finds lines in common between two files; -23 suppresses the lines unique to the second file and those that are common to both, leaving only those that are found only in the first file named. The - in place of a filename causes comm to use its standard input (from the pipeline in this case). sort /usr/share/dict/words sorts the contents of the words file alphabetically, as comm expects, and <( ... ) makes the sorted output available under a file name (via process substitution), which comm reads. The result is a list of words (lines) that are not found in /usr/share/dict/words.
  7. less allows the user to page through the results.
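
The same steps can be tried offline on a small sample. The sketch below is an illustration only: it replaces the web page with a fixed sentence, replaces /usr/share/dict/words with a hypothetical four-word list written to a temporary file (created with mktemp, which is widely available but not specified by POSIX), and leaves out the final less so that the result can be captured.

```shell
# Hypothetical miniature dictionary, already in sorted order.
dict=$(mktemp)
printf 'a\nis\npipeline\nthis\n' >"$dict"

# Same stages as above: strip non-letters, lowercase and split into
# one word per line, drop blank lines, sort and deduplicate, then keep
# only the words that do not appear in the dictionary.
misspelled=$(printf 'This is a Pipelne, a pipeline!\n' |
    sed 's/[^a-zA-Z ]/ /g' |
    tr 'A-Z ' 'a-z\n' |
    grep '[a-z]' |
    sort -u |
    comm -23 - "$dict")

echo "$misspelled"
rm "$dict"
```

With this input the only word missing from the sample dictionary is the deliberately misspelled "pipelne".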

Pipelines in command line interfaces

All widely used Unix shells have a special syntax construct for the creation of pipelines. In all of them, one writes the commands in sequence, separated by the ASCII vertical bar character | (which, for this reason, is often called the "pipe character"). The shell starts the processes and arranges for the necessary connections between their standard streams (including some amount of buffer storage).

Error stream

By default, the standard error streams ("stderr") of the processes in a pipeline are not passed on through the pipe; instead, they are merged and directed to the console. However, many shells have additional syntax for changing this behavior. In the csh shell, for instance, using |& instead of | signifies that the standard error stream should also be merged with the standard output and fed to the next process. The Bash shell can also merge standard error with |& since version 4.0 [3] or using 2>&1, as well as redirect it to a different file.
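
The difference between the default behavior and the 2>&1 form can be sketched as follows. The function emit is a hypothetical helper defined only for this illustration; it writes one line to each stream, and grep -c . counts the lines that actually arrive through the pipe.

```shell
# Hypothetical helper: one line to stdout, one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Default: only stdout travels through the pipe. The stderr line would
# normally appear on the console; it is discarded here to keep the
# output quiet.
plain=$( { emit | grep -c .; } 2>/dev/null )

# With 2>&1, stderr is redirected to wherever stdout already points,
# which is the pipe, so both lines reach the next process.
merged=$(emit 2>&1 | grep -c .)

echo "plain: $plain, merged: $merged"
```

The first count is 1 and the second is 2, since the redirection 2>&1 is processed after the shell has connected standard output to the pipe.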

Pipemill

In the most commonly used simple pipelines the shell connects a series of sub-processes via pipes, and executes external commands within each sub-process. Thus the shell itself is doing no direct processing of the data flowing through the pipeline.

However, it's possible for the shell to perform processing directly, using a so-called mill or pipemill (since a while command is used to "mill" over the results from the initial command). This construct generally looks something like:

command | while read -r var1 var2 ...; do
    # process each line, using variables as parsed into var1, var2, etc
    # (note that this may be a subshell: var1, var2 etc will not be available
    # after the while loop terminates; some shells, such as zsh and newer
    # versions of Korn shell, process the commands to the left of the pipe
    # operator in a subshell)
done

Such a pipemill may not perform as intended if the body of the loop includes commands, such as cat and ssh, that read from stdin: [4] on the loop's first iteration, such a program (let's call it the drain) will read the remaining output from command, and the loop will then terminate (with results depending on the specifics of the drain). There are a couple of possible ways to avoid this behavior. First, some drains support an option to disable reading from stdin (e.g. ssh -n). Alternatively, if the drain does not need to read any input from stdin to do something useful, it can be given < /dev/null as input.
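
The effect of a drain, and the < /dev/null remedy, can be reproduced with a short sketch. Here cat stands in for the drain, printf stands in for command, and the variable names drained and protected are chosen only for illustration.

```shell
# Unprotected: cat inherits the loop's stdin and swallows the remaining
# lines on the first iteration, so the loop body runs only once.
drained=$(printf 'a\nb\nc\n' | while read -r line; do
    echo "got $line"
    cat >/dev/null
done | grep -c '^got')

# Protected: cat is given /dev/null as input and leaves the pipe alone,
# so the loop sees every line.
protected=$(printf 'a\nb\nc\n' | while read -r line; do
    echo "got $line"
    cat </dev/null >/dev/null
done | grep -c '^got')

echo "drained: $drained, protected: $protected"
```

The unprotected loop processes one line and the protected loop processes all three.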

As all components of a pipe are run in parallel, a shell typically forks a subprocess (a subshell) to handle its contents, making it impossible to propagate variable changes to the outside shell environment. To remedy this issue, the "pipemill" can instead be fed from a here document containing a command substitution, which waits for the pipeline to finish running before milling through the contents. Alternatively, a named pipe or a process substitution can be used for parallel execution. GNU bash also has a lastpipe option to disable forking for the last pipe component. [5]
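
The here-document remedy can be sketched as below. The counter name count and the three-line sample input are assumptions made for the example. What the first echo prints depends on the shell: shells that run the loop in a subshell (for instance bash without lastpipe) report 0, while shells that run the last pipeline component in the current shell report 3.

```shell
# Counter updated inside a pipeline: where the loop runs in a subshell,
# the change is lost when the pipeline ends.
count=0
printf 'a\nb\nc\n' | while read -r line; do
    count=$((count + 1))
done
echo "after pipeline: $count"

# Same loop fed from a here document containing a command substitution.
# The command finishes first, then the loop runs in the current shell,
# so the counter keeps its final value.
count=0
while read -r line; do
    count=$((count + 1))
done <<EOF
$(printf 'a\nb\nc\n')
EOF
echo "after here document: $count"
```

The trade-off is the one described above: the command substitution must complete before the loop starts, so the two parts no longer run in parallel.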

Creating pipelines programmatically

Pipelines can be created under program control. The Unix pipe() system call asks the operating system to construct a new anonymous pipe object. This results in two new, opened file descriptors in the process: the read-only end of the pipe, and the write-only end. The pipe ends appear to be normal, anonymous file descriptors, except that they have no ability to seek.

To avoid deadlock and exploit parallelism, the Unix process with one or more new pipes will then, generally, call fork() to create new processes. Each process will then close the end(s) of the pipe that it will not be using before producing or consuming any data. Alternatively, a process might create new threads and use the pipe to communicate between them.

Named pipes may also be created using mkfifo() or mknod() and then presented as the input or output file to programs as they are invoked. They allow multi-path pipes to be created, and are especially effective when combined with standard error redirection, or with tee.
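
A multi-path pipe built from a named pipe and tee might look like the following sketch, which uses the mkfifo command-line utility rather than the system call. The directory and file names are hypothetical, and mktemp -d is assumed to be available.

```shell
dir=$(mktemp -d)
mkfifo "$dir/fifo"

# Second path: a background reader that blocks until a writer opens
# the named pipe, then stores an uppercase copy of what it receives.
tr 'a-z' 'A-Z' <"$dir/fifo" >"$dir/upper.txt" &

# First path: tee copies its input into the named pipe and also passes
# it on through the ordinary anonymous pipe to grep, which counts lines.
lines=$(printf 'one\ntwo\n' | tee "$dir/fifo" | grep -c .)
wait    # let the background reader finish

upper=$(cat "$dir/upper.txt")
echo "$lines lines; uppercase copy: $upper"
rm -r "$dir"
```

The same input thus flows down two paths at once: one through the anonymous pipe and one through the named pipe.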

Implementation

In most Unix-like systems, all processes of a pipeline are started at the same time, with their streams appropriately connected, and managed by the scheduler together with all other processes running on the machine. An important aspect of this, setting Unix pipes apart from other pipe implementations, is the concept of buffering: for example, a sending program may produce 5000 bytes per second, and a receiving program may only be able to accept 100 bytes per second, but no data is lost. Instead, the output of the sending program is held in the buffer. When the receiving program is ready to read data, it reads from the buffer. If the buffer is filled, the sending program is stopped (blocked) until at least some data is removed from the buffer by the receiver. In Linux, the default size of the buffer is 65,536 bytes (64 KiB). An open source third-party filter called bfr is available to provide larger buffers if required.
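
The blocking behavior can be illustrated with a fast producer and a deliberately delayed consumer. This is a sketch under assumed sizes: 200,000 bytes is chosen only because it exceeds the 64 KiB buffer mentioned above, and the one-second sleep stands in for a slow receiver.

```shell
# dd writes 200 blocks of 1000 bytes as fast as it can. The reader
# waits one second before consuming anything, so the writer fills the
# pipe buffer and blocks; it resumes as the reader drains the buffer.
# tr strips the padding that some wc implementations add.
received=$(dd if=/dev/zero bs=1000 count=200 2>/dev/null |
    { sleep 1; wc -c; } |
    tr -d ' ')

echo "received $received bytes"
```

Every byte written is eventually read, even though far more data is produced than the buffer can hold at any one moment.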

Network pipes

Tools like netcat and socat can connect pipes to TCP/IP sockets.

History

The pipeline concept was invented by Douglas McIlroy [6] and first described in the man pages of Version 3 Unix. [7] [8] McIlroy noticed that much of the time command shells passed the output file from one program as input to another.

His ideas were implemented in 1973 when ("in one feverish night", wrote McIlroy) Ken Thompson added the pipe() system call and pipes to the shell and several utilities in Version 3 Unix. "The next day", McIlroy continued, "saw an unforgettable orgy of one-liners as everybody joined in the excitement of plumbing." McIlroy also credits Thompson with the | notation, which greatly simplified the description of pipe syntax in Version 4. [9] [7]

Although developed independently, Unix pipes are related to, and were preceded by, the 'communication files' developed by Ken Lochner [10] in the 1960s for the Dartmouth Time Sharing System. [11]

In Tony Hoare's communicating sequential processes (CSP), McIlroy's pipes are further developed. [12]

The robot in the icon for Apple's Automator, which also uses a pipeline concept to chain repetitive commands together, holds a pipe in homage to the original Unix concept.

Other operating systems

This feature of Unix was borrowed by other operating systems, such as MS-DOS and the CMS Pipelines package on VM/CMS and MVS, and eventually came to be designated the pipes and filters design pattern of software engineering.


References

  1. Mahoney, Michael S. "The Unix Oral History Project: Release.0, The Beginning". McIlroy: It was one of the only places where I very nearly exerted managerial control over Unix, was pushing for those things, yes.
  2. "Prophetic Petroglyphs". cm.bell-labs.com. Archived from the original on 8 May 1999. Retrieved 22 May 2022.
  3. "Bash release notes". tiswww.case.edu. Retrieved 2017-06-14.
  4. "Shell Loop Interaction with SSH". 6 March 2012. Archived from the original on 6 March 2012.
  5. John1024. "How can I store the "find" command results as an array in Bash". Stack Overflow.
  6. "The Creation of the UNIX Operating System". Bell Labs. Archived from the original on September 14, 2004.
  7. McIlroy, M. D. (1987). A Research Unix reader: annotated excerpts from the Programmer's Manual, 1971–1986 (PDF) (Technical report). CSTR. Bell Labs. 139.
  8. Thompson K, Ritchie DM (February 1973). UNIX Programmer’s Manual Third Edition (PDF) (Technical report) (3rd ed.). Bell Labs. p. 178.
  9. "Pipes: A Brief Introduction". The Linux Information Project. August 23, 2006 [Created April 29, 2004]. Retrieved January 7, 2024.
  10. "Dartmouth Timesharing" (DOC). Rochester Institute of Technology. Retrieved January 7, 2024.
  11. "Data". cm.bell-labs.com. Archived from the original on 20 February 1999. Retrieved 22 May 2022.
  12. Cox, Russ. "Bell Labs and CSP Threads". Swtchboard. Retrieved January 7, 2024.