Pipeline (software)

Last updated

In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, functions, etc.), arranged so that the output of each element is the input of the next. The concept is analogous to a physical pipeline. Usually some amount of buffering is provided between consecutive elements. The information that flows in these pipelines is often a stream of records, bytes, or bits, and the elements of a pipeline may be called filters. This is also called the pipe(s) and filters design pattern. Connecting elements into a pipeline is analogous to function composition.

Contents

Narrowly speaking, a pipeline is linear and one-directional, though sometimes the term is applied to more general flows. For example, a primarily one-directional pipeline may have some communication in the other direction, known as a return channel or backchannel, as in the lexer hack, or a pipeline may be fully bi-directional. Flows with one-directional trees and directed acyclic graph topologies behave similarly to linear pipelines. The lack of cycles in such flows makes them simple, and thus they may be loosely referred to as "pipelines".

Implementation

Pipelines are often implemented in a multitasking OS, by launching all elements at the same time as processes, and automatically servicing the data read requests by each process with the data written by the upstream process. This can be called a multiprocessed pipeline. In this way, the scheduler will naturally switch the CPU among the processes so as to minimize its idle time. In other common models, elements are implemented as lightweight threads or as coroutines to reduce the OS overhead often involved with processes. Depending on the OS, threads may be scheduled directly by the OS or by a thread manager. Coroutines are always scheduled by a coroutine manager of some form.

Read and write requests are usually blocking operations. This means that the execution of the source process, upon writing, is suspended until all data can be written to the destination process. Likewise, the execution of the destination process, upon reading, is suspended until at least some of the requested data can be obtained from the source process. This cannot lead to a deadlock, where both processes would wait indefinitely for each other to respond, since at least one of the processes will soon have its request serviced by the operating system, and continue to run.

For performance, most operating systems implementing pipes use pipe buffers, which allow the source process to provide more data than the destination process is currently able or willing to receive. Under most Unixes and Unix-like operating systems, a special command is also available, typically called "buffer", that implements a pipe buffer of potentially much larger and configurable size. This command can be useful if the destination process is significantly slower than the source process, but it is desired that the source process complete its task as soon as possible. E.g., if the source process consists of a command which reads an audio track from a CD and the destination process consists of a command which compresses the waveform audio data to a format like MP3. In this case, buffering the entire track in a pipe buffer would allow the CD drive to spin down more quickly, and enable the user to remove the CD from the drive before the encoding process has finished.

Such a buffer command can be implemented using system calls for reading and writing data. Wasteful busy waiting can be avoided by using facilities such as poll or select or multithreading.

Some notable examples of pipeline software systems include:

VM/CMS and z/OS

CMS Pipelines is a port of the pipeline idea to VM/CMS and z/OS systems. It supports much more complex pipeline structures than Unix shells, with steps taking multiple input streams and producing multiple output streams. (Such functionality is supported by the Unix kernel, but few programs use it as it makes for complicated syntax and blocking modes, although some shells do support it via arbitrary file descriptor assignment).

Traditional application programs on IBM mainframe operating systems have no standard input and output streams to allow redirection or piping. Instead of spawning processes with external programs, CMS Pipelines features a lightweight dispatcher to concurrently execute instances of more than 200 built-in programs that implement typical UNIX utilities and interface to devices and operating system services. In addition to the built-in programs, CMS Pipelines defines a framework to allow user-written REXX programs with input and output streams that can be used in the pipeline.

Data on IBM mainframes typically resides in a record-oriented filesystem and connected I/O devices operate in record mode rather than stream mode. As a consequence, data in CMS Pipelines is handled in record mode. For text files, a record holds one line of text. In general, CMS Pipelines does not buffer the data but passes records of data in a lock-step fashion from one program to the next. This ensures a deterministic flow of data through a network of interconnected pipelines.

Object pipelines

Beside byte stream-based pipelines, there are also object pipelines. In an object pipeline, processing elements output objects instead of text. PowerShell includes an internal object pipeline that transfers .NET objects between functions within the PowerShell runtime. Channels, found in the Limbo programming language, are other examples of this metaphor.

Pipelines in GUIs

Graphical environments such as RISC OS and ROX Desktop also use pipelines. Rather than providing a save dialog box containing a file manager to let the user specify where a program should write data, RISC OS and ROX provide a save dialog box containing an icon (and a field to specify the name). The destination is specified by dragging and dropping the icon. The user can drop the icon anywhere an already-saved file could be dropped, including onto icons of other programs. If the icon is dropped onto a program's icon, it is loaded and the contents that would otherwise have been saved are passed in on the new program's standard input stream.

For instance, a user browsing the world-wide web might come across a .gz compressed image which they want to edit and re-upload. Using GUI pipelines, they could drag the link to their de-archiving program, drag the icon representing the extracted contents to their image editor, edit it, open the save as dialog, and drag its icon to their uploading software.

Conceptually, this method could be used with a conventional save dialog box, but this would require the user's programs to have an obvious and easily accessible location in the filesystem. As this is often not the case, GUI pipelines are rare.

Other considerations

The name "pipeline" comes from a rough analogy with physical plumbing in that a pipeline usually [1] allows information to flow in only one direction, like water often flows in a pipe.

Pipes and filters can be viewed as a form of functional programming, using byte streams as data objects. More specifically, they can be seen as a particular form of monad for I/O. [2]

The concept of pipeline is also central to the Cocoon web development framework or to any XProc (the W3C Standards) implementations, where it allows a source stream to be modified before eventual display.

This pattern encourages the use of text streams as the input and output of programs. This reliance on text has to be accounted when creating graphic shells to text programs.

See also

Notes

  1. There are exceptions, such as "broken pipe" signals.
  2. "Monadic I/O and UNIX shell programming".

Related Research Articles

<span class="mw-page-title-main">Windowing system</span> Software that manages separately different parts of display screens

In computing, a windowing system is a software suite that manages separately different parts of display screens. It is a type of graphical user interface (GUI) which implements the WIMP paradigm for a user interface.

In computer programming, standard streams are preconnected input and output communication channels between a computer program and its environment when it begins execution. The three input/output (I/O) connections are called standard input (stdin), standard output (stdout) and standard error (stderr). Originally I/O happened via a physically connected system console, but standard streams abstract this. When a command is executed via an interactive shell, the streams are typically connected to the text terminal on which the shell is running, but can be changed with redirection or a pipeline. More generally, a child process inherits the standard streams of its parent process.

In software, an XML pipeline is formed when XML processes, especially XML transformations and XML validations, are connected.

<span class="mw-page-title-main">Redirection (computing)</span> Form of interprocess communication

In computing, redirection is a form of interprocess communication, and is a function common to most command-line interpreters, including the various Unix shells that can redirect standard streams to user-specified locations. The concept of redirection is quite old, dating back to the earliest operating systems (OS). A discussion of the design goals for redirection can be found already in the 1971 description of the input-output subsystem of the Multics OS. However, prior to the introduction of UNIX OS with its "pipes", redirection in operating systems was hard or even impossible to do.

<span class="mw-page-title-main">CMS Pipelines</span>

CMS Pipelines is a feature of the VM/CMS operating system that allows the user to create and use a pipeline. The programs in a pipeline operate on a sequential stream of records. A program writes records that are read by the next program in the pipeline. Any program can be combined with any other because reading and writing is done through a device independent interface.

In computing, a named pipe is an extension to the traditional pipe concept on Unix and Unix-like systems, and is one of the methods of inter-process communication (IPC). The concept is also found in OS/2 and Microsoft Windows, although the semantics differ substantially. A traditional pipe is "unnamed" and lasts only as long as the process. A named pipe, however, can last as long as the system is up, beyond the life of the process. It can be deleted if no longer used. Usually a named pipe appears as a file, and generally processes attach to it for IPC.

<span class="mw-page-title-main">Pipeline (Unix)</span> Mechanism for inter-process communication using message passing

In Unix-like computer operating systems, a pipeline is a mechanism for inter-process communication using message passing. A pipeline is a set of processes chained together by their standard streams, so that the output text of each process (stdout) is passed directly as input (stdin) to the next one. The second process is started as the first process is still executing, and they are executed concurrently. The concept of pipelines was championed by Douglas McIlroy at Unix's ancestral home of Bell Labs, during the development of Unix, shaping its toolbox philosophy. It is named by analogy to a physical pipeline. A key feature of these pipelines is their "hiding of internals". This in turn allows for more clarity and simplicity in the system.

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

The seven standard Unix file types are regular, directory, symbolic link, FIFO special, block special, character special, and socket as defined by POSIX. Different OS-specific implementations allow more types than what POSIX requires. A file's type can be identified by the ls -l command, which displays the type in the first character of the file-system permissions field.

In computer science, asynchronous I/O is a form of input/output processing that permits other processing to continue before the I/O operation has finished. A name used for asynchronous I/O in the Windows API is overlapped I/O.

In computing, tee is a command in command-line interpreters (shells) using standard streams which reads standard input and writes it to both standard output and one or more files, effectively duplicating its input. It is primarily used in conjunction with pipes and filters. The command is named after the T-splitter used in plumbing.

A filter is a computer program or subroutine to process a stream, producing another stream. While a single filter can be used individually, they are frequently strung together to form a pipeline.

<span class="mw-page-title-main">Stream (computing)</span> Sequence of data items available over time

In computer science, a stream is a sequence of potentially unlimited data elements made available over time. A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches. Streams are processed differently from batch data.

On IBM mainframes, BatchPipes is a batch job processing utility which runs under the MVS/ESA operating system and later versions—OS/390 and z/OS.

In computing, a sink, or data sink generally refers to the destination of data flow.

In computer programming, flow-based programming (FBP) is a programming paradigm that defines applications as networks of black box processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. FBP is thus naturally component-oriented.

<span class="mw-page-title-main">PowerShell</span> Cross-platform command-line interface and scripting language for system and network administration

PowerShell is a task automation and configuration management program from Microsoft, consisting of a command-line shell and the associated scripting language. Initially a Windows component only, known as Windows PowerShell, it was made open-source and cross-platform on August 18, 2016, with the introduction of PowerShell Core. The former is built on the .NET Framework, the latter on .NET.

<span class="mw-page-title-main">GNU Emacs</span> GNU version of the Emacs text editor

GNU Emacs is a free software text editor. It was created by GNU Project founder Richard Stallman, based on the Emacs editor developed for Unix operating systems. GNU Emacs has been a central component of the GNU project and a flagship project of the free software movement. Its tag line is "the extensible self-documenting text editor."

<span class="mw-page-title-main">Command-line interface</span> Computer interface that uses text

A command-line interface (CLI) is a means of interacting with a computer program by inputting lines of text called command-lines. Command-line interfaces emerged in the mid-1960s, on computer terminals, as an interactive and more user-friendly alternative to the non-interactive interface available with punched cards.

The POSIX terminal interface is the generalized abstraction, comprising both an application programming interface for programs, and a set of behavioural expectations for users of a terminal, as defined by the POSIX standard and the Single Unix Specification. It is a historical development from the terminal interfaces of BSD version 4 and Seventh Edition Unix.