Pipeline (computing)

Last updated

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

Contents

Concept and motivation

Pipelining is a commonly used concept in everyday life. For example, in the assembly line of a car factory, each specific task—such as installing the engine, installing the hood, and installing the wheels—is often done by a separate work station. The stations carry out their tasks in parallel, each on a different car. Once a car has had one task performed, it moves to the next station. Variations in the time needed to complete the tasks can be accommodated by "buffering" (holding one or more cars in a space between the stations) and/or by "stalling" (temporarily halting the upstream stations), until the next station becomes available.

Suppose that assembling one car requires three tasks that take 20, 10, and 15 minutes, respectively. Then, if all three tasks were performed by a single station, the factory would output one car every 45 minutes. By using a pipeline of three stations, the factory would output the first car in 45 minutes, and then a new one every 20 minutes.

As this example shows, pipelining does not decrease the latency, that is, the total time for one item to go through the whole system. It does however increase the system's throughput, that is, the rate at which new items are processed after the first one.

In computing

In computing, a pipeline or data pipeline [1] is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

Computer-related pipelines include:

Design considerations

Balancing the stages

Since the throughput of a pipeline cannot be better than that of its slowest element, the designer should try to divide the work and resources among the stages so that they all take the same time to complete their tasks. In the car assembly example above, if the three tasks took 15 minutes each, instead of 20, 10, and 15 minutes, the latency would still be 45 minutes, but a new car would then be finished every 15 minutes, instead of 20.

Buffering

Under ideal circumstances, if all processing elements are synchronized and take the same amount of time to process, then each item can be received by each element just as it is released by the previous one, in a single clock cycle. That way, the items will flow through the pipeline at a constant speed, like waves in a water channel. In such "wave pipelines", [2] no synchronization or buffering is needed between the stages, besides the storage needed for the data items.

More generally, buffering between the pipeline stages is necessary when the processing times are irregular, or when items may be created or destroyed along the pipeline. For example, in a graphics pipeline that processes triangles to be rendered on the screen, an element that checks the visibility of each triangle may discard the triangle if it is invisible, or may output two or more triangular pieces of the element if they are partly hidden. Buffering is also needed to accommodate irregularities in the rates at which the application feeds items to the first stage and consumes the output of the last one.

The buffer between two stages may be simply a hardware register with suitable synchronization and signalling logic between the two stages. When a stage A stores a data item in the register, it sends a "data available" signal to the next stage B. Once B has used that data, it responds with a "data received" signal to A. Stage A halts, waiting for this signal, before storing the next data item into the register. Stage B halts, waiting for the "data available" signal, if it is ready to process the next item but stage A has not provided it yet.

If the processing times of an element are variable, the whole pipeline may often have to stop, waiting for that element and all the previous ones to consume the items in their input buffers. The frequency of such pipeline stalls can be reduced by providing space for more than one item in the input buffer of that stage. Such a multiple-item buffer is usually implemented as a first-in, first-out queue. The upstream stage may still have to be halted when the queue gets full, but the frequency of those events will decrease as more buffer slots are provided. Queueing theory can tell the number of buffer slots needed, depending on the variability of the processing times and on the desired performance.

Nonlinear pipelines

If some stage takes (or may take) much longer than the others, and cannot be speed up, the designer can provide two or more processing elements to carry out that task in parallel, with a single input buffer and a single output buffer. As each element finishes processing its current data item, it delivers it to the common output buffer, and takes the next data item from the common input buffer. This concept of "non-linear" or "dynamic" pipeline is exemplified by shops or banks that have two or more cashiers serving clients from a single waiting queue.

Dependencies between items

In some applications, the processing of an item Y by a stage A may depend on the results or effect of processing a previous item X by some later stage B of the pipeline. In that case, stage A cannot correctly process item Y until item X has cleared stage B.

This situation occurs very often in instruction pipelines. For example, suppose that Y is an arithmetic instruction that reads the contents of a register that was supposed to have been modified by an earlier instruction X. Let A be the stage that fetches the instruction operands, and B be the stage that writes the result to the specified register. If stage A tries to process instruction Y before instruction X reaches stage B, the register may still contain the old value, and the effect of Y would be incorrect.

In order to handle such conflicts correctly, the pipeline must be provided with extra circuitry or logic that detects them and takes the appropriate action. Strategies for doing so include:

Rather than halt while waiting for X to be finished, stage A may guess whether the branch will be taken or not, and fetch the next instruction Y based on that guess. If the guess later turns out to be incorrect (hopefully rarely), the system would have to backtrack and resume with the correct choice. Namely, all the changes that were made to the machine's state by stage A and subsequent stages based on that guess would have to be undone, the instructions following X already in the pipeline would have to be flushed, and stage A would have to restart with the correct instruction pointer. This branch prediction strategy is a special case of speculative execution.

Typical software implementations

To be effectively implemented, data pipelines need a CPU scheduling strategy to dispatch work to the available CPU cores, and the usage of data structures on which the pipeline stages will operate on. For example, UNIX derivatives may pipeline commands connecting various processes' standard IO, using the pipes implemented by the operating system. Some operating systems [ example needed ] may provide UNIX-like syntax to string several program runs in a pipeline, but implement the latter as simple serial execution, rather than true pipelining—namely, by waiting for each program to finish before starting the next one.[ citation needed ]

Lower level approaches may rely on the threads provided by the operating system to schedule work on the stages: both thread pool-based implementations or on a one-thread-per-stage are viable, and exist. [3]

Other strategies relying on cooperative multitasking exist, that do not need multiple threads of execution and hence additional CPU cores, such as using a round-robin scheduler with a coroutine-based framework. In this context, each stage may be instantiated with its own coroutine, yielding control back to the scheduler after finishing its round task. This approach may need careful control over the process' stages to avoid them abuse their time slice.

Costs and drawbacks

A pipelined system typically requires more resources (circuit elements, processing units, computer memory, etc.) than one that executes one batch at a time, because its stages cannot share those resources, and because buffering and additional synchronization logic may be needed between the elements.

Moreover, the transfer of items between separate processing elements may increase the latency, especially for long pipelines.

The additional complexity cost of pipelining may be considerable if there are dependencies between the processing of different items, especially if a guess-and-backtrack strategy is used to handle them. Indeed, the cost of implementing that strategy for complex instruction sets has motivated some radical proposals to simplify computer architecture, such as RISC and VLIW. Compilers also have been burdened with the task of rearranging the machine instructions so as to improve the performance of instruction pipelines.

New technologies

It's true that in recent years the demands on applications and their underlying hardware have been significant. For example, building pipelines with single node applications that trawl through the data row by row is no longer feasible with the volume and variety of big data. However, with the advent of data analytics engines such as Hadoop, or more recently Apache Spark, it's been possible to distribute large datasets across multiple processing nodes, allowing applications to reach heights of efficiency several hundred times greater than was thought possible before. The effect of this today is that even a mid-level PC using distributed processing in this fashion can handle the building and running of big data pipelines. [4]

See also

Related Research Articles

<span class="mw-page-title-main">Central processing unit</span> Central computer component which executes instructions

A central processing unit (CPU), also called a central processor, main processor, or just processor, is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

The control unit (CU) is a component of a computer's central processing unit (CPU) that directs the operation of the processor. A CU typically uses a binary decoder to convert coded instructions into timing and control signals that direct the operation of the other units.

In computing, the instruction register (IR) or current instruction register (CIR) is the part of a CPU's control unit that holds the instruction currently being executed or decoded. In simple processors, each instruction to be executed is loaded into the instruction register, which holds it while it is decoded, prepared and ultimately executed, which can take several steps.

<span class="mw-page-title-main">Process (computing)</span> Particular execution of a computer program

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

<span class="mw-page-title-main">Superscalar processor</span> CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.

In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control hazards.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.

In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a virtual machine in which the primary interaction is moving short-lived temporary values to and from a push down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automata with additional load/store operations or multiple stacks and hence are Turing-complete.

<span class="mw-page-title-main">Instruction cycle</span> Basic operation cycle of a computer

The instruction cycle is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions. It is composed of three main stages: the fetch stage, the decode stage, and the execute stage.

<span class="mw-page-title-main">CDC 7600</span> 1967 supercomputer

The CDC 7600 was designed by Seymour Cray to be the successor to the CDC 6600, extending Control Data's dominance of the supercomputer field into the 1970s. The 7600 ran at 36.4 MHz and had a 65 Kword primary memory using magnetic core and variable-size secondary memory. It was generally about ten times as fast as the CDC 6600 and could deliver about 10 MFLOPS on hand-compiled code, with a peak of 36 MFLOPS. In addition, in benchmark tests in early 1970 it was shown to be slightly faster than its IBM rival, the IBM System/360, Model 195. When the system was released in 1967, it sold for around $5 million in base configurations, and considerably more as options and features were added.

<span class="mw-page-title-main">CDC STAR-100</span>

The CDC STAR-100 is a vector supercomputer that was designed, manufactured, and marketed by Control Data Corporation (CDC). It was one of the first machines to use a vector processor to improve performance on appropriate scientific applications. It was also the first supercomputer to use integrated circuits and the first to be equipped with one million words of computer memory.

In computer engineering, out-of-order execution is a paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

In software engineering, a pipeline consists of a chain of processing elements, arranged so that the output of each element is the input of the next. The concept is analogous to a physical pipeline. Usually some amount of buffering is provided between consecutive elements. The information that flows in these pipelines is often a stream of records, bytes, or bits, and the elements of a pipeline may be called filters. This is also called the pipe(s) and filters design pattern which is monolithic. Its advantages are simplicity and low cost while its disadvantages are lack of elasticity, fault tolerance and scalability. Connecting elements into a pipeline is analogous to function composition.

<span class="mw-page-title-main">Microarchitecture</span> Component of computer engineering

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as μarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.

<span class="mw-page-title-main">Reservation station</span>

A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and is used by the Tomasulo algorithm for dynamic instruction scheduling.

<span class="mw-page-title-main">Arithmetic logic unit</span> Combinational digital circuit

In computing, an arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. It is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs).

Latency oriented processor architecture is the microarchitecture of a microprocessor designed to serve a serial computing thread with a low latency. This is typical of most central processing units (CPU) being developed since the 1970s. These architectures, in general, aim to execute as many instructions as possible belonging to a single serial thread, in a given window of time; however, the time to execute a single instruction completely from fetch to retire stages may vary from a few cycles to even a few hundred cycles in some cases. Latency oriented processor architectures are the opposite of throughput-oriented processors which concern themselves more with the total throughput of the system, rather than the service latencies for all individual threads that they work on.

References

  1. Data Pipeline Development Archived 2018-05-24 at the Wayback Machine Published by Dativa, retrieved 24 May, 2018
  2. O. Hauck; Sorin A. Huss; M. Garg (1999). "Two-phase asynchronous wave-pipelines and their application to a 2D-DCT". Proceedings. Fifth International Symposium on Advanced Research in Asynchronous Circuits and Systems. pp. 219–228. doi:10.1109/ASYNC.1999.761536. ISBN   0-7695-0031-5. S2CID   206515615.
  3. "MTDP". GitHub . September 2022.
  4. What is a Data Pipeline? Published by Data Pipelines, retrieved 11 March 2021

Bibliography