Software pipelining

Last updated

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

Contents

It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops. Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser. [1] Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo variable expansion is widely used in practice. [2] Gao et al. formulated optimal software pipelining in integer linear programming, culminating in validation of advanced heuristics in an evaluation paper. [3] This paper has a good set of references on the topic.

Example

Consider the following loop:

for i = 1 to bignumber   A(i)   B(i)   C(i) end

In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations execute in the following sequence:

A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...

Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9 cycles to complete: 3 clock cycles for A(1), 3 clock cycles for B(1), and 3 clock cycles for C(1).

Now consider the following sequence of instructions with software pipelining:

A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...

It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):

for i = 1 to (bignumber - 2) step 3   A(i)   A(i+1)   A(i+2)   B(i)   B(i+1)   B(i+2)   C(i)   C(i+1)   C(i+2) end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the total number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem, but note that software pipelining prevents the use of Duff's device.[ citation needed ]

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code:

for i = 1 to bignumber   A(i) ; 3 cycle latency   B(i) ; 3   C(i) ; 12(perhaps a floating point operation)   D(i) ; 3   E(i) ; 3   F(i) ; 3 end

would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache performance, see code bloat ). Even worse, the prologue (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization ineffectual.

By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):

prologue for i = 1 to (bignumber - 6)   A(i+6)   B(i+5)   C(i+4)   D(i+2) ; note that we skip i+3   E(i+1)   F(i) end epilogue

Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)
Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)
Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)
Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)
Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)
Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)
Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our example above:

; loop prologue (arranged on lines for clarity) A(1) A(2), B(1) A(3), B(2), C(1) A(4), B(3), C(2) ; cannot start D(1) yet A(5), B(4), C(3), D(1) A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:

; loop epilogue (arranged on lines for clarity) B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5) C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4) D(bignumber-1), E(bignumber-2), F(bignumber-3) D(bignumber), E(bignumber-1), F(bignumber-2) E(bignumber), F(bignumber-1) F(bignumber)

Difficulties of implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The epilogue would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.

In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler (1989) ( ISBN   0-89838-300-5). She calls it modulo variable expansion. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the former is 2 cycles. The pipelined body could then be:

A(i+2); B(i)

Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) will result in incorrect results.

However, if we replicate the scheduled loop body, the problem is solved:

 A(i+2); B(i)  A(i+3); B(i+1)

Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:

 r1 = A(i+2); B(i) = r1  r2 = A(i+3); B(i+1) = r2  i = i + 2 // Just to be clear

On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.

Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:

Related Research Articles

In computing, an optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program. Common requirements are to minimize a program's execution time, memory footprint, storage size, and power consumption.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.

<span class="mw-page-title-main">Digital signal processor</span> Specialized microprocessor optimized for digital signal processing

A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on MOS integrated circuit chips. They are widely used in audio signal processing, telecommunications, digital image processing, radar, sonar and speech recognition systems, and in common consumer electronic devices such as mobile phones, disk drives and high-definition television (HDTV) products.

In computer science, program optimization, code optimization, or software optimization, is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. In general, a computer program may be optimized so that it executes more rapidly, or to make it capable of operating with less memory storage or other resources, or draw less power.

In the C programming language, Duff's device is a way of manually implementing loop unrolling by interleaving two syntactic constructs of C: the do-while loop and a switch statement. Its discovery is credited to Tom Duff in November 1983, when Duff was working for Lucasfilm and used it to speed up a real-time animation program.

In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations for the purpose of locality optimization or parallelization or another loop overhead reduction of the loop nests. One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common linear algebra algorithms.

Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions in that architecture identify the operand(s) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.

In computer science, instruction scheduling is a compiler optimization used to improve instruction-level parallelism, which improves performance on machines with instruction pipelines. Put more simply, it tries to do the following without changing the meaning of the code:

<span class="mw-page-title-main">Microarchitecture</span> Component of computer engineering

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.

In computer science, loop dependence analysis is a process which can be used to find dependencies within iterations of a loop with the goal of determining different relationships between statements. These dependent relationships are tied to the order in which different statements access memory locations. Using the analysis of these relationships, execution of the loop can be organized to allow multiple processors to work on different portions of the loop in parallel. This is known as parallel processing. In general, loops can consume a lot of processing time when executed as serial code. Through parallel processing, it is possible to reduce the total execution time of a program through sharing the processing load among multiple processors.

In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

The polyhedral model is a mathematical framework for programs that perform large numbers of operations -- too large to be explicitly enumerated -- thereby requiring a compact representation. Nested loop programs are the typical, but not the only example, and the most common use of the model is for loop nest optimization in program optimization. The polyhedral method treats each loop iteration within nested loops as lattice points inside mathematical objects called polyhedra, performs affine transformations or more general non-affine transformations such as tiling on the polytopes, and then converts the transformed polytopes into equivalent, but optimized, loop nests through polyhedra scanning.

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in computing programs where data is stored in random access data structures. Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time. Such parallelism provides a speedup to overall execution time of the program, typically in line with Amdahl's law.

Tracing just-in-time compilation is a technique used by virtual machines to optimize the execution of a program at runtime. This is done by recording a linear sequence of frequently executed operations, compiling them to native machine code and executing them. This is opposed to traditional just-in-time (JIT) compilers that work on a per-method basis.

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed. Most modern computer processors have fast and local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

Zero-overhead looping is a feature of some processor instruction sets whose hardware can repeat the body of a loop automatically, rather than requiring software instructions which take up cycles to do so. Zero-overhead loops are common in digital signal processors and some CISC instruction sets.

References

  1. B.R. Rau and C.D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", In Proceedings of the Fourteenth Annual Workshop on Microprogramming (MICRO-14), December 1981, pages 183-198
  2. M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988 pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
  3. J. Ruttenberg, G.R. Gao, A. Stoutchinin, and W. Lichtenstein, "Software pipelining showdown: optimal vs. heuristic methods in a production compiler", In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, June 1996, pages 1-11. Also published as ACM SIGPLAN Notices 31(5).