Duff's device

Last updated

In the C programming language, Duff's device is a way of manually implementing loop unrolling by interleaving two syntactic constructs of C: the do-while loop and a switch statement. Its discovery is credited to Tom Duff in November 1983, when Duff was working for Lucasfilm and used it to speed up a real-time animation program.

Contents

Loop unrolling attempts to reduce the overhead of conditional branching needed to check whether a loop is done, by executing a batch of loop bodies per iteration. To handle cases where the number of iterations is not divisible by the unrolled-loop increments, a common technique among assembly language programmers is to jump directly into the middle of the unrolled loop body to handle the remainder. [1] Duff implemented this technique in C by using C's case label fall-through feature to jump into the unrolled body. [2]

Original version

Duff's problem was to copy 16-bit unsigned integers ("shorts" in most C implementations) from an array into a memory-mapped output register, denoted in C by a pointer. His original code, in C, looked as follows: [3] [4]

send(to,from,count)registershort*to,*from;registercount;{do{/* count > 0 assumed */*to=*from++;}while(--count>0);}

This code assumes that initial count > 0. Since the output location is a memory-mapped register, the pointer to is not incremented as would be required for a memory-to-memory copy.

If count were always divisible by eight, unrolling this loop eight-fold would produce the following:

send(to,from,count)registershort*to,*from;registercount;{registern=count/8;do{*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;}while(--n>0);}

Duff realized that to handle cases where count is not divisible by eight, the assembly programmer's technique of jumping into the loop body could be implemented by interlacing the structures of a switch statement and a loop, putting the switch's case labels at the points of the loop body that correspond to the remainder of count/8: [1]

send(to,from,count)registershort*to,*from;registercount;{registern=(count+7)/8;switch(count%8){case0:do{*to=*from++;case7:*to=*from++;case6:*to=*from++;case5:*to=*from++;case4:*to=*from++;case3:*to=*from++;case2:*to=*from++;case1:*to=*from++;}while(--n>0);}}

Duff's device can similarly be applied with any other size for the unrolled loop, not just eight as in the example above.

Mechanism

Based on an algorithm used widely by programmers coding in assembly for minimizing the number of tests and branches during a copy, Duff's device appears out of place when implemented in C. The device is valid C by virtue of two attributes in C:

  1. Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the body of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.
  2. The ability to jump into the middle of a loop in C.

This leads to what the Jargon File calls "the most dramatic use yet seen of fall through in C". [5] C's default fall-through in case statements has long been one of its most controversial features; Duff himself said that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against." [5]

Although valid in C, Duff's device goes against common C guidelines, such as the MISRA guidelines. Some compilers (e.g. CompCert) are restricted to such guidelines and thus reject Duff's device unless specifically instructed otherwise.

Simplified explanation

A functionally equivalent version
with switch and while disentangled
send(to,from,count)registershort*to,*from;registercount;{registern=(count+7)/8;switch(count%8){case0:*to=*from++;case7:*to=*from++;case6:*to=*from++;case5:*to=*from++;case4:*to=*from++;case3:*to=*from++;case2:*to=*from++;case1:*to=*from++;}while(--n>0){*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;*to=*from++;}}

The basic idea of loop unrolling is that the number of instructions executed in a loop can be reduced by reducing the number of loop tests, sometimes reducing the amount of time spent in the loop. For example, in the case of a loop with only a single instruction in the block code, the loop test will typically be performed for every iteration of the loop, that is every time the instruction is executed. If, instead, eight copies of the same instruction are placed in the loop, then the test will be performed only every eight iterations, and this may gain time by avoiding seven tests. However, this only handles a multiple of eight iterations, requiring something else to handle any remainder of iterations. [1]

Duff's device provides a solution by first performing the remainder of iterations, followed by iterating as many times as necessary the multiple of eight similar instructions. To determine the number of remainder iterations, the code first calculates the total number of iterations modulo eight. According to this remainder, the program execution will then jump to a case statement followed by exactly the number of iterations needed. Once this is done, everything is straightforward: the code continues by doing iterations of groups of eight instructions; this has become possible since the remaining number of iterations is a multiple of eight. [1]

Duff's device provides a compact loop unrolling by using the case keyword both inside and outside the loop. This is unusual because the contents of a case statement are traditionally thought of as a block of code nested inside the case statement, and a reader would typically expect it to end before the next case statement. According to the specifications of C language, this is not necessary; indeed, case statements can appear anywhere inside the switch code block, and at any depth; the program execution will simply jump to the next statement, wherever it may be.

Performance

Many compilers will optimize the switch into a branch table just as would be done in an assembly implementation.

The primary increase in speed versus a simple, straightforward loop, comes from loop unwinding that reduces the number of performed branches, which are computationally expensive due to the need to flushand hence stallthe instruction pipeline. The switch statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, eight short moves are unrolled, so the switch handles an extra 17 shorts automatically).

This automatic handling of the remainder may not be the best solution on all systems and compilers  in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures. [6] When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable. [7] Therefore, when considering this code as a program optimization, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler, as well as weighing the risk that the optimized code will later be used on different platforms where it is not the fastest code.

For the purpose of memory-to-memory copies (which, as mentioned above, was not the original use of Duff's device), the standard C library provides the function memcpy ; [8] it will not perform worse than a memory-to-memory copy version of this code, and may contain architecture-specific optimizations that make it significantly faster. [9] [10]

See also

Related Research Articles

<span class="mw-page-title-main">PDP-8</span> Minicomputer product line

The PDP-8 is a family of 12-bit minicomputers that was produced by Digital Equipment Corporation (DEC). It was the first commercially successful minicomputer, with over 50,000 units being sold over the model's lifetime. Its basic design follows the pioneering LINC but has a smaller instruction set, which is an expanded version of the PDP-5 instruction set. Similar machines from DEC are the PDP-12 which is a modernized version of the PDP-8 and LINC concepts, and the PDP-14 industrial controller system.

Structured programming is a programming paradigm aimed at improving the clarity, quality, and development time of a computer program by making extensive use of the structured control flow constructs of selection (if/then/else) and repetition, block structures, and subroutines.

An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as minimizing program execution time, memory use, storage size, and power consumption.

In computer science, control flow is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. The emphasis on explicit control flow distinguishes an imperative programming language from a declarative programming language.

Coroutines are computer program components that allow execution to be suspended and resumed, generalizing subroutines for cooperative multitasking. Coroutines are well-suited for implementing familiar program components such as cooperative tasks, exceptions, event loops, iterators, infinite lists and pipes.

In computer science, a generator is a routine that can be used to control the iteration behaviour of a loop. All generators are also iterators. A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.

In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations for the purpose of locality optimization or parallelization or another loop overhead reduction of the loop nests. One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common linear algebra algorithms.

A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order. Branch may also refer to the act of switching execution to a different instruction sequence as a result of executing a branch instruction. Branch instructions are used to implement control flow in program loops and conditionals.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.

In computer science, a tail call is a subroutine call performed as the final action of a procedure. If the target of a tail is the same subroutine, the subroutine is said to be tail recursive, which is a special case of direct recursion. Tail recursion is particularly useful, and is often easy to optimize in implementations.

KDF9 was an early British 48-bit computer designed and built by English Electric. The first machine came into service in 1964 and the last of 29 machines was decommissioned in 1980 at the National Physical Laboratory. The KDF9 was designed for, and used almost entirely in, the mathematical and scientific processing fields – in 1967, nine were in use in UK universities and technical colleges. The KDF8, developed in parallel, was aimed at commercial processing workloads.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

A protothread is a low-overhead mechanism for concurrent programming.

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

In computer programming, a branch table or jump table is a method of transferring program control (branching) to another part of a program using a table of branch or jump instructions. It is a form of multiway branch. The branch table construction is commonly used when programming in assembly language but may also be generated by compilers, especially when implementing optimized switch statements whose values are densely packed together.

The PDP-11 architecture is a 16-bit CISC instruction set architecture (ISA) developed by Digital Equipment Corporation (DEC). It is implemented by central processing units (CPUs) and microprocessors used in PDP-11 minicomputers. It was in wide use during the 1970s, but was eventually overshadowed by the more powerful VAX architecture in the 1980s.

<span class="mw-page-title-main">Goto</span> One-way control statement in computer programming

Goto is a statement found in many computer programming languages. It performs a one-way transfer of control to another line of code; in contrast a function call normally returns control. The jumped-to locations are usually identified using labels, though some languages use line numbers. At the machine code level, a goto is a form of branch or jump statement, in some cases combined with a stack adjustment. Many languages support the goto statement, and many do not.

Zero-overhead looping is a feature of some processor instruction sets whose hardware can repeat the body of a loop automatically, rather than requiring software instructions which take up cycles to do so. Zero-overhead loops are common in digital signal processors and some CISC instruction sets.

<span class="mw-page-title-main">WD16</span> Microprocessor produced by Western Digital

The WD16 is a 16-bit microprocessor introduced by Western Digital in October 1976. It is based on the MCP-1600 chipset, a general-purpose design that was also used to implement the DEC LSI-11 low-end minicomputer and the Pascal MicroEngine processor. The three systems differed primarily in their microcode, giving each system a unique instruction set architecture (ISA).

References

  1. 1 2 3 4 Holly, Ralf (August 1, 2005). "A Reusable Duff Device". Dr. Dobb's. Dr. Dobb's Journal . Retrieved September 18, 2015.
  2. Duff, Tom (August 29, 1988). "Subject: Re: Explanation, please!". Lysator . Retrieved November 3, 2015.
  3. "The amazing Duff's Device by Tom Duff!". doc.cat-v.org. Retrieved 2017-06-08.
  4. Cox, Russ (2008-01-30). "research!rsc: On Duff's Device and Coroutines". research.swtch.com. Retrieved 2017-01-24.
  5. 1 2 The Jargon File
  6. James Ralston's USENIX 2003 Journal [ permanent dead link ]
  7. Tso, Ted (August 22, 2000). "Re: [PATCH] Re: Move of input drivers, some word needed from you". lkml.indiana.edu. Linux kernel mailing list . Retrieved August 22, 2014. Jim Gettys has a wonderful explanation of this effect in the X server. It turns out that with branch predictions and the relative speed of CPU vs. memory changing over the past decade, loop unrolling is pretty much pointless. In fact, by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot, because the elimination of all that excess code meant that the X server wasn't thrashing the cache lines as much.
  8. "memcpy - cppreference.com". En.cppreference.com. Retrieved 2014-03-06.
  9. Wall, Mike (2002-03-19). "Using Block Prefetch for Optimized Memory Performance" (PDF). mit.edu. Archived from the original (PDF) on 2017-08-30. Retrieved 2012-09-22.
  10. Fog, Agner (2012-02-29). "Optimizing subroutines in assembly language" (PDF). Copenhagen University College of Engineering. pp. 100 ff. Retrieved 2012-09-22.

Further reading