Duff's Device

In the C programming language, Duff's device is a way of manually implementing loop unrolling by interleaving two syntactic constructs of C: the do-while loop and a switch statement. Its discovery is credited to Tom Duff in November 1983, when Duff was working for Lucasfilm and used it to speed up a real-time animation program.

Loop unrolling attempts to reduce the overhead of the conditional branching needed to check whether a loop is done, by executing a batch of loop bodies per iteration. To handle cases where the number of iterations is not divisible by the unrolled-loop increment, a common technique among assembly language programmers is to jump directly into the middle of the unrolled loop body to handle the remainder. Duff implemented this technique in C by using C's case label fall-through feature to jump into the unrolled body.


Original version

Duff's problem was to copy 16-bit unsigned integers ("shorts" in most C implementations) from an array into a memory-mapped output register, denoted in C by a pointer. His original code was written in pre-ANSI (K&R) C, with the function header

    send(to, from, count) register short *to, *from; register count;

and a body consisting of a single do-while loop that executes *to = *from++ and repeats while --count > 0. This code assumes that initially count > 0. Since the output location is a memory-mapped register, the pointer to is not incremented as would be required for a memory-to-memory copy.

If count were always divisible by eight, unrolling this loop eight-fold would keep the same header, set a counter n = count / 8, and place eight copies of *to = *from++ in the loop body, repeating while --n > 0.

Duff realized that to handle cases where count is not divisible by eight, the assembly programmer's technique of jumping into the loop body could be implemented by interlacing the structures of a switch statement and a loop, putting the switch's case labels at the points of the loop body that correspond to the remainder of count / 8. In this version n = (count + 7) / 8, and a switch on count % 8 encloses the do-while loop: case 0 labels the first of the eight copy statements, and case 7 down to case 1 label the remaining seven in that order.

Duff's device can be applied with any size for the unrolled loop, not just eight as in the example above.
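For readers who want to try the construct with a current compiler, the following is a minimal sketch of the device in ISO C. It is not Duff's verbatim K&R listing: the name duff_send, the volatile qualifier, the returned pointer, and the assertion are assumptions added for illustration. The target pointer stands in for the memory-mapped register and is deliberately never incremented.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: duff_send is a hypothetical name, not Duff's original send().
   Writes `count` shorts from `from` to the single location `*to`, which
   stands in for a memory-mapped output register, and returns the pointer
   just past the last word consumed. Requires count > 0. */
const short *duff_send(volatile short *to, const short *from, size_t count)
{
    size_t n = (count + 7) / 8;   /* number of passes through the do-while */

    assert(count > 0);            /* the device itself never checks this */

    switch (count % 8) {          /* jump into the loop body at the remainder */
    case 0: do { *to = *from++;   /* a multiple of 8 starts with a full pass */
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
    return from;
}
```

Returning the advanced source pointer is a design choice made here so that a caller can confirm exactly count words were consumed; the original function returned nothing. Compilers that warn about implicit fall-through are likely to flag every case label in this function.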


Mechanism

Based on an algorithm used widely by programmers coding in assembly for minimizing the number of tests and branches during a copy, Duff's device appears out of place when implemented in C. The device is valid C by virtue of two attributes of the language:

1. Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language, which requires only that the body of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of copies from sequential source addresses to the memory-mapped output port.
2. The ability to jump into the middle of a loop in C.

This leads to what the Jargon File calls "the most dramatic use yet seen of fall through in C". C's default fall-through in case statements has long been one of its most controversial features; Duff himself said that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."

Although valid in C, Duff's device goes against common C guidelines, such as the MISRA guidelines. Some compilers (e.g. CompCert) are restricted to such guidelines and thus reject Duff's device unless specifically instructed otherwise.
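The two language attributes can be seen in isolation in a small sketch. The function names steps_from and enter_mid_loop are hypothetical and exist only for this illustration: the first shows plain fall-through between case labels, and the second places a case label inside a do-while body so that the switch jumps into the middle of the loop.

```c
#include <assert.h>

/* Fall-through: with no break, control runs from the matching label
   through every statement that follows it inside the switch. */
int steps_from(int entry)
{
    int steps = 0;
    switch (entry) {
    case 3: steps++;   /* falls through to case 2 */
    case 2: steps++;   /* falls through to case 1 */
    case 1: steps++;
    }
    return steps;      /* no matching label and no default: 0 steps */
}

/* Jumping into a loop: `case 1` labels a statement inside the do-while
   body, so entry 1 skips the first statement on the first pass only. */
int enter_mid_loop(int entry, int passes)
{
    int steps = 0;
    assert(passes > 0);
    switch (entry) {
    case 0: do { steps++;
    case 1:      steps++;
            } while (--passes > 0);
    }
    return steps;
}
```

With two passes, entering at case 0 executes four increments, while entering at case 1 executes three: one on the shortened first pass and two on the second.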


Simplified explanation

The basic idea of loop unrolling is that the number of instructions executed in a loop can be reduced by reducing the number of loop tests, sometimes reducing the amount of time spent in the loop. For example, in a loop whose body is a single instruction, the loop test is typically performed on every iteration, that is, every time the instruction is executed. If, instead, eight copies of the same instruction are placed in the loop, the test is performed only once per eight instructions, and this may gain time by avoiding seven tests. However, this only handles a multiple of eight iterations, requiring something else to handle any remainder.

Duff's device provides a solution by first performing the remainder of the iterations and then repeating the group of eight instructions as many times as necessary. To determine the number of remainder iterations, the code first calculates the total number of iterations modulo eight. According to this remainder, program execution jumps to the case label that is followed by exactly the number of instructions needed. Once this is done, everything is straightforward: the code continues with groups of eight instructions, which is possible because the number of iterations still to be done is now a multiple of eight.

Duff's device provides a compact form of loop unrolling by using the case keyword both inside and outside the loop. This is unusual because the contents of a case are traditionally thought of as a block of code nested inside that case, and a reader would typically expect it to end before the next case label. According to the specification of the C language, this is not necessary; case labels can appear anywhere inside the switch code block, and at any depth. Program execution simply jumps to the labelled statement, wherever it may be.
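The arithmetic in this explanation can be checked with a small helper. This is a sketch under stated assumptions: the type duff_plan and the function duff_plan_for are hypothetical names introduced here, and the unroll factor is fixed at eight as in the example.

```c
#include <assert.h>

/* Hypothetical helper type: describes how an 8-way device handles `count`. */
struct duff_plan {
    unsigned entry_case;   /* case label jumped to: count % 8 */
    unsigned first_pass;   /* instructions executed in the first pass */
    unsigned passes;       /* total passes through the loop, first included */
};

struct duff_plan duff_plan_for(unsigned count)
{
    struct duff_plan p;
    assert(count > 0);                    /* same precondition as the device */
    p.entry_case = count % 8;
    /* Remainder 0 enters at the top of the loop, so the first pass is full. */
    p.first_pass = (p.entry_case == 0) ? 8 : p.entry_case;
    p.passes = (count + 7) / 8;           /* rounds count / 8 upward */
    return p;
}
```

For count = 20 the plan is entry at case 4, four instructions in the first pass, and three passes in total, giving 4 + 8 + 8 = 20. For count = 16 the entry is case 0 and both passes are full.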


Performance

Many compilers will optimize the switch into a branch table just as would be done in an assembly implementation. The primary increase in speed versus a simple, straightforward loop comes from loop unwinding that reduces the number of performed branches, which are computationally expensive due to the need to flush, and hence stall, the instruction pipeline. The switch statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, eight short moves are unrolled, so the switch handles an extra 1 to 7 shorts automatically).

This automatic handling of the remainder may not be the best solution on all systems and compilers; in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures (James Ralston's USENIX 2003 Journal). When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable.

Therefore, before applying any program optimization, it should be benchmarked or its compiled output should be examined, to verify that it performs as expected on the target architecture, optimization level, and compiler. Additionally, consider the risk that the optimized code will later be deployed on platforms where it is no longer the fastest option.

For the purpose of memory-to-memory copies (which, as mentioned above, was not the original use of Duff's device), the standard C library provides the function memcpy; it will not perform worse than a memory-to-memory copy version of this code, and may contain architecture-specific optimizations that make it significantly faster.
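The two alternatives mentioned in this section can be sketched for the memory-to-memory case. The function names copy_two_loops and copy_memcpy are hypothetical, and nothing here is a claim about which version is faster on a given compiler or machine; that still has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two-loop alternative: an unrolled main loop plus a remainder loop.
   Unlike the device, both pointers advance, and count == 0 is handled. */
void copy_two_loops(short *to, const short *from, size_t count)
{
    size_t blocks = count / 8;   /* full groups of eight */
    size_t rest = count % 8;     /* 0 to 7 leftover elements */

    assert(to != NULL && from != NULL);

    while (blocks-- > 0) {
        to[0] = from[0]; to[1] = from[1]; to[2] = from[2]; to[3] = from[3];
        to[4] = from[4]; to[5] = from[5]; to[6] = from[6]; to[7] = from[7];
        to += 8;
        from += 8;
    }
    while (rest-- > 0)
        *to++ = *from++;
}

/* Library route: memcpy takes a byte count, so scale by the element size.
   The source and destination must not overlap. */
void copy_memcpy(short *to, const short *from, size_t count)
{
    assert(to != NULL && from != NULL);
    memcpy(to, from, count * sizeof *to);
}
```

Neither function is suitable for the original memory-mapped register case, where every write must go to the same address.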


See also

* Computed GOTO
* Coroutine: Duff's device can be used to implement coroutines in C/C++ (see Tatham external link)
* Jensen's device


External links


* Description and original mail by Duff at Lysator
* Simon Tatham's Coroutines in C utilizes the same switch/case trick
* Adam Dunkels' Protothreads - Lightweight, Stackless Threads in C also uses nested switch/case statements (see also The lightest lightweight threads, Protothreads)
* A related technique: intertwined switch/case and if/else statements