HOME

TheInfoList



OR:

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as
space–time tradeoff A space–time trade-off or time–memory trade-off in computer science is a case where an algorithm or program trades increased space usage with decreased time. Here, ''space'' refers to the data storage consumed in performing a given task (RA ...
. The transformation can be undertaken manually by the programmer or by an
optimizing compiler In computing, an optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program. Common requirements are to minimize a program's execution time, memory footprint, storage size, and power cons ...
. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; ''cf.'' Duff's device. The goal of loop unwinding is to increase a program's speed by reducing or eliminating instructions that control the loop, such as
pointer arithmetic In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer ''ref ...
and "end of loop" tests on each iteration; reducing branch penalties; as well as hiding latencies, including the delay in reading data from memory. To eliminate this
computational overhead In computer science, overhead is any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to perform a specific task. It is a special case of engineering overhead. Overhead can be a decidi ...
, loops can be re-written as a repeated sequence of similar independent statements. Loop unrolling is also part of certain
formal verification In the context of hardware and software systems, formal verification is the act of proving or disproving the correctness of intended algorithms underlying a system with respect to a certain formal specification or property, using formal met ...
techniques, in particular bounded
model checking In computer science, model checking or property checking is a method for checking whether a finite-state model of a system meets a given specification (also known as correctness). This is typically associated with hardware or software system ...
.


Advantages

The overhead in "tight" loops often consists of instructions to increment a pointer or index to the next element in an array (
pointer arithmetic In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer ''ref ...
), as well as "end of loop" tests. If an optimizing compiler or assembler is able to pre-calculate offsets to each ''individually referenced'' array variable, these can be built into the
machine code In computer programming, machine code is any low-level programming language, consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Each instruction causes the CPU to perform a ve ...
instructions directly, therefore requiring no additional arithmetic operations at run time. * Significant gains can be realized if the reduction in executed instructions compensates for any performance reduction caused by any increase in the size of the program. * Branch penalty is minimized. * If the statements in the loop are independent of each other (i.e. where statements that occur earlier in the loop do not affect statements that follow them), the statements can potentially be executed in
parallel Parallel is a geometric term of location which may refer to: Computing * Parallel algorithm * Parallel computing * Parallel metaheuristic * Parallel (software), a UNIX utility for running programs in parallel * Parallel Sysplex, a cluster o ...
. * Can be implemented dynamically if the number of array elements is unknown at compile time (as in Duff's device). Optimizing compilers will sometimes perform the unrolling automatically, or upon request.


Disadvantages

* Increased program code size, which can be undesirable, particularly for embedded applications. Can also cause an increase in instruction cache misses, which may adversely affect performance. * Unless performed transparently by an optimizing compiler, the code may become less readable. * If the code in the body of the loop involves function calls, it may not be possible to combine unrolling with
inlining In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function. Inline expansion is similar to macro expansion, but occurs during compilation, without cha ...
, since the increase in code size might be excessive. Thus there can be a trade-off between the two optimizations. * Possible increased register usage in a single iteration to store temporary variables, which may reduce performance, though much will depend on possible optimizations. * Apart from very small and simple code, unrolled loops that contain branches are even slower than recursions.


Static/manual loop unrolling

Manual (or static) loop unrolling involves the programmer analyzing the loop and interpreting the iterations into a sequence of instructions which will reduce the loop overhead. This is in contrast to dynamic unrolling which is accomplished by the compiler.


Simple manual example in C

A procedure in a computer program is to delete 100 items from a collection. This is normally accomplished by means of a ''for''-loop which calls the function ''delete(item_number)''. If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to those for the ''delete(x)'' function, unwinding can be used to speed it up. As a result of this modification, the new program has to make only 20 iterations, instead of 100. Afterwards, only 20% of the jumps and conditional branches need to be taken, and represents, over many iterations, a potentially significant decrease in the loop administration overhead. To produce the optimal benefit, no variables should be specified in the unrolled code that require
pointer arithmetic In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer ''ref ...
. This usually requires " base plus offset" addressing, rather than indexed referencing. On the other hand, this ''manual'' loop unrolling expands the source code size from 3 lines to 7, that have to be produced, checked, and debugged, and the compiler may have to allocate more registers to store variables in the expanded loop iteration . In addition, the loop control variables and number of operations inside the unrolled loop structure have to be chosen carefully so that the result is indeed the same as in the original code (assuming this is a later optimization on already working code). For example, consider the implications if the iteration count were not divisible by 5. The manual amendments required also become somewhat more complicated if the test conditions are variables. See also Duff's device.


Early complexity

In the simple case, the loop control is merely an administrative overhead that arranges the productive statements. The loop itself contributes nothing to the results desired, merely saving the programmer the tedium of replicating the code a hundred times which could have been done by a pre-processor generating the replications, or a text editor. Similarly, if-statements and other flow control statements could be replaced by code replication, except that
code bloat In computer programming, code bloat is the production of program code (source code or machine code) that is perceived as unnecessarily long, slow, or otherwise wasteful of resources. Code bloat can be caused by inadequacies in the programming la ...
can be the result. Computer programs easily track the combinations, but programmers find this repetition boring and make mistakes. Consider: But of course, the code performed need not be the invocation of a procedure, and this next example involves the index variable in computation: which, if compiled, might produce a lot of code (''print'' statements being notorious) but further optimization is possible. This example makes reference only to ''x(i)'' and ''x(i - 1)'' in the loop (the latter only to develop the new value ''x(i)'') therefore, given that there is no later reference to the array ''x'' developed here, its usages could be replaced by a simple variable. Such a change would however mean a simple variable ''whose value is changed'' whereas if staying with the array, the compiler's analysis might note that the array's values are constant, each derived from a previous constant, and therefore carries forward the constant values so that the code becomes print 2, 2; print 3, 6; print 4, 24; ...etc. In general, the content of a loop might be large, involving intricate array indexing. These cases are probably best left to optimizing compilers to unroll. Replicating innermost loops might allow many possible optimisations yet yield only a small gain unless n is large.


Unrolling WHILE loops

Consider a pseudocode WHILE loop similar to the following: In this case, unrolling is faster because the ENDWHILE (a jump to the start of the loop) will be executed 66% less often. Even better, the "tweaked" pseudocode example, that may be performed automatically by some optimizing compilers, eliminating unconditional jumps altogether.


Dynamic unrolling

Since the benefits of loop unrolling are frequently dependent on the size of an array—which may often not be known until run time—
JIT Jit (also known as jiti, jit-jive and the Harare beat) is a style of popular Zimbabwean dance music. It features a swift rhythm played on drums and accompanied by a guitar. Jit evolved out many diverse influences, including domestic chimurenga, ...
compilers (for example) can determine whether to invoke a "standard" loop sequence or instead generate a (relatively short) sequence of individual instructions for each element. This flexibility is one of the advantages of just-in-time techniques versus static or manual optimization in the context of loop unrolling. In this situation, it is often with relatively small values of ''n'' where the savings are still useful—requiring quite small (if any) overall increase in program size (that might be included just once, as part of a standard library).
Assembly language In computer programming, assembly language (or assembler language, or symbolic machine code), often referred to simply as Assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence b ...
programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient
branch table In computer programming, a branch table or jump table is a method of transferring program control ( branching) to another part of a program (or a different program that may have been dynamically loaded) using a table of branch or jump instruction ...
s. Here, the advantage is greatest where the maximum offset of any referenced field in a particular array is less than the maximum offset that can be specified in a machine instruction (which will be flagged by the assembler if exceeded).


Assembler example (IBM/360 or Z/Architecture)

This example is for
IBM/360 The IBM System/360 (S/360) is a family of mainframe computer systems that was announced by IBM on April 7, 1964, and delivered between 1965 and 1978. It was the first family of computers designed to cover both commercial and scientific applic ...
or
Z/Architecture z/Architecture, initially and briefly called ESA Modal Extensions (ESAME), is IBM's 64-bit complex instruction set computer (CISC) instruction set architecture, implemented by its mainframe computers. IBM introduced its first z/Architect ...
assemblers and assumes a field of 100 bytes (at offset zero) is to be copied from array ''FROM'' to array ''TO''—both having 50 entries with element lengths of 256 bytes each. * The return address is in R14. * Initialize registers R15, R0, R1, and R2 from data defined at the end of * the program starting with label INIT/MAXM1. LM R15,R2,INIT Set R15 = maximum number of MVC * instructions (MAXM1 = 16), * R0 = number of entries of array, * R1 = address of 'FROM' array, and * R2 = address of 'TO' array. * * The loop starts here. LOOP EQU * Define LOOP label. * At this point, R15 will always contain the number 16 (MAXM1). SR R15,R0 Subtract the remaining number of * entries in the array (R0) from R15. BNP ALL If R15 is not positive, meaning we * have more than 16 remaining entries * in the array, jump to do the entire * MVC sequence and then repeat. * * Calculate an offset (from start of MVC sequence) for unconditional branch to * the 'unwound' MVC loop below. * If the number of remaining entries in the arrays is zero, R15 will be 16, so * all the MVC instructions will be bypassed. MH R15,=AL2(ILEN) Multiply R15 by the length of one * MVC instruction. B ALL(R15) Jump to ALL+R15, the address of the * calculated specific MVC instruction * with drop through to the rest of them. * * MVC instruction 'table'. * First entry has maximum allowable offset with single register = hexadecimal F00 * (15*256) in this example. * All 16 of the following MVC ('move character') instructions use base-plus-offset * addressing and each to/from offset decreases by the length of one array element * (256). This avoids pointer arithmetic being required for each element up to a * maximum permissible offset within the instruction of hexadecimal FFF * (15*256+255). The instructions are in order of decreasing offset, so the last * element in the set is moved first. ALL MVC 15*256(100,R2),15*256(R1) Move 100 bytes of 16th entry from * array 1 to array 2 (with * drop-through). ILEN EQU *-ALL Set ILEN to the length of the previous * MVC instruction. MVC 14*256(100,R2),14*256(R1) Move 100 bytes of 15th entry. MVC 13*256(100,R2),13*256(R1) Move 100 bytes of 14th entry. MVC 12*256(100,R2),12*256(R1) Move 100 bytes of 13th entry. MVC 11*256(100,R2),11*256(R1) Move 100 bytes of 12th entry. MVC 10*256(100,R2),10*256(R1) Move 100 bytes of 11th entry. MVC 09*256(100,R2),09*256(R1) Move 100 bytes of 10th entry. MVC 08*256(100,R2),08*256(R1) Move 100 bytes of 9th entry. MVC 07*256(100,R2),07*256(R1) Move 100 bytes of 8th entry. MVC 06*256(100,R2),06*256(R1) Move 100 bytes of 7th entry. MVC 05*256(100,R2),05*256(R1) Move 100 bytes of 6th entry. MVC 04*256(100,R2),04*256(R1) Move 100 bytes of 5th entry. MVC 03*256(100,R2),03*256(R1) Move 100 bytes of 4th entry. MVC 02*256(100,R2),02*256(R1) Move 100 bytes of 3rd entry. MVC 01*256(100,R2),01*256(R1) Move 100 bytes of 2nd entry. MVC 00*256(100,R2),00*256(R1) Move 100 bytes of 1st entry. * S R0,MAXM1 Reduce the number of remaining entries * to process. BNPR R14 If no more entries to process, return * to address in R14. AH R1,=AL2(16*256) Increment 'FROM' array pointer beyond * first set. AH R2,=AL2(16*256) Increment 'TO' array pointer beyond * first set. L R15,MAXM1 Reload the maximum number of MVC * instructions per batch into R15 * (destroyed by the calculation in the * first instruction of the loop). B LOOP Execute loop again. * * Static constants and variables (these could be passed as parameters, except * MAXM1). INIT DS 0A 4 addresses (pointers) to be * pre-loaded with the 'LM' instruction * in the beginning of the program. MAXM1 DC A(16) Maximum number of MVC instructions * executed per batch. N DC A(50) Number of actual entries in array (a * variable, set elsewhere). DC A(FROM) Address of start of array 1 * ("pointer"). DC A(TO) Address of start of array 2 * ("pointer"). * * Static arrays (these could be dynamically acquired). FROM DS 50CL256 Array of 50 entries of 256 bytes each. TO DS 50CL256 Array of 50 entries of 256 bytes each. In this example, approximately 202 instructions would be required with a "conventional" loop (50 iterations), whereas the above dynamic code would require only about 89 instructions (or a saving of approximately 56%). If the array had consisted of only two entries, it would still execute in approximately the same time as the original unwound loop. The increase in
code In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communicati ...
size is only about 108 bytes even if there are thousands of entries in the array. Similar techniques can of course be used where multiple instructions are involved, as long as the combined instruction length is adjusted accordingly. For example, in this same example, if it is required to clear the rest of each array entry to nulls immediately after the 100 byte field copied, an additional clear instruction, XC xx*256+100(156,R1),xx*256+100(R2), can be added immediately after every MVC in the sequence (where xx matches the value in the MVC above it). It is, of course, perfectly possible to generate the above code "inline" using a single assembler macro statement, specifying just four or five operands (or alternatively, make it into a library subroutine, accessed by a simple call, passing a list of parameters), making the optimization readily accessible.


C example

The following example demonstrates dynamic loop unrolling for a simple program written in C. Unlike the assembler example above, pointer/index arithmetic is still generated by the compiler in this example because a variable (i) is still used to address the array element. Full optimization is only possible if absolute indexes are used in the replacement statements. #include /* The number of entries processed per loop iteration. */ /* Note that this number is a 'constant constant' reflecting the code below. */ #define BUNCHSIZE (8) int main(void) Code duplication could be avoided by writing the two parts together as in Duff's device.


C to MIPS assembly language loop unrolling example

The following example will compute a
dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an alg ...
of two 100-entry vectors A and B of type double. Here is the code in C: double dotProduct = 0; for (int i = 0; i < 100; i++)


Converting to MIPS assembly language

The following is MIPS assembly code that will compute the dot product of two 100-entry vectors, A and B, before implementing loop unrolling. The code below omits the loop initializations: * Initialize loop count ($7) to 100. * Initialize dot product ($f10) to 0. * Initialize A /code> pointer ($5) to the base address of A. * Initialize B /code> pointer ($6) to the base address of B. Note that the size of one element of the arrays (a double) is 8 bytes. loop3: l.d $f10, 0($5) ; $f10 ← A l.d $f12, 0($6) ; $f12 ← B mul.d $f10, $f10, $f12 ; $f10 ← A B add.d $f8, $f8, $f10 ; $f8 ← $f8 + A B addi $5, $5, 8 ; increment pointer for A by the size ; of a double. addi $6, $6, 8 ; increment pointer for B by the size ; of a double. addi $7, $7, -1 ; decrement loop count test: bgtz $7, loop3 ; Continue if loop count > 0


Unrolling the Loop in MIPS

The following is the same as above, but with loop unrolling implemented at a factor of 4. Note again that the size of one element of the arrays (a double) is 8 bytes; thus the 0, 8, 16, 24 displacements and the 32 displacement on each loop. loop3: l.d $f10, 0($5) ; iteration with displacement 0 l.d $f12, 0($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 8($5) ; iteration with displacement 8 l.d $f12, 8($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 16($5) ; iteration with displacement 16 l.d $f12, 16($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 24($5) ; iteration with displacement 24 l.d $f12, 24($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 addi $5, $5, 32 addi $6, $6, 32 addi $7, $7, -4 test: bgtz $7, loop3 ; Continue loop if $7 > 0


See also

*
Don't repeat yourself "Don't repeat yourself" (DRY) is a principle of software development aimed at reducing repetition of software patterns, replacing it with abstractions or using data normalization to avoid redundancy. The DRY principle is stated as "Every piece of ...
* Duff's device * Instruction level parallelism *
Just-in-time compilation In computing, just-in-time (JIT) compilation (also dynamic translation or run-time compilations) is a way of executing computer code that involves compilation during execution of a program (at run time) rather than before execution. This may co ...
* Loop fusion * Loop splitting *
Parallel computing Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different f ...


References


Further reading

*


External links


Chapter 7, pages 8 to 10
of
Michael Abrash Michael Abrash is a programmer and technical writer specializing in code optimization and 80x86 assembly language. He wrote the 1990 book ''Zen of Assembly Language Volume 1: Knowledge'' and a monthly column in '' Dr. Dobb's Journal'' in the ea ...
's ''Graphics Programming Black Book'' is about loop unrolling, with an example in x86 assembly.
Generalized Loop Unrolling
gives a concise introduction.
Optimizing subroutines in assembly language
Agner Fog's optimizations handbook with the ''loop unrolling'' technique (2012). {{Compiler optimizations Compiler optimizations Articles with example C code Parallel computing