Loop nest optimization

Last updated August 30, 2024

In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations for the purpose of locality optimization or parallelization or another loop overhead reduction of the loop nests. (Nested loops occur when one loop is inside of another loop.) One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common linear algebra algorithms.

Overview

Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused. The partitioning of loop iteration space leads to partitioning of a large array into smaller blocks, thus fitting accessed array elements into cache size, enhancing cache reuse and eliminating cache size requirements.

An ordinary loop

for(i=0;i<N;++i){...}

can be blocked with a block size B by replacing it with

for(j=0;j<N;j+=B){for(i=j;i<min(N,j+B);++i){....}}

where min() is a function returning the minimum of its arguments.

Example: matrix-vector multiplication

The following is an example of matrix vector multiplication. There are three arrays, each with 100 elements. The code does not partition the arrays into smaller sizes.

inti,j,a[100][100],b[100],c[100];intn=100;for(i=0;i<n;i++){c[i]=0;for(j=0;j<n;j++){c[i]=c[i]+a[i][j]*b[j];}}

After loop tiling is applied using 2 * 2 blocks, the code looks like:

inti,j,x,y,a[100][100],b[100],c[100];intn=100;for(i=0;i<n;i+=2){c[i]=0;c[i+1]=0;for(j=0;j<n;j+=2){for(x=i;x<min(i+2,n);x++){for(y=j;y<min(j+2,n);y++){c[x]=c[x]+a[x][y]*b[y];}}}}

The original loop iteration space is n by n. The accessed chunk of array a[i, j] is also n by n. When n is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example, i = 1, j = 1 to n) may cross cache lines, causing cache misses.

Tiling size

It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests (loop interchange) also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast, cache-oblivious algorithms are designed to make efficient use of cache without explicit blocking.

Example: matrix multiplication

Many large mathematical operations on computers end up spending much of their time doing matrix multiplication. The operation is:

C = A×B

where A, B, and C are N×N arrays. Subscripts, for the following description, are in form C[row][column].

The basic loop is:

inti,j,k;for(i=0;i<N;++i){for(j=0;j<N;++j){C[i][j]=0;for(k=0;k<N;++k)C[i][j]+=A[i][k]*B[k][j];}}

There are three problems to solve:

Floating point additions take some number of cycles to complete. In order to keep an adder with multiple cycle latency busy, the code must update multiple accumulators in parallel.
Machines can typically do just one memory operation per multiply-add, so values loaded must be reused at least twice.
Typical PC memory systems can only sustain one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.

The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply–adds, thus solving problem #2. By carrying four accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem. (Nor does it address the cleanup work necessary when N is odd. Such details will be left out of the following discussion.)

for(i=0;i<N;i+=2){for(j=0;j<N;j+=2){acc00=acc01=acc10=acc11=0;for(k=0;k<N;k++){acc00+=B[k][j+0]*A[i+0][k];acc01+=B[k][j+1]*A[i+0][k];acc10+=B[k][j+0]*A[i+1][k];acc11+=B[k][j+1]*A[i+1][k];}C[i+0][j+0]=acc00;C[i+0][j+1]=acc01;C[i+1][j+0]=acc10;C[i+1][j+1]=acc11;}}

This code has had both the i and j iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.

This code would run quite acceptably on a Cray Y-MP (built in the early 1980s), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!

A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers. A 3x3 block requires 13, which will not work on a machine with just 8 floating point registers in the ISA. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.

Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This register pressure is why vendors of RISC CPUs, who intended to build machines more parallel than the general purpose x86 and 68000 CPUs, adopted 32-entry floating-point register files.

The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded, and the entire matrix B is loaded. For the entire calculation, C is stored once (that's good), A is loaded into the cache once (assuming a stripe of A fits in the cache with a stripe of B), but B is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N³/ib doubleword loads from main memory. In the code above, ib is 2.

The next step to reduce the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by streams. In the case of one particular 2.8 GHz Pentium 4 system used for this example, the balance number is 16.5. The second code example above cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is blocked over i. (Technically, this is actually the second time i is blocked, as the first time was the factor of 2.)

for(ii=0;ii<N;ii+=ib){for(j=0;j<N;j+=2){for(i=ii;i<ii+ib;i+=2){acc00=acc01=acc10=acc11=0;for(k=0;k<N;k++){acc00+=B[k][j+0]*A[i+0][k];acc01+=B[k][j+1]*A[i+0][k];acc10+=B[k][j+0]*A[i+1][k];acc11+=B[k][j+1]*A[i+1][k];}C[i+0][j+0]=acc00;C[i+0][j+1]=acc01;C[i+1][j+0]=acc10;C[i+1][j+1]=acc11;}}}

With this code, ib can be set to any desired parameter, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: N×ib slices of the A matrix are being kept in the cache. As long as that fits, this code will not be limited by the memory system.

So what size matrix fits? The example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib=20, the slice of the A matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, another trick is needed.

That trick is reducing the size of the stripe of the B matrix by blocking the k loop so that the stripe is of size ib × kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of ⁠ $2*N^{3}/kb$ ⁠ memory transfers. B is still transferred N/ib times, for ⁠ $N^{3}/ib$ ⁠ transfers. So long as

2*N/kb + N/ib < N/balance

the machine's memory system will keep up with the floating point unit and the code will run at maximum performance. The 16KB cache of the Pentium 4 is not quite big enough: if ib=24 and kb=64 were chosen instead, 12KB of the cache would be used—avoiding completely filling it, which is desirable so the C and B arrays have to have some room to flow through. These numbers come within 20% of the peak floating-point speed of the processor.

Here is the code with loop k blocked.

for(ii=0;ii<N;ii+=ib){for(kk=0;kk<N;kk+=kb){for(j=0;j<N;j+=2){for(i=ii;i<ii+ib;i+=2){if(kk==0)acc00=acc01=acc10=acc11=0;else{acc00=C[i+0][j+0];acc01=C[i+0][j+1];acc10=C[i+1][j+0];acc11=C[i+1][j+1];}for(k=kk;k<kk+kb;k++){acc00+=B[k][j+0]*A[i+0][k];acc01+=B[k][j+1]*A[i+0][k];acc10+=B[k][j+0]*A[i+1][k];acc11+=B[k][j+1]*A[i+1][k];}C[i+0][j+0]=acc00;C[i+0][j+1]=acc01;C[i+1][j+0]=acc10;C[i+1][j+1]=acc11;}}}}

The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably split the kk == 0 iteration off from the rest of the kk iterations, to remove the if statement from the i loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.

The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has 256KB (or more, depending on the model) high-bandwidth level-2 cache as well as the level-1 cache. There is a choice:

Adjust the block sizes for the level-2 cache. This will stress the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2 cache.
Block the loops again, again for the level-2 cache sizes. With a total of three levels of blocking (for the register file, for the L1 cache, and the L2 cache), the code will minimize the required bandwidth at each level of the memory hierarchy. Unfortunately, the extra levels of blocking will incur still more loop overhead, which for some problem sizes on some hardware may be more time-consuming than any shortcomings in the hardware's ability to stream data from the L2 cache.

Rather than specifically tune for one particular cache size, as in the first example, a cache-oblivious algorithm is designed to take advantage of any available cache, no matter what its size is. This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for matrix multiplication are known.

Related Research Articles

In computer science, an array is a data structure consisting of a collection of elements, of same memory size, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as minimizing program execution time, memory use, storage size, and power consumption.

In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.

In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. There is no strict definition regarding the proportion of zero-value elements for a matrix to qualify as sparse but a common criterion is that the number of non-zero elements is roughly equal to the number of rows or columns. By contrast, if most of the elements are non-zero, the matrix is considered dense. The number of zero-valued elements divided by the total number of elements is sometimes referred to as the sparsity of the matrix.

BASIC-PLUS is an extended dialect of the BASIC programming language that was developed by Digital Equipment Corporation (DEC) for use on its RSTS/E time-sharing operating system for the PDP-11 series of 16-bit minicomputers in the early 1970s through the 1980s.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.

A bit array is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.

In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.

In compiler theory, loop interchange is the process of exchanging the order of two iteration variables used by a nested loop. The variable used in the inner loop switches to the outer loop, and vice versa. It is often done to ensure that the elements of a multi-dimensional array are accessed in the order in which they are present in memory, improving locality of reference.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize $over unconstrained values of the real-vector where is a differentiable scalar function.$

Automatically Tuned Linear Algebra Software (ATLAS) is a software library for linear algebra. It provides a mature open source implementation of BLAS APIs for C and FORTRAN 77.

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

In-place matrix transposition, also called in-situ matrix transposition, is the problem of transposing an N×M matrix in-place in computer memory, ideally with O(1) (bounded) additional storage, or at most with additional storage much less than NM. Typically, the matrix is assumed to be stored in row-major or column-major order.

Iterative Stencil Loops (ISLs) or Stencil computations are a class of numerical data processing solution which update array elements according to some fixed pattern, called a stencil. They are most commonly found in computer simulations, e.g. for computational fluid dynamics in the context of scientific and engineering applications. Other notable examples include solving partial differential equations, the Jacobi kernel, the Gauss–Seidel method, image processing and cellular automata. The regular structure of the arrays sets stencil techniques apart from other modeling methods such as the Finite element method. Most finite difference codes which operate on regular grids can be formulated as ISLs.

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.

Computer software is said to exhibit scalable locality if it can continue to make use of processors that out-pace their memory systems, to solve ever larger problems. This term is a high-performance uniprocessor analog of the use of scalable parallelism to refer to software for which increasing numbers of processors can be employed for larger problems.

References

↑ Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation . Morgan Kaufmann. ISBN 978-1-55860-320-2. tiling.
↑ João M.P. Cardoso; Pedro C. Diniz (2 April 2011). Compilation Techniques for Reconfigurable Architectures. Springer Science & Business Media. ISBN 978-0-387-09671-1.

External links

Streams benchmark results, showing the overall balance between floating point operations and memory operations for many different computers
"CHiLL: Composable High-Level Loop Transformation Framework" ^{[ permanent dead link ]}

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[MuchnickAssociates1997-1] Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation . Morgan Kaufmann. ISBN 978-1-55860-320-2. tiling.

[CardosoDiniz2011-2] João M.P. Cardoso; Pedro C. Diniz (2 April 2011). Compilation Techniques for Reconfigurable Architectures. Springer Science & Business Media. ISBN 978-0-387-09671-1.

[1]

[2]

v t e Compiler optimizations
Basic block	Peephole optimization Local value numbering
Loop	Automatic parallelization Automatic vectorization Induction variable Loop fusion Loop-invariant code motion Loop inversion Loop interchange Loop nest optimization Loop splitting Loop unrolling Loop unswitching Software pipelining Strength reduction
Data-flow analysis	Available expression Common subexpression elimination Constant folding Dead store elimination Induction variable recognition and elimination Live-variable analysis Use-define chain
SSA-based	Global value numbering Sparse conditional constant propagation
Code generation	Instruction scheduling Instruction selection Register allocation Rematerialization
Functional	Deforestation Tail-call elimination
Global	Interprocedural optimization
Other	Bounds-checking elimination Compile-time function execution Dead-code elimination Expression templates Inline expansion Jump threading Partial evaluation Profile-guided optimization
Static analysis	Alias analysis Array-access analysis Control-flow analysis Data-flow analysis Dependence analysis Escape analysis Pointer analysis Shape analysis Value range analysis