Locality of reference

In computer science, locality of reference, also known as the principle of locality,[1] is the tendency of a processor to access the same set of memory locations repetitively over a short period of time.[2] There are two basic types of reference locality: temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality (also termed data locality[3]) refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as when traversing the elements of a one-dimensional array.
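
For example (a minimal illustrative sketch in C, not part of the formal definition), the following loop exhibits all of these: the accumulator sum is reused on every iteration (temporal locality), while the array a is read element by element in address order (spatial and sequential locality).

#include <stddef.h>

/* Summing an array: `sum` is reused on every iteration (temporal locality),
   while a[0], a[1], ... are read from consecutive addresses
   (spatial and sequential locality). */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}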

Locality is a type of predictable behavior that occurs in computer systems. Systems that exhibit strong locality of reference are good candidates for performance optimization through techniques such as caching, memory prefetching, and advanced branch prediction in the processor pipeline.

Types of locality

There are several different types of locality of reference:

Temporal locality: if a memory location is referenced at one point in time, it is likely to be referenced again in the near future.
Spatial locality (memory locality): if a memory location is referenced, nearby memory locations are likely to be referenced in the near future.
Branch locality: the prospective path of execution is confined to a small set of possible alternatives, for example the two outcomes of a loop's terminating branch.
Equidistant locality: halfway between spatial and branch locality; accesses follow a regular stride, so the location of the next reference can be predicted by a simple linear function.

In order to benefit from temporal and spatial locality, which occur very frequently, most information storage systems are hierarchical. Equidistant locality is usually supported by a processor's diverse nontrivial increment (stride) instructions. For branch locality, contemporary processors have sophisticated branch predictors; on the basis of their predictions, the processor's memory manager tries to collect and preprocess the data of the plausible alternatives.
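
A sketch of equidistant locality (an assumed example for illustration): iterating over one field of an array of structures touches memory with a fixed stride equal to the structure size, a pattern that simple stride prefetchers and increment instructions handle well.

#include <stddef.h>

struct particle { double x, y, z, mass; };   /* assumed layout, for illustration only */

/* Reads `mass` from every element: the accessed addresses advance by
   sizeof(struct particle) on each iteration, a fixed-stride (equidistant) pattern. */
double total_mass(const struct particle *p, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p[i].mass;
    return total;
}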

Relevance

There are several reasons for locality. These reasons are either goals to achieve or circumstances to accept, depending on the aspect, and they range from the most general case to special cases: control flow is largely sequential and loop-based, so a program's behavior is largely predictable; related data items are typically declared, allocated and used close together (the structure of the program); many workloads traverse linear data structures such as arrays with small, regular strides; and the memory hierarchy is used efficiently only when references cluster in these ways.

General usage

If, most of the time, a substantial portion of the references aggregates into clusters, and if the shape of this system of clusters can be predicted well, then it can be exploited for performance optimization. There are several ways to benefit from locality using optimization techniques. The common approaches are increasing the locality of references (generally on the software side), as in the loop-fusion sketch below, and exploiting the locality that is already present (generally on the hardware side), as in the cache, prefetching and branch-prediction mechanisms described in the following sections.
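
One common software-side technique, sketched below under assumed array sizes, is loop fusion: two separate passes over the same array are merged into a single pass, so each element is reused while it is still in cache (better temporal locality).

#include <stddef.h>

/* Before fusion: two passes over a[]; by the second pass, early elements
   may already have been evicted from cache. */
void scale_then_offset(double *a, size_t n, double s, double d)
{
    for (size_t i = 0; i < n; i++) a[i] *= s;
    for (size_t i = 0; i < n; i++) a[i] += d;
}

/* After fusion: each a[i] is loaded once and reused immediately. */
void scale_and_offset_fused(double *a, size_t n, double s, double d)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * s + d;
}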

Spatial and temporal locality usage

Hierarchical memory

Hierarchical memory is a hardware optimization that exploits spatial and temporal locality and can be applied at several levels of the memory hierarchy. Paging obviously benefits from temporal and spatial locality. A cache is a simple example of exploiting temporal locality, because it is a specially designed, faster but smaller memory area, generally used to keep recently referenced data and data near recently referenced data, which can lead to potential performance increases.

Data elements in a cache do not necessarily correspond to data elements that are spatially close in the main memory; however, data elements are brought into cache one cache line at a time. This means that spatial locality is again important: if one element is referenced, a few neighboring elements will also be brought into cache. Finally, temporal locality plays a role on the lowest level, since results that are referenced very closely together can be kept in the machine registers. Some programming languages (such as C) allow the programmer to suggest that certain variables be kept in registers.
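
For instance, C's register storage-class specifier is such a suggestion; modern compilers are free to ignore it and usually make the allocation decision themselves. A minimal sketch:

/* `register` hints that `sum` and `i` should live in CPU registers for the
   duration of the loop; as a consequence, their address may not be taken. */
long dot_product(const int *a, const int *b, int n)
{
    register long sum = 0;
    for (register int i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}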

Data locality is a typical memory reference feature of regular programs (though many irregular memory access patterns exist). It makes the hierarchical memory layout profitable. In computers, memory is divided into a hierarchy in order to speed up data accesses. The lower levels of the memory hierarchy tend to be slower but larger. Thus, a program will achieve greater performance if it uses data while it is cached in the upper levels of the memory hierarchy and avoids bringing other data into those upper levels that would displace data which will be used again shortly. This is an ideal, and sometimes cannot be achieved.
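
A simple sketch of this effect (assuming C's row-major array layout and an illustrative size N): summing a two-dimensional array row by row walks memory sequentially and uses every byte of each cache line brought in, while summing it column by column jumps N elements between consecutive accesses and wastes most of each line.

#define N 1024   /* assumed size, for illustration */

double sum_row_major(double a[N][N])       /* cache-friendly: consecutive addresses */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_major(double a[N][N])    /* cache-unfriendly: stride of N doubles */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}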

Typical memory hierarchy (access times and cache sizes are rough approximations of typical values circa 2013, for the purpose of discussion; actual values and the actual number of levels in the hierarchy vary):

CPU registers: a few hundred bytes, accessible essentially immediately (within the same clock cycle)
L1 cache: on the order of tens of KiB per core, a few clock cycles
L2 cache: on the order of hundreds of KiB to a few MiB, roughly ten clock cycles
L3 cache: several MiB, often shared between cores, tens of clock cycles
Main memory (RAM): gigabytes, on the order of a hundred nanoseconds
Disk (virtual memory, file system): hundreds of gigabytes to terabytes, milliseconds

Modern machines tend to read blocks of lower memory into the next level of the memory hierarchy. If this displaces memory that is in use, the operating system tries to predict which data will be accessed least (or needed furthest in the future) and move it down the memory hierarchy. Prediction algorithms tend to be simple in order to reduce hardware complexity, though they are becoming somewhat more complicated.
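
A minimal sketch of one such simple policy, least-recently-used (LRU) replacement over one set of a cache (the structure and field names are assumptions for illustration, not a description of any particular processor or operating system):

#include <stdint.h>

#define WAYS 8                        /* assumed associativity */

struct cache_line { uint64_t tag; uint64_t last_used; int valid; };

/* Pick the victim to evict: an invalid line if one exists, otherwise the
   line whose last_used timestamp is oldest (accessed least recently). */
int choose_victim(const struct cache_line set[WAYS])
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                 /* empty slot: no eviction needed */
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    }
    return victim;
}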

Matrix multiplication

A common example is matrix multiplication:

for i in 0..n
    for j in 0..m
        for k in 0..p
            C[i][j] = C[i][j] + A[i][k] * B[k][j];

By switching the looping order for j and k, the speedup in large matrix multiplications becomes dramatic, at least for languages that put contiguous array elements in the last dimension. This will not change the mathematical result, but it improves efficiency. In this case, "large" means, approximately, more than 100,000 elements in each matrix, or enough addressable memory such that the matrices will not fit in L1 and L2 caches.

for i in 0..n
    for k in 0..p
        for j in 0..m
            C[i][j] = C[i][j] + A[i][k] * B[k][j];

The reason for this speedup is that in the first case, the reads of A[i][k] are in cache (since the k index is the contiguous, last dimension), but B[k][j] is not, so there is a cache miss penalty on B[k][j]. C[i][j] is irrelevant, because it can be hoisted out of the inner loop (the loop variable there is k):

for i in 0..n
    for j in 0..m
        temp = C[i][j]
        for k in 0..p
            temp = temp + A[i][k] * B[k][j];
        C[i][j] = temp

In the second case, the reads and writes of C[i][j] are both in cache, the reads of B[k][j] are in cache, and the read of A[i][k] can be hoisted out of the inner loop.

for i in 0..n
    for k in 0..p
        temp = A[i][k]
        for j in 0..m
            C[i][j] = C[i][j] + temp * B[k][j];

Thus, the second example has no cache miss penalty in the inner loop while the first example has a cache penalty.

On a 2014-era processor, the second case is approximately five times faster than the first when written in C and compiled with gcc -O3. (A careful examination of the disassembled code shows that in the first case GCC uses SIMD instructions and in the second case it does not, but the cache penalty is much worse than the SIMD gain.)[citation needed]
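
The comparison can be reproduced with a small timing harness along the following lines (a sketch with an assumed matrix size; times vary by machine, and it should be compiled with optimization, e.g. gcc -O3):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024                          /* assumed matrix size, for illustration */

static double A[N][N], B[N][N], C[N][N];

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = rand() / (double)RAND_MAX;
            B[i][j] = rand() / (double)RAND_MAX;
        }

    double t0 = seconds();
    for (int i = 0; i < N; i++)         /* i-j-k order: strided reads of B */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
    double t1 = seconds();

    for (int i = 0; i < N; i++)         /* reset the result matrix */
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    double t2 = seconds();
    for (int i = 0; i < N; i++)         /* i-k-j order: sequential reads of B and C */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
    double t3 = seconds();

    /* Print a checksum so the compiler cannot discard the computations. */
    printf("i-j-k: %.3f s   i-k-j: %.3f s   checksum: %.6f\n",
           t1 - t0, t3 - t2, C[N / 2][N / 2]);
    return 0;
}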

Temporal locality can also be improved in the above example by using a technique called blocking. The larger matrix can be divided into evenly sized sub-matrices, so that the smaller blocks can be referenced (multiplied) several times while in memory.

for (ii = 0; ii < SIZE; ii += BLOCK_SIZE)
    for (kk = 0; kk < SIZE; kk += BLOCK_SIZE)
        for (jj = 0; jj < SIZE; jj += BLOCK_SIZE) {
            maxi = min(ii + BLOCK_SIZE, SIZE);
            for (i = ii; i < maxi; i++) {
                maxk = min(kk + BLOCK_SIZE, SIZE);
                for (k = kk; k < maxk; k++) {
                    maxj = min(jj + BLOCK_SIZE, SIZE);
                    for (j = jj; j < maxj; j++)
                        C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }

The temporal locality of the above solution is provided because a block can be used several times before moving on, so that it is moved in and out of memory less often. Spatial locality is improved because elements with consecutive memory addresses tend to be pulled up the memory hierarchy together.
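
The block size is usually chosen so that the working set of one block multiplication fits in the targeted cache level. For example, assuming a 256 KiB cache and 8-byte double-precision elements, the three BLOCK_SIZE x BLOCK_SIZE blocks in flight (one each from A, B and C) occupy 3 * BLOCK_SIZE^2 * 8 bytes; keeping this below 256 KiB requires BLOCK_SIZE^2 <= 262144 / 24, i.e. roughly 10922, so BLOCK_SIZE of about 104 or less, and a convenient value such as 64 would be a reasonable choice under these assumptions.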

See also

Cache (computing)
Optimizing compiler
Memory hierarchy
Loop nest optimization
CPU cache
Z-order curve
Cache-oblivious algorithm
Loop optimization
Loop interchange
Cache pollution
Data parallelism
In-place matrix transposition
Matrix multiplication algorithm
Glossary of computer hardware terms
Communication-avoiding algorithms
Memory access pattern
Cache prefetching
Cache hierarchy
Parallel external memory

References

  1. Not to be confused with the principle of locality in physics.
  2. Stallings, William (2010). Computer Organization and Architecture: Designing for Performance (8th ed.). Upper Saddle River, NJ: Prentice Hall. ISBN 9780136073734. OCLC 268788976.
  3. "NIST Big Data Interoperability Framework: Volume 1". doi:10.6028/NIST.SP.1500-1r2.
  4. Aho, Lam, Sethi, and Ullman (2007). Compilers: Principles, Techniques & Tools (2nd ed.). Pearson Education.
  5. "A Survey of Cache Bypassing Techniques". JLPEA. 6 (2). 2016.
