Cache performance measurement and metric

Last updated

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.

Contents

The performance of a computer system depends on the performance of all individual units—which include execution units like integer, branch and floating point, I/O units, bus, caches and memory systems. The gap between processor speed and main memory speed has grown exponentially. Until 2001–05, CPU speed, as measured by clock frequency, grew annually by 55%, whereas memory speed only grew by 7%. [1] This problem is known as the memory wall. The motivation for a cache and its hierarchy is to bridge this speed gap and overcome the memory wall.

The critical component in most high-performance computers is the cache. Since the cache exists to bridge the speed gap, its performance measurement and metrics are important in designing and choosing various parameters like cache size, associativity, replacement policy, etc. Cache performance depends on cache hits and cache misses, which are the factors that create constraints to system performance. Cache hits are the number of accesses to the cache that actually find that data in the cache, and cache misses are those accesses that don't find the block in the cache. These cache hits and misses contribute to the term average access time (AAT) also known as AMAT ( average memory access time ), which, as the name suggests, is the average time it takes to access the memory. This is one major metric for cache performance measurement, because this number becomes highly significant and critical as processor speed increases.

Another useful metric to test the performance is Power law of cache misses . It gives you the number of misses when you change the size of the cache, given that the number of misses for one of the cache sizes is known. Similarly, when you want to test the performance of the cache in terms of misses across different associativities, Stack distance profiling is used.

Introduction to types of cache misses

Processor performance increase due to cache hierarchy depends on the number of accesses to the cache that satisfy block requests from the cache (cache hits) versus those that do not. Unsuccessful attempts to read or write data from the cache (cache misses) result in lower level or main memory access, which increases latency. There are three basic types of cache misses known as the 3Cs [2] and some other less popular cache misses.

Compulsory misses

Each memory block when first referenced causes a compulsory miss. This implies that the number of compulsory misses is the number of distinct memory blocks ever referenced. They are sometimes called cold misses too. Cold misses cannot be avoided unless the block is prefetched.

It has been observed that an increase in block size to a certain extent to exploit spatial locality leads to a decrease in cold misses. Increasing block size leads to prefetching of nearby words in a block and preventing future cold misses. Increasing the block size too much can lead to prefetching of useless data, thus increasing the number of cold misses.

Conflict misses

Conflict misses occur when the data required was in the cache previously, but got evicted. These evictions occur because another request was mapped to the same cache line. Generally, conflict misses are measured by subtracting the number of misses in a cache with limited associativity by the number of misses of a fully associative cache of the same size and cache block size.

Since conflict misses can be attributed to the lack of sufficient associativity, increasing the associativity to a certain extent (8‐way associativity almost as effective as fully‐associative) decreases the amount of conflict misses, however, such an approach increases the cache access time and consumes a lot more power than a set associative cache.

Capacity misses

A capacity miss occurs due to the limited size of a cache and not the cache's mapping function. When the working set, i.e., the data that is currently important to the program, is bigger than the cache, capacity misses occur frequently. Out of the 3Cs capacity misses are the hardest to identify, and can be thought of as non-compulsory misses in a fully associative cache. In a single processor system, the misses that exist after subtracting the number of compulsory misses and conflict misses can be categorized as capacity misses.

Since capacity misses can be attributed to the limited size of a cache, a simple way to reduce the number of such misses is to increase the cache size. Although this method is very intuitive, it leads to a longer access time and an increase in cache area and its power consumption. Also, after a certain cache size, the number of misses saturate and do not decrease even on increasing the cache size.

Effect of manipulating basic cache parameters on cache misses. [2]
ParametersCompulsory missesConflict missesCapacity misses
Larger cache sizeNo effectNo effectDecrease
Larger block sizeDecreaseUncertain effectUncertain effect
Larger associativityNo effectDecreaseNo effect

The above three kinds of misses only address uni-processor misses.

Coherence misses

The 3Cs group of cache misses can be extended to 4Cs when a multi-processor system with cache is involved, the fourth C being coherence misses. The coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread's cache has been invalidated by a write from another thread. [3] Coherence in a multi-processor system is maintained if only one copy of a memory block is present or all the copies have the same value. Even if all the copies of memory block do not have the same value, it doesn't necessarily lead to a coherence miss. A coherence miss occurs when threads execute loads such that they observe the different values of the memory block. [4]

The coherence problem is complex and affects the scalability of parallel programs. A global order of all memory accesses to the same location must exist across the system to tackle this problem.

Coverage misses

The 4Cs group of cache misses can be further extended to 5Cs when the multi-processor system includes a coherence directory organized as a cache, i.e., that can replace entries. This fifth C stands for Coverage. [5] The coverage miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the processor's cache has been invalidated as a consequence of a directory eviction. If the directory is not able to track a cache line due to its limited capacity, the line must be invalidated from the processors' cache to maintain Coherence.

System activities such as interrupts, context switches and system calls lead to the process being suspended and its cache state being altered. When the process execution is resumed, it suffers cache misses to restore the cache state that was altered. These misses are called system-related misses. [2]

Furthermore, cache misses due to context switching may be placed into two categories described below.

Replaced misses

When a context switch occurs, the cache state is modified and some of its blocks are replaced. The misses on access to these blocks are called replaced misses.

Reordered misses

Some blocks in the cache may not be replaced due to context switching but their recency is changed. Reordered misses are said to occur when misses occur due to change in recency and not due to the blocks being replaced. However, when the suspended process resumes execution, reordered blocks don't lead to context switch misses when no other misses cause these reordered blocks to be evicted.

System-related misses become significant when context switching occurs regularly. Increasing the cache size leads to a decrease in capacity and conflict misses but it has been observed that it leads to an increase in system-related misses if the cache is still smaller than the working set of the processes sharing the cache. Hence reducing the number of system-related misses presents a challenge.

Average memory access time

These cache misses directly correlate to the increase in cycles per instruction (CPI). However the amount of effect the cache misses have on the CPI also depends on how much of the cache miss can be overlapped with computations due to the ILP ( Instruction-level parallelism ) and how much of it can be overlapped with other cache misses due to Memory-level parallelism. [2] If we ignore both these effects, then the average memory access time becomes an important metric. It provides a measure of the performance of the memory systems and hierarchies. It refers to the average time it takes to perform a memory access. It is the addition of the execution time for the memory instructions and the memory stall cycles. The execution time is the time for a cache access, and the memory stall cycles include the time to service a cache miss and access lower levels of memory. If the access latency, miss rate and miss penalty are known, the average memory access time can be calculated with:

where is the access latency of level one cache, is the miss rate of level one cache and is the additional cycles a miss at a higher level takes to be serviced compared to a hit at higher level, and is calculated with:

this formula can be expanded further and used recursively for all the further levels in the memory hierarchy to get the . [6]

Power law of cache misses

The Power law of cache misses shows a trend in the capacity misses in a particular application of the program as affected by the cache size. This empirical observation led to the mathematical form of power law, which shows the relation between the miss rate and the cache size. It can be stated as

where M is the miss rate for a cache of size C and M0 is the miss rate of a baseline cache. The exponent α is workload-specific and typically ranges from 0.3 to 0.7, with an average of 0.5. The power law was validated on quite a few of real-world benchmarks. [7]

This relation shows that only a small fraction of cache misses can be eliminated for constant increase in cache size. This law holds true only for a certain finite range of cache sizes, up to which the miss rate doesn't flatten out. The miss rate eventually becomes stagnant at a certain, large enough cache size, and after that the relation does not give correct estimates.

Stack distance profile

The stack distance profile is a better representation of how the cache misses are affected by the cache size. The power law of cache misses just showed an rough approximation of the same. A stack distance profile captures the temporal reuse behavior of an application in a fully or set-associative cache. [8]

Applications that exhibit more temporal reuse behavior generally access data that is more recently used. Let us assume the associativity of a cache to be . To collect the stack distance profile information of this cache, assuming it has LRU replacement policy, counters are used starting from to and one additional counter , which keeps count of the misses. The counter increments when there is a hit in the way and the counter is incremented on every miss. The stack distance profile shows the trend of hits, decreasing from the most recently used data to the least recently used. Using this stack distance profile information, the cache miss for a cache with associativity and LRU replacement policy, where can be computed as

This profiling information has a limitation that it can only capture the temporal reuse across different associativities. For other purposes, the temporal reuse must be studied in greater detail.

See also

Notes

  1. Hennessy, J. and Patterson, D. (2003). Computer Architecture : a Quantitative Approach,3rd edition. Morgan-Kaufmann Publishers, Inc. ISBN   9781558607248.{{cite book}}: CS1 maint: multiple names: authors list (link)
  2. 1 2 3 4 Solihin, Yan (2015-11-17). Fundamentals of Parallel Multicore Architecture, 2016 edition. Chapman & Hall. ISBN   978-1482211184.
  3. "Modeling Cache Coherence Misses on Multicores" (PDF).{{cite journal}}: Cite journal requires |journal= (help)
  4. Sweden, Michel Dubois, University of Southern California, USA, Murali Annavaram, University of Southern California, USA, Per Stenström, Chalmers University of Technology (2012). Parallel computer organization and design. Cambridge: Cambridge University Press. ISBN   9781139051224.
  5. Ros, Alberto; Cuesta, Blas; Fernández-Pascual, Ricardo; Gómez, María E.; Acacio, Manuel E.; Robles, Antonio; García, José M.; Duato, José (2010). EMC2: Extending Magny-Cours Coherence for Large-Scale Servers. 17th International Conference on High Performance Computing (HiPC). pp. 1–10. doi:10.1109/HIPC.2010.5713176. ISBN   978-1-4244-8518-5.
  6. Patterson, John L. Hennessy, David A. (2011). Computer architecture : a quantitative approach (5th ed.). San Francisco, Calif.: Morgan Kaufmann. ISBN   978-0-12-383872-8.
  7. Hartstein, A.; Srinivasan, V.; Puzak, T. R.; Emma, P. G. (2006-01-01). "Cache miss behavior". Proceedings of the 3rd conference on Computing frontiers. CF '06. pp. 313–320. doi:10.1145/1128022.1128064. ISBN   978-1595933027. S2CID   17728397.
  8. Mattson, R.L.; Gecsei, J.; Slutz, D. R.; Traiger, I (1970). "Evaluation Techniques for Storage Hierarchies". IBM Systems Journal. 9 (2): 78–117. doi:10.1147/sj.92.0078.

Related Research Articles

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.

<span class="mw-page-title-main">Memory hierarchy</span> Computer memory architecture

In computer organisation, the memory hierarchy separates computer storage into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlling technologies. Memory hierarchy affects performance in computer architectural design, algorithm predictions, and lower level programming constructs involving locality of reference.

In computer science, distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as a single shared address space. The term "shared" does not mean that there is a single centralized memory, but that the address space is shared—i.e., the same physical address on two processors refers to the same location in memory. Distributed global address space (DGAS), is a similar term for a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's private memory.

A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. It is a part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.

In computer science, thrashing occurs in a system with virtual memory when a computer's real storage resources are overcommitted, leading to a constant state of paging and page faults, slowing most application-level processing. This causes the performance of the computer to degrade or collapse. The situation can continue indefinitely until either the user closes some running applications or the active processes free up additional virtual memory resources.

In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations for the purpose of locality optimization or parallelization or another loop overhead reduction of the loop nests. One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common linear algebra algorithms.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

In computing, cache replacement policies are optimizing instructions, or algorithms, that a computer program or a hardware-maintained structure can utilize in order to manage a cache of information stored on the computer. Caching improves performance by keeping recent or often-used data items in memory locations that are faster or computationally cheaper to access than normal memory stores. When the cache is full, the algorithm must choose which items to discard to make room for the new ones.

In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.

Cache pollution describes situations where an executing computer program loads data into CPU cache unnecessarily, thus causing other useful data to be evicted from the cache into lower levels of the memory hierarchy, degrading performance. For example, in a multi-core processor, one core may replace the blocks fetched by other cores into shared cache, or prefetched blocks may replace demand-fetched blocks from the cache.

Communication-avoiding algorithms minimize movement of data within a memory hierarchy for improving its running-time and energy consumption. These minimize the total of two costs : arithmetic and communication. Communication, in this context refers to moving data, either between levels of memory or between multiple processors over a network. It is much more expensive than arithmetic.

In computer engineering, directory-based cache coherence is a type of cache coherence mechanism, where directories are used to manage caches in place of bus snooping. Bus snooping methods scale poorly due to the use of broadcasting. These methods can be used to target both performance and scalability of directory systems.

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed. Most modern computer processors have fast and local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache. If the lower level cache contains only blocks that are not present in the higher level cache, then the lower level cache is said to be exclusive of the higher level cache. If the contents of the lower level cache are neither strictly inclusive nor exclusive of the higher level cache, then it is called non-inclusive non-exclusive (NINE) cache.

Cache placement policies are policies that determine where a particular memory block can be placed when it goes into a CPU cache. A block of memory cannot necessarily be placed at an arbitrary location in the cache; it may be restricted to a particular cache line or a set of cache lines by the cache's placement policy.

A power law is a mathematical relationship between two quantities in which one is directly proportional to some power of the other. The power law for cache misses was first established by C. K. Chow in his 1974 paper, supported by experimental data on hit ratios for stack processing by Richard Mattson in 1971. The power law of cache misses can be used to narrow down the cache sizes to practical ranges, given a tolerable miss rate, as one of the early steps while designing the cache hierarchy for a uniprocessor system.

<span class="mw-page-title-main">Cache hierarchy</span> Memory hierarchy concept applied to CPU caches with multiple levels

Cache hierarchy, or multi-level caches, refers to a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores.

A victim cache is a small, usually fully associative cache placed in the refill path of a CPU cache that stores all the blocks evicted from that level of cache, originally proposed in 1990. In modern architectures, this function is typically performed by Level 3 or Level 4 caches.

<span class="mw-page-title-main">Parallel external memory</span>

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.