Cache inclusion policy

Last updated March 17, 2024

Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache. If the lower level cache contains only blocks that are not present in the higher level cache, then the lower level cache is said to be exclusive of the higher level cache. If the contents of the lower level cache are neither strictly inclusive nor exclusive of the higher level cache, then it is called non-inclusive non-exclusive (NINE) cache.^[1]^[2]

Inclusive Policy

Consider an example of a two level cache hierarchy where L2 can be inclusive, exclusive or NINE of L1. Consider the case when L2 is inclusive of L1. Suppose there is a processor read request for block X. If the block is found in L1 cache, then the data is read from L1 cache and returned to the processor. If the block is not found in the L1 cache, but present in the L2 cache, then the cache block is fetched from the L2 cache and placed in L1. If this causes a block to be evicted from L1, there is no involvement of L2. If the block is not found in either L1 or L2, then it is fetched from the main memory and placed in both L1 and L2. Now, if there is an eviction from L2, the L2 cache sends a back invalidation to the L1 cache, so that inclusion is not violated.

As illustrated in Figure 1, initially consider both L1 and L2 caches to be empty (a). Assume that the processor sends a read X request. It will be a miss in both L1 and L2 and hence the block is brought into both L1 and L2 from the main memory as shown in (b). Now, assume the processor issues a read Y request which is a miss in both L1 and L2. So, block Y is placed in both L1 and L2 as shown in (c). If block X has to be evicted from L1, then it is removed from L1 only as shown in (d). If block Y has to be evicted from L2, it sends a back invalidation request to L1 and hence block Y is evicted from L1 as shown in (e).

In order for inclusion to hold, certain conditions need to be satisfied. L2 associativity must be greater than or equal to L1 associativity irrespective of the number of sets. The number of L2 sets must be greater than or equal to the number of L1 sets irrespective of L2 associativity. All reference information from L1 is passed to L2 so that it can update its replacement bits.

One example of inclusive cache is Intel quad core processor with 4x256KB L2 caches and 8MB (inclusive) L3 cache.^[3]

Exclusive Policy

Consider the case when L2 is exclusive of L1. Suppose there is a processor read request for block X. If the block is found in L1 cache, then the data is read from L1 cache and returned to the processor. If the block is not found in the L1 cache, but present in the L2 cache, then the cache block is moved from the L2 cache to the L1 cache. If this causes a block to be evicted from L1, the evicted block is then placed into L2. This is the only way L2 gets populated. Here, L2 behaves like a victim cache. If the block is not found in either L1 or L2, then it is fetched from main memory and placed just in L1 and not in L2.

As illustrated in Figure 2, initially consider both L1 and L2 caches to be empty (a). Assume that the processor sends a read X request. It will be a miss in both L1 and L2 and hence the block is brought into L1 from the main memory as shown in (b). Now, again the processor issues a read Y request which is a miss in both L1 and L2. So, block Y is placed in L1 as shown in (c). If block X has to be evicted from L1, then it is removed from L1 and placed in L2 as shown in (d).

An example of exclusive cache is AMD Opteron with 512 KB (per core) L2 cache, exclusive of L1.^[3]

NINE Policy

Consider the case when L2 is non-inclusive non-exclusive of L1. Suppose there is a processor read request for block X. If the block is found in L1 cache, then the data is read from L1 cache and returned to the processor. If the block is not found in the L1 cache, but present in the L2 cache, then the cache block is fetched from the L2 cache and placed in L1. If this causes a block to be evicted from L1, there is no involvement of L2, which is the same as in the case of inclusive policy. If the block is not found in both L1 and L2, then it is fetched from main memory and placed in both L1 and L2. Now, if there is an eviction from L2, unlike inclusive policy, there is no back invalidation.

As illustrated in Figure 3, initially consider both L1 and L2 caches to be empty (a). Assume that the processor sends a read X request. It will be a miss in both L1 and L2 and hence the block is brought into both L1 and L2 from the main memory as shown in (b). Now, again the processor issues a read Y request which is a miss in both L1 and L2. So, block Y is placed in both L1 and L2 as shown in (c). If block X has to be evicted from L1, then it is removed from L1 only as shown in (d). If block Y has to be evicted from L2, it is evicted from L2 only as shown in (e).

An example of non-inclusive non-exclusive cache is AMD Opteron with non-inclusive L3 cache of 6 MB (shared).^[3]

Comparison

The merit of inclusive policy is that, in parallel systems with per-processor private cache if there is a cache miss other peer caches are checked for the block. If the lower level cache is inclusive of the higher level cache and it is a miss in the lower level cache, then the higher level cache need not be searched. This implies a shorter miss latency for an inclusive cache compared to exclusive and NINE.^[1]

A drawback of an inclusive policy is that the unique memory capacity of the cache is determined by the lower level cache. Unlike the case of exclusive cache, where the unique memory capacity is the combined capacity of all caches in the hierarchy.^[4] If the size of lower level cache is small and comparable with the size of higher level cache, there is more wasted cache capacity in inclusive caches. Although the exclusive cache has more unique memory capacity, it uses more bandwidth since it suffers from a higher rate of filling of new blocks (equal to the rate of higher level cache's misses) as compared to NINE cache which is filled with a new block only when it suffers a miss. Therefore, assessment of cost relative to benefit needs to be done while exploiting the choice between Inclusive, Exclusive and NINE caches.

Value Inclusion: It is not necessary for a block to have same data values when it is cached in both higher and lower level caches even though inclusion is maintained. But, if the data values are the same, value inclusion is maintained.^[1] This depends on the write policy in use, as write back policy does not notify the lower level cache of the changes made to the block in higher level cache. However, in case of write-through cache there is no such concern.

Related Research Articles

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.

Duron is a line of budget x86-compatible microprocessors manufactured by AMD and released on June 19, 2000. Duron was intended to be a lower-cost offering to complement AMD's then mainstream performance Athlon processor line, and it also competed with rival chipmaker Intel's Pentium III and Celeron processor offerings. The Duron brand name was retired in 2004, succeeded by the AMD's Sempron line of processors as their budget offering.

The Athlon 64 is a ninth-generation, AMD64-architecture microprocessor produced by Advanced Micro Devices (AMD), released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP. The Athlon 64 was the second processor to implement the AMD64 architecture and the first 64-bit processor targeted at the average consumer. Variants of the Athlon 64 have been produced for Socket 754, Socket 939, Socket 940, and Socket AM2. It was AMD's primary consumer CPU, and primarily competed with Intel's Pentium 4, especially the Prescott and Cedar Mill core revisions.

The MESI protocol is an Invalidate-based cache coherence protocol, and is one of the most common protocols that support write-back caches. It is also known as the Illinois protocol due to its development at the University of Illinois at Urbana-Champaign. Write back caches can save considerable bandwidth generally wasted on a write through cache. There is always a dirty state present in write back caches that indicates that the data in the cache is different from that in main memory. The Illinois Protocol requires a cache to cache transfer on a miss if the block resides in another cache. This protocol reduces the number of main memory transactions with respect to the MSI protocol. This marks a significant improvement in performance.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

In computing, the MSI protocol - a basic cache-coherence protocol - operates in multiprocessor systems. As with other cache coherency protocols, the letters of the protocol name identify the possible states in which a cache line can be.

The MOSI protocol is an extension of the basic MSI cache coherency protocol. It adds the Owned state, which indicates that the current processor owns this block, and will service requests from other processors for the block.

In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.

Cache pollution describes situations where an executing computer program loads data into CPU cache unnecessarily, thus causing other useful data to be evicted from the cache into lower levels of the memory hierarchy, degrading performance. For example, in a multi-core processor, one core may replace the blocks fetched by other cores into shared cache, or prefetched blocks may replace demand-fetched blocks from the cache.

Adaptive Replacement Cache (ARC) is a page replacement algorithm with better performance than LRU. This is accomplished by keeping track of both frequently used and recently used pages plus a recent eviction history for both. The algorithm was developed at the IBM Almaden Research Center. In 2006, IBM was granted a patent for the adaptive replacement cache policy.

The Firefly cache coherence protocol is the schema used in the DEC Firefly multiprocessor workstation, developed by DEC Systems Research Center. This protocol is a 3 State Write Update Cache Coherence Protocol. Unlike the Dragon protocol, the Firefly protocol updates the Main Memory as well as the Local caches on Write Update Bus Transition. Thus the Shared Clean and Shared Modified States present in case of Dragon Protocol, are not distinguished between in case of Firefly Protocol.

The Dragon Protocol is an update based cache coherence protocol used in multi-processor systems. Write propagation is performed by directly updating all the cached values across multiple processors. Update based protocols such as the Dragon protocol perform efficiently when a write to a cache block is followed by several reads made by other processors, since the updated cache block is readily available across caches associated with all the processors.

The SPARC64 V (Zeus) is a SPARC V9 microprocessor designed by Fujitsu. The SPARC64 V was the basis for a series of successive processors designed for servers, and later, supercomputers.

A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.

<span class="mw-page-title-main">Trace cache</span>

In computer architecture, a trace cache or execution trace cache is a specialized instruction cache which stores the dynamic stream of instructions known as trace. It helps in increasing the instruction fetch bandwidth and decreasing power consumption by storing traces of instructions that have already been fetched and decoded. A trace processor is an architecture designed around the trace cache and processes the instructions at trace level granularity. The formal mathematical theory of traces is described by trace monoids.

Cache placement policies are policies that determine where a particular memory block can be placed when it goes into a CPU cache. A block of memory cannot necessarily be placed at an arbitrary location in the cache; it may be restricted to a particular cache line or a set of cache lines by the cache's placement policy.

Cache hierarchy, or multi-level cache, is a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores.

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.

A victim cache is a small, usually fully associative cache placed in the refill path of a CPU cache that stores all the blocks evicted from that level of cache, originally proposed in 1990. In modern architectures, this function is typically performed by Level 3 or Level 4 caches.

References

1 2 3 Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. pp. 146–150. ISBN 9781482211184.
↑ Culler, David; Gupta, Anoop; Singh, Jaswinder Pal (1999). Parallel Computer Architecture: A Hardware/Software Approach . San Francisco: Morgan Kaufmann Publishers. pp. 369–372. ISBN 1558603433.
1 2 3 "Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems". Proceedings of the 42nd International Symposium on Microarchitecture. MICRO’09.
↑ Ying Zheng; Davis, B.T.; Jordan, M. (2004). "Performance evaluation of exclusive cache hierarchies". IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004. pp. 89–96. doi:10.1109/ISPASS.2004.1291359. ISBN 0-7803-8385-0. S2CID 10784219.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:1-1] 1 2 3 Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. pp. 146–150. ISBN 9781482211184.

[2] Culler, David; Gupta, Anoop; Singh, Jaswinder Pal (1999). Parallel Computer Architecture: A Hardware/Software Approach . San Francisco: Morgan Kaufmann Publishers. pp. 369–372. ISBN 1558603433.

[:0-3] 1 2 3 "Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems". Proceedings of the 42nd International Symposium on Microarchitecture. MICRO’09.

[4] Ying Zheng; Davis, B.T.; Jordan, M. (2004). "Performance evaluation of exclusive cache hierarchies". IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004. pp. 89–96. doi:10.1109/ISPASS.2004.1291359. ISBN 0-7803-8385-0. S2CID 10784219.

[1]

[2]

[3]

[4]