Victim cache

Last updated April 01, 2024

A victim cache is a small, typically fully associative cache placed in the refill path of a CPU cache. It stores all the blocks evicted from that level of cache and was originally proposed in 1990. In modern architectures, this function is typically performed by Level 3 or Level 4 caches.

Overview

Victim caching is a hardware technique to improve performance of caches proposed by Norman Jouppi. As mentioned in his paper:^[1]

Miss caching places a fully-associative cache between cache and its re-fill path. Misses in the cache that hit in the miss cache have a one cycle penalty, as opposed to a many cycle miss penalty without the miss cache. Victim Caching is an improvement to miss caching that loads the small fully-associative cache with victim of a miss and not the requested cache line. ^[1]

A victim cache is a hardware cache designed to reduce conflict misses and enhance hit latency for direct-mapped caches. It is utilized in the refill path of a Level 1 cache, where any cache-line evicted from the cache is cached in the victim cache. As a result, the victim cache is populated only when data is evicted from the Level 1 cache. When a miss occurs in the Level 1 cache, the missed entry is checked in the victim cache. If the access yields a hit, the contents of the Level 1 cache line and the corresponding victim cache line are swapped.

Though initially proposed by Jouppi to improve cache performance of a direct-mapped cache Level 1, modern day microprocessors with multi-level cache hierarchy employ Level 3 or Level 4 cache to act as victim cache for the cache lying above it in the memory hierarchy. Intel's Crystal Well ^[2] of its Haswell processors introduced an on-package Level 4 cache which serves as a victim cache to processor's Level 3 cache.^[3] A 4–12 MB Level 3 cache is used as a victim cache in POWER5 (IBM) microprocessors.

Background

As hardware architecture and technology advanced, processor performance and frequency increased at a much faster rate than memory cycle times, resulting in a significant performance gap. The challenge of rising memory latency compared to processor speed has been addressed by incorporating high-speed cache memory.

Direct-mapped caches have faster access time than set-associative caches. However, in direct-mapped caches, when multiple cache blocks in memory map to the same cache line, they end up evicting each other whenever one of them is accessed. This issue, known as the cache-conflict problem, arises due to the limited associativity of the cache. Increasing cache associativity can mitigate this problem, but there are implementation complexities and limitations to how much associativity can be increased. To address the cache conflict problem within the constraints of limited cache associativity, a victim cache is often employed.

Implementation

The behavior of a victim cache in its respective interaction with the corresponding level cache is given below:

Cache Hit: No action

Cache Miss, Victim Hit: The block is in the victim cache and the one in the cache are replaced with each other. This new entry in victim cache becomes the most recently used block.

Cache Miss, Victim Miss: The block is brought to cache from next level. The block evicted from the cache gets stored in Victim cache.

Example:

Suppose a direct-mapped L1 cache with blocks A, B pointing to the same set. It is linked to a 2 entry fully associative victim cache with blocks C, D in it.

The trace to be followed: A, B, A, B…

From the diagram, we can see that, in case of victim cache (VC) hit, blocks A and B are swapped. The least recently used block of VC remains as it is. Hence, it gives an illusion of associativity to the direct-mapped L1 cache, in turn reducing the conflict misses.

In case of two caches, L1 and L2 with exclusive cache policy (L2 does not cache same the memory locations as L1), L2 acts as the victim cache for L1.

Performance implication

While measuring performance improvement by using victim cache, Jouppi^[1] assumed a Level-1 direct-mapped cache augmented with a fully associative cache. For the test suite used by him, on an average 39% of the Level-1 data cache misses are found to be conflict misses, while on an average 29% of the Level-1 instruction misses are found to be conflict misses.^[1] Since conflict misses amount to large percentage of total misses, therefore providing additional associativity by augmenting the Level 1 cache with a victim cache is bound to improve total miss rate significantly.

^[4] Experimental results are deduced by considering a 32-Kb Direct-Mapped, 2-way and fully associative cache augmented with a 256 block (8 KB) victim cache and running on it 8 randomly selected SPEC95 Benchmarks. While the results cannot be generalized for all benchmarks, adding a victim cache provides a miss rate reduction ranging from 10% to 100% for all the cache configuration. The returns although seem to level off beyond victim cache size of 50 blocks, thus proving Jouppi's^[1] observation that victim cache benefits reach a plateau after the first few victim blocks.

Miss rate reduction for a 64 KB cache size are found to be significantly lower, proving that victim caching is not indefinitely scalable.^[4]

While comparing various cache configuration it was found that in certain cases adding a small victim cache can give performance benefit equivalent to that observed by multiplying the cache size by 2.^[4]

Related Research Articles

Athlon is the brand name applied to a series of x86-compatible microprocessors designed and manufactured by AMD. The original Athlon was the first seventh-generation x86 processor and the first desktop processor to reach speeds of one gigahertz (GHz). It made its debut as AMD's high-end processor brand on June 23, 1999. Over the years AMD has used the Athlon name with the 64-bit Athlon 64 architecture, the Athlon II, and Accelerated Processing Unit (APU) chips targeting the Socket AM1 desktop SoC architecture, and Socket AM4 Zen microarchitecture. The modern Zen-based Athlon with a Radeon Graphics processor was introduced in 2019 as AMD's highest-performance entry-level processor.

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

Duron is a line of budget x86-compatible microprocessors manufactured by AMD and released on June 19, 2000. Duron was intended to be a lower-cost offering to complement AMD's then mainstream performance Athlon processor line, and it also competed with rival chipmaker Intel's Pentium III and Celeron processor offerings. The Duron brand name was retired in 2004, succeeded by the AMD's Sempron line of processors as their budget offering.

Celeron is a discontinued series of low-end IA-32 and x86-64 computer microprocessor models targeted at low-cost personal computers, manufactured by Intel. The first Celeron-branded CPU was introduced on April 15, 1998, and was based on the Pentium II.

The Pentium II brand refers to Intel's sixth-generation microarchitecture ("P6") and x86-compatible microprocessors introduced on May 7, 1997. Containing 7.5 million transistors, the Pentium II featured an improved version of the first P6-generation core of the Pentium Pro, which contained 5.5 million transistors. However, its L2 cache subsystem was a downgrade when compared to the Pentium Pros. It is a single-core microprocessor.

<span class="mw-page-title-main">AMD K6-III</span> Microprocessor series by AMD

The K6-III was an x86 microprocessor line manufactured by AMD that launched on February 22, 1999. The launch consisted of both 400 and 450 MHz models and was based on the preceding K6-2 architecture. Its improved 256 KB on-chip L2 cache gave it significant improvements in system performance over its predecessor the K6-2. The K6-III was the last processor officially released for desktop Socket 7 systems, however later mobile K6-III+ and K6-2+ processors could be run unofficially in certain socket 7 motherboards if an updated BIOS was made available for a given board. The Pentium III processor from Intel launched 6 days later.

A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. It is a part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

SPARC64 is a microprocessor developed by HAL Computer Systems and fabricated by Fujitsu. It implements the SPARC V9 instruction set architecture (ISA), the first microprocessor to do so. SPARC64 was HAL's first microprocessor and was the first in the SPARC64 brand. It operates at 101 and 118 MHz. The SPARC64 was used exclusively by Fujitsu in their systems; the first systems, the Fujitsu HALstation Model 330 and Model 350 workstations, were formally announced in September 1995 and were introduced in October 1995, two years late. It was succeeded by the SPARC64 II in 1996.

The Athlon 64 X2 is the first native dual-core desktop central processing unit (CPU) designed by Advanced Micro Devices (AMD). It was designed from scratch as native dual-core by using an already multi-CPU enabled Athlon 64, joining it with another functional core on one die, and connecting both via a shared dual-channel memory controller/north bridge and additional control logic. The initial versions are based on the E stepping model of the Athlon 64 and, depending on the model, have either 512 or 1024 KB of L2 cache per core. The Athlon 64 X2 can decode instructions for Streaming SIMD Extensions 3 (SSE3), except those few specific to Intel's architecture. The first Athlon 64 X2 CPUs were released in May 2005, in the same month as Intel's first dual-core processor, the Pentium D.

The Intel Core microarchitecture is a multi-core processor microarchitecture launched by Intel in mid-2006. It is a major evolution over the Yonah, the previous iteration of the P6 microarchitecture series which started in 1995 with Pentium Pro. It also replaced the NetBurst microarchitecture, which suffered from high power consumption and heat intensity due to an inefficient pipeline designed for high clock rate. In early 2004 the new version of NetBurst (Prescott) needed very high power to reach the clocks it needed for competitive performance, making it unsuitable for the shift to dual/multi-core CPUs. On May 7, 2004 Intel confirmed the cancellation of the next NetBurst, Tejas and Jayhawk. Intel had been developing Merom, the 64-bit evolution of the Pentium M, since 2001, and decided to expand it to all market segments, replacing NetBurst in desktop computers and servers. It inherited from Pentium M the choice of a short and efficient pipeline, delivering superior performance despite not reaching the high clocks of NetBurst.

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is an internal memory, usually high-speed, used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor, scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc. When the scratchpad is a hidden portion of the main memory then it is sometimes referred to as bump storage.

<span class="mw-page-title-main">Nehalem (microarchitecture)</span> CPU microarchitecture by Intel

Nehalem is the codename for Intel's 45 nm microarchitecture released in November 2008. It was used in the first generation of the Intel Core i5 and i7 processors, and succeeds the older Core microarchitecture used on Core 2 processors. The term "Nehalem" comes from the Nehalem River.

The ARM Cortex-A72 is a central processing unit implementing the ARMv8-A 64-bit instruction set designed by ARM Holdings' Austin design centre. The Cortex-A72 is a 3-way decode out-of-order superscalar pipeline. It is available as SIP core to licensees, and its design makes it suitable for integration with other SIP cores into one die constituting a system on a chip (SoC). The Cortex-A72 was announced in 2015 to serve as the successor of the Cortex-A57, and was designed to use 20% less power or offer 90% greater performance.

Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache. If the lower level cache contains only blocks that are not present in the higher level cache, then the lower level cache is said to be exclusive of the higher level cache. If the contents of the lower level cache are neither strictly inclusive nor exclusive of the higher level cache, then it is called non-inclusive non-exclusive (NINE) cache.

Cache placement policies are policies that determine where a particular memory block can be placed when it goes into a CPU cache. A block of memory cannot necessarily be placed at an arbitrary location in the cache; it may be restricted to a particular cache line or a set of cache lines by the cache's placement policy.

Cache hierarchy, or multi-level cache, is a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores.

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.

References

1 2 3 4 5 Jouppi, N. P. (1990-05-01). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. 17th Annual International Symposium on Computer Architecture, 1990. Proceedings. pp. 364–373. doi:10.1109/ISCA.1990.134547. ISBN 0-8186-2047-1.
↑ "Products (Formerly Crystal Well)". Intel® ARK (Product Specs). Retrieved 2016-11-16.
↑ Shimpi, Anand Lal. "Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested" . Retrieved 2016-11-16.
1 2 3 "Victim-Caching for Large Caches and Modern Workloads". 1996. CiteSeerX 10.1.1.27.9810 .{{cite journal}}: Cite journal requires |journal= (help)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:0-1] 1 2 3 4 5 Jouppi, N. P. (1990-05-01). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. 17th Annual International Symposium on Computer Architecture, 1990. Proceedings. pp. 364–373. doi:10.1109/ISCA.1990.134547. ISBN 0-8186-2047-1.

[2] "Products (Formerly Crystal Well)". Intel® ARK (Product Specs). Retrieved 2016-11-16.

[3] Shimpi, Anand Lal. "Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested" . Retrieved 2016-11-16.

[:1-4] 1 2 3 "Victim-Caching for Large Caches and Modern Workloads". 1996. CiteSeerX 10.1.1.27.9810 .{{cite journal}}: Cite journal requires |journal= (help)

[1]

[2]

[3]

[4]