Cache placement policies

Last updated

Cache placement policies are policies that determine where a particular memory block can be placed when it goes into a CPU cache. A block of memory cannot necessarily be placed at an arbitrary location in the cache; it may be restricted to a particular cache line or a set of cache lines [1] by the cache's placement policy. [2] [3]

Contents

There are three different policies available for placement of a memory block in the cache: direct-mapped, fully associative, and set-associative. Originally this space of cache organizations was described using the term "congruence mapping". [4]

Direct-mapped cache

In a direct-mapped cache structure, the cache is organized into multiple sets [1] with a single cache line per set. Based on the address of the memory block, it can only occupy a single cache line. The cache can be framed as a n × 1 column matrix. [5]

To place a block in the cache

To search a word in the cache

Advantages

Disadvantage

Example

Direct-Mapped Cache Direct-Mapped Cache Snehal Img.png
Direct-Mapped Cache

Consider a main memory of 16 kilobytes, which is organized as 4-byte blocks, and a direct-mapped cache of 256 bytes with a block size of 4 bytes. Because the main memory is 16kB, we need a minimum of 14 bits to uniquely represent a memory address.

Since each cache block is of size 4 bytes, the total number of sets in the cache is 256/4, which equals 64 sets.

The incoming address to the cache is divided into bits for Offset, Index and Tag.

Below are memory addresses and an explanation of which cache line they map to:

  1. Address 0x0000 (tag - 0b00_0000, index – 0b00_0000, offset – 0b00) corresponds to block 0 of the memory and maps to the set 0 of the cache.
  2. Address 0x0004 (tag - 0b00_0000, index – 0b00_0001, offset – 0b00) corresponds to block 1 of the memory and maps to the set 1 of the cache.
  3. Address 0x00FF (tag – 0b00_0000, index – 0b11_1111, offset – 0b11) corresponds to block 63 of the memory and maps to the set 63 of the cache.
  4. Address 0x0100 (tag – 0b00_0001, index – 0b00_0000, offset – 0b00) corresponds to block 64 of the memory and maps to the set 0 of the cache.

Fully associative cache

In a fully associative cache, the cache is organized into a single cache set with multiple cache lines. A memory block can occupy any of the cache lines. The cache organization can be framed as 1 × m row matrix. [5]

To place a block in the cache

To search a word in the cache

Fully associative cache Fully-Associative Cache Snehal Img.png
Fully associative cache

Advantages

Disadvantages

Example

Consider a main memory of 16 kilobytes, which is organized as 4-byte blocks, and a fully associative cache of 256 bytes and a block size of 4 bytes. Because the main memory is 16kB, we need a minimum of 14 bits to uniquely represent a memory address.

The total number of sets in the cache is 1, and the set contains 256/4=64 cache lines, as the cache block is of size 4 bytes.

The incoming address to the cache is divided into bits for offset and tag.

Since any block of memory can be mapped to any cache line, the memory block can occupy one of the cache lines based on the replacement policy.

Set-associative cache

Set-associative cache is a trade-off between direct-mapped cache and fully associative cache.

A set-associative cache can be imagined as a n × m matrix. The cache is divided into ‘n’ sets and each set contains ‘m’ cache lines. A memory block is first mapped onto a set and then placed into any cache line of the set.

The range of caches from direct-mapped to fully associative is a continuum of levels of set associativity. (A direct-mapped cache is one-way set-associative and a fully associative cache with m cache lines is m-way set-associative.)

Many processor caches in today's designs are either direct-mapped, two-way set-associative, or four-way set-associative. [5]

To place a block in the cache

To locate a word in the cache

Advantages

Disadvantages

Example

Consider a main memory of 16 kilobytes, which is organized as 4-byte blocks, and a 2-way set-associative cache of 256 bytes with a block size of 4 bytes. Because the main memory is 16kB, we need a minimum of 14 bits to uniquely represent a memory address.

Since each cache block is of size 4 bytes and is 2-way set-associative, the total number of sets in the cache is 256/(4 * 2), which equals 32 sets.

Set-Associative Cache Set-Associative Cache Snehal Img.png
Set-Associative Cache

The incoming address to the cache is divided into bits for Offset, Index and Tag.

Below are memory addresses and an explanation of which cache line on which set they map to:

  1. Address 0x0000 (tag - 0b000_0000, index – 0b0_0000, offset – 0b00) corresponds to block 0 of the memory and maps to the set 0 of the cache. The block occupies a cache line in set 0, determined by the replacement policy for the cache.
  2. Address 0x0004 (tag - 0b000_0000, index – 0b0_0001, offset – 0b00) corresponds to block 1 of the memory and maps to the set 1 of the cache. The block occupies a cache line in set 1, determined by the replacement policy for the cache.
  3. Address 0x00FF (tag – 0b000_0001, index – 0b1_1111, offset – 0b11) corresponds to block 63 of the memory and maps to the set 31 of the cache. The block occupies a cache line in set 31, determined by the replacement policy for the cache.
  4. Address 0x0100 (tag – 0b000_0010, index – 0b0_0000, offset – 0b00) corresponds to block 64 of the memory and maps to the set 0 of the cache. The block occupies a cache line in set 0, determined by the replacement policy for the cache.

Two-way skewed associative cache

Other schemes have been suggested, such as the skewed cache, [8] where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, and so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function. [9] Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones. [10]

Pseudo-associative cache

A true set-associative cache tests all the possible ways simultaneously, using something like a content-addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache and a column-associative cache are examples of a pseudo-associative cache.

In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache, but it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache. [9]

See also

Related Research Articles

In computer science, an array is a data structure consisting of a collection of elements, of same memory size, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called a one-dimensional array.

<span class="mw-page-title-main">Machine code</span> Lowest level instructions executed by a computer

In computer programming, machine code is computer code consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). For conventional binary computers, machine code is the binary representation of a computer program which is actually read and interpreted by the computer. A program in machine code consists of a sequence of machine instructions.

Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).

The Intel x86 computer instruction set architecture has supported memory segmentation since the original Intel 8086 in 1978. It allows programs to address more than 64 KB (65,536 bytes) of memory, the limit in earlier 80xx processors. In 1982, the Intel 80286 added support for virtual memory and memory protection; the original mode was renamed real mode, and the new version was named protected mode. The x86-64 architecture, introduced in 2003, has largely dropped support for segmentation in 64-bit mode.

<span class="mw-page-title-main">Memory management unit</span> Hardware translating virtual addresses to physical address

A memory management unit (MMU), sometimes called paged memory management unit (PMMU), is a computer hardware unit that examines all memory references on the memory bus, translating these requests, known as virtual memory addresses, into physical addresses in main memory.

A bitboard is a specialized bit array data structure commonly used in computer systems that play board games, where each bit corresponds to a game board space or piece. This allows parallel bitwise operations to set or query the game state, or determine moves or plays in the game.

In computer science, a lookup table (LUT) is an array that replaces runtime computation with a simpler array indexing operation, in a process termed as direct addressing. The savings in processing time can be significant, because retrieving a value from memory is often faster than carrying out an "expensive" computation or input/output operation. The tables may be precalculated and stored in static program storage, calculated as part of a program's initialization phase (memoization), or even stored in hardware in application-specific platforms. Lookup tables are also used extensively to validate input values by matching against a list of valid items in an array and, in some programming languages, may include pointer functions to process the matching input. FPGAs also make extensive use of reconfigurable, hardware-implemented, lookup tables to provide programmable hardware functionality. LUTs differ from hash tables in a way that, to retrieve a value with key , a hash table would store the value in the slot where is a hash function i.e. is used to compute the slot, while in the case of LUT, the value is stored in slot , thus directly addressable.

Bus snooping or bus sniffing is a scheme by which a coherency controller (snooper) in a cache monitors or snoops the bus transactions, and its goal is to maintain a cache coherency in distributed shared memory systems. This scheme was introduced by Ravishankar and Goodman in 1983, under the name "write-once" cache coherency. A cache containing a coherency controller (snooper) is called a snoopy cache.

In CPU design, the use of a sum-addressed decoder (SAD) or sum-addressed memory (SAM) decoder is a method of reducing the latency of the CPU cache access and address calculation. This is achieved by fusing the address generation sum operation with the decode operation in the cache SRAM.

A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. It is a part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that uses paged or segmented virtual memory.

<span class="mw-page-title-main">Page table</span> Data structure that maps virtual addresses with physical addresses

A page table is a data structure used by a virtual memory system in a computer to store mappings between virtual addresses and physical addresses. Virtual addresses are used by the program executed by the accessing process, while physical addresses are used by the hardware, or more specifically, by the random-access memory (RAM) subsystem. The page table is a key component of virtual address translation that is necessary to access data in memory. The page table is set up by the computer's operating system, and may be read and written during the virtual address translation process by the memory management unit or by low-level system software or firmware.

Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions in that architecture identify the operand(s) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

In computing, cache replacement policies are optimizing instructions or algorithms which a computer program or hardware-maintained structure can utilize to manage a cache of information. Caching improves performance by keeping recent or often-used data items in memory locations which are faster, or computationally cheaper to access, than normal memory stores. When the cache is full, the algorithm must choose which items to discard to make room for new data.

Memory segmentation is an operating system memory management technique of dividing a computer's primary memory into segments or sections. In a computer system using segmentation, a reference to a memory location includes a value that identifies a segment and an offset within that segment. Segments or sections are also used in object files of compiled programs when they are linked together into a program image and when the image is loaded into memory.

In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.

<span class="mw-page-title-main">Word addressing</span> Computer memory accessed in word units

In computer architecture, word addressing means that addresses of memory on a computer uniquely identify words of memory. It is usually used in contrast with byte addressing, where addresses uniquely identify bytes. Almost all modern computer architectures use byte addressing, and word addressing is largely only of historical interest. A computer that uses word addressing is sometimes called a word machine.

<span class="mw-page-title-main">ST6 and ST7</span> 8-bit microcontroller product lines from STMicroelectronics

The ST6 and ST7 are 8-bit microcontroller product lines from STMicroelectronics. They are commonly used in small embedded applications like washing machines.

A victim cache is a small, typically fully associative cache placed in the refill path of a CPU cache. It stores all the blocks evicted from that level of cache and was originally proposed in 1990. In modern architectures, this function is typically performed by Level 3 or Level 4 caches.

References

  1. 1 2 3 4 5 "The Basics of Cache" (PDF).
  2. "Cache Placement Policies". Archived from the original on Feb 21, 2020.
  3. "Placement Policies". Archived from the original on August 14, 2020.
  4. Mattson, R.L.; Gecsei, J.; Slutz, D. R.; Traiger, I (1970). "Evaluation Techniques for Storage Hierarchies". IBM Systems Journal. 9 (2): 78–117. doi:10.1147/sj.92.0078.
  5. 1 2 3 Solihin, Yan (2015). Fundamentals of Parallel Multi-core Architecture. Taylor & Francis. pp. 136–141. ISBN   978-1482211184.
  6. "Cache Miss Types" (PDF).
  7. "Fully Associative Cache". Archived from the original on December 24, 2017.
  8. André Seznec (1993). "A Case for Two-Way Skewed-Associative Caches". ACM SIGARCH Computer Architecture News. 21 (2): 169–178. doi: 10.1145/173682.165152 .
  9. 1 2 C. Kozyrakis. "Lecture 3: Advanced Caching Techniques" (PDF). Archived from the original (PDF) on September 7, 2012.
  10. Micro-Architecture "Skewed-associative caches have ... major advantages over conventional set-associative caches."