Cache hierarchy

Cache hierarchy, or multi-level cache, is a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores.

Cache hierarchy is a form and part of memory hierarchy and can be considered a form of tiered storage. [1] This design was intended to allow CPU cores to process faster despite the memory latency of main memory access. Accessing main memory can act as a bottleneck for CPU core performance as the CPU waits for data, while making all of main memory high-speed may be prohibitively expensive. High-speed caches are a compromise allowing high-speed access to the data most-used by the CPU, permitting a faster CPU clock. [2]

Figure: Generic multi-level cache organization

Background

In the history of computer and electronic chip development, there was a period when increases in CPU speed outpaced improvements in memory access speed. [3] The gap between the speed of CPUs and memory meant that the CPU would often be idle. [4] CPUs were increasingly capable of executing larger numbers of instructions in a given time, but the time needed to access data from main memory prevented programs from fully benefiting from this capability. [5] This issue motivated the creation of memory models with higher access rates in order to realize the potential of faster processors. [6]

This resulted in the concept of cache memory, first proposed by Maurice Wilkes, a British computer scientist at the University of Cambridge, in 1965. He called such memory models "slave memory". [7] Between roughly 1970 and 1990, papers and articles by Anant Agarwal, Alan Jay Smith, Mark D. Hill, Thomas R. Puzak, and others discussed better cache memory designs. The first cache memory models were implemented at the time, but even as researchers were investigating and proposing better designs, the need for faster memory models continued. This need resulted from the fact that although early cache models improved data access latency, cost and technical limitations made it infeasible for a computer system's cache to approach the size of main memory. From 1990 onward, ideas such as adding a second cache level, as a backup for the first-level cache, were proposed. Jean-Loup Baer, Wen-Hann Wang, Andrew W. Wilson, and others have conducted research on this model. When several simulations and implementations demonstrated the advantages of two-level cache models, the concept of multi-level caches caught on as a new and generally better model of cache memories. Since 2000, multi-level cache models have received widespread attention and are currently implemented in many systems, such as the three-level caches that are present in Intel's Core i7 products. [8]

Multi-level cache

Accessing main memory for each instruction execution may result in slow processing, with the clock speed depending on the time required to find and fetch the data. In order to hide this memory latency from the processor, data caching is used. [9] Whenever data is required by the processor, it is fetched from main memory and stored in a smaller memory structure called a cache. If that data is needed again, the cache is searched first before going to main memory. [10] This structure resides closer to the processor in terms of the time taken to search for and fetch data, compared with main memory. [11] The advantages of using a cache can be shown by calculating the average access time (AAT) for the memory hierarchy with and without the cache. [12]

Average access time (AAT)

Caches, being small in size, may result in frequent misses – when a search of the cache does not provide the sought-after information – resulting in a call to main memory to fetch data. Hence, the AAT is affected by the miss rate of each structure from which it searches for the data. [13]

The AAT for main memory is given by

    AAT(main memory) = Hit time(main memory)

The AAT for a cache is given by

    AAT(cache) = Hit time(cache) + Miss rate(cache) × Miss penalty(cache)

where the miss penalty is the time taken to go to main memory after missing in the cache.

The hit time for caches is less than the hit time for the main memory, so the AAT for data retrieval is significantly lower when accessing data through the cache rather than main memory. [14]

Trade-offs

While using a cache may improve memory latency, it may not always yield the required improvement in the time taken to fetch data, because of the way caches are organized and traversed. For example, direct-mapped caches that are the same size usually have a higher miss rate than fully associative caches. The outcome may also depend on the benchmark used to test the processor and on the pattern of instructions. However, using a fully associative cache may result in more power consumption, as it has to search the whole cache on every access. Due to this, the trade-off between power consumption (and associated heat) and the size of the cache becomes critical in the cache design. [13]
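As a rough sketch of this trade-off (an illustrative Python fragment, not a description of any real hardware; the cache size and helper names are assumptions made for the example), a direct-mapped lookup examines a single line chosen by the address, while a fully associative lookup must compare against every line:

    # Illustrative sketch: lookup effort in a direct-mapped cache versus a
    # fully associative cache of the same size (block addresses only; real
    # hardware compares tags in parallel, but the number of comparisons
    # reflects the extra power a fully associative search consumes).
    NUM_LINES = 8

    def direct_mapped_hit(cache, block_addr):
        """Check exactly one line, selected by the address modulo the cache size."""
        return cache[block_addr % NUM_LINES] == block_addr       # one comparison

    def fully_associative_hit(cache, block_addr):
        """Compare the address against every line in the cache."""
        return any(line == block_addr for line in cache)          # up to NUM_LINES comparisons

    cache = [None] * NUM_LINES
    cache[11 % NUM_LINES] = 11                                    # block 11 lives in line 3
    print(direct_mapped_hit(cache, 11))                           # True, after one comparison
    print(fully_associative_hit(cache, 11))                       # True, after scanning all lines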

Evolution

Figure: Cache hierarchy for up to L3 level of cache and main memory with on-chip L1

In the case of a cache miss, the purpose of using such a structure will be rendered useless and the computer will have to go to the main memory to fetch the required data. However, with a multiple-level cache, if the computer misses the cache closest to the processor (level-one cache or L1) it will then search through the next-closest level(s) of cache and go to main memory only if these methods fail. The general trend is to keep the L1 cache small and at a distance of 1–2 CPU clock cycles from the processor, with the lower levels of caches increasing in size to store more data than L1, hence being more distant but with a lower miss rate. This results in a better AAT. [15] The number of cache levels can be designed by architects according to their requirements after checking for trade-offs between cost, AATs, and size. [16] [17]
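A minimal sketch of this lookup path (illustrative Python only; the level contents below are made-up assumptions) shows that main memory is reached only after every cache level has missed:

    # Illustrative sketch: probe each cache level in order, from the level
    # closest to the processor outward, and fall back to main memory only
    # when every level misses.
    def find_level(block_addr, levels):
        """levels is an ordered list of (name, set_of_cached_blocks), L1 first."""
        for name, blocks in levels:
            if block_addr in blocks:
                return name                     # hit at this level
        return "main memory"                    # missed every cache level

    levels = [("L1", {1, 2}), ("L2", {1, 2, 3, 4}), ("L3", {1, 2, 3, 4, 5, 6})]
    print(find_level(2, levels))                # "L1": found in the closest cache
    print(find_level(5, levels))                # "L3": missed L1 and L2
    print(find_level(9, levels))                # "main memory": missed all levels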

Performance gains

With the technology scaling that allowed memory systems to be accommodated on a single chip, most modern processors have up to three or four cache levels. [18] The reduction in AAT can be understood from the following example, which compares the AAT for different configurations up to an L3 cache.

Example: main memory access time = 50 ns; L1 hit time = 1 ns with a 10% miss rate; L2 hit time = 5 ns with a 1% miss rate; L3 hit time = 10 ns with a 0.2% miss rate.
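A minimal sketch of the resulting calculation (assuming, as in the AAT formula above, that the miss penalty of each level is the average access time of the next level down):

    # Illustrative sketch: AAT computed level by level, where the miss
    # penalty of one cache level is the average access time of the next
    # level down, ending at main memory (figures from the example above).
    def aat(levels, main_memory_ns):
        """levels is a list of (hit_time_ns, miss_rate), ordered L1 outward."""
        if not levels:
            return main_memory_ns
        hit_time, miss_rate = levels[0]
        return hit_time + miss_rate * aat(levels[1:], main_memory_ns)

    MAIN = 50.0
    L1, L2, L3 = (1.0, 0.10), (5.0, 0.01), (10.0, 0.002)

    print(aat([], MAIN))              # 50.0    ns: main memory only
    print(aat([L1], MAIN))            # 6.0     ns: L1 alone
    print(aat([L1, L2], MAIN))        # 1.55    ns: L1 + L2
    print(aat([L1, L2, L3], MAIN))    # ~1.5101 ns: L1 + L2 + L3

Each added level cuts the average access time further, because the relatively slow trip to the next level down is taken only on the small fraction of accesses that miss.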

Disadvantages

Properties

Figure: Cache organization with L1 as separate and L2 as unified

Banked versus unified

In a banked cache, the cache is divided into a cache dedicated to instruction storage and a cache dedicated to data. In contrast, a unified cache contains both the instructions and data in the same cache. [22] During a process, the L1 cache (or the uppermost-level cache, the one connected most closely to the processor) is accessed by the processor to retrieve both instructions and data. Handling both actions at the same time in a unified cache requires multiple ports and additional access time. Having multiple ports requires additional hardware and wiring, leading to significant additional structure between the caches and the processing units. [23] To avoid this, the L1 cache is often organized as a banked cache, which results in fewer ports, less hardware, and generally lower access times. [13]

Modern processors have split L1 caches, and in systems with multi-level caches the caches further from the processor may be unified while the levels closest to it remain split. [24]

Inclusion policies

Figure: Inclusive cache organization

Whether a block present in the upper cache layer can also be present in the lower cache level is governed by the memory system's inclusion policy, which may be inclusive, exclusive or non-inclusive non-exclusive (NINE).[ citation needed ]

With an inclusive policy, all the blocks present in the upper-level cache must also be present in the lower-level cache. Each upper-level cache component is a subset of the lower-level cache component. In this case, since there is duplication of blocks, there is some wastage of memory. However, checking is faster: a lookup in the lower-level cache is enough to determine whether a block is cached at all.[ citation needed ]

Under an exclusive policy, all the cache hierarchy components are completely exclusive, so that any element in the upper-level cache will not be present in any of the lower cache components. This enables complete usage of the cache memory. However, there is a high memory-access latency. [25]

The above policies require a set of rules to be followed in order to implement them. If none of these is enforced, the resulting inclusion policy is called non-inclusive non-exclusive (NINE): a block in the upper-level cache may or may not also be present in the lower-level cache. [21]
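A small sketch of the constraint each policy places on a block's presence in the two levels (illustrative Python; the function and policy names are assumptions used only for this example):

    # Illustrative sketch: which combinations of "block present in the
    # upper-level cache" and "block present in the lower-level cache"
    # each inclusion policy allows.
    def allowed(policy, in_upper, in_lower):
        if policy == "inclusive":
            return (not in_upper) or in_lower    # upper contents must also be in lower
        if policy == "exclusive":
            return not (in_upper and in_lower)   # a block lives in at most one level
        if policy == "NINE":
            return True                          # no constraint either way
        raise ValueError(f"unknown policy: {policy}")

    print(allowed("inclusive", in_upper=True, in_lower=False))   # False: violates inclusion
    print(allowed("exclusive", in_upper=True, in_lower=True))    # False: duplication forbidden
    print(allowed("NINE", in_upper=True, in_lower=False))        # True: either way is allowed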

Write policies

There are two policies which define the way in which a modified cache block will be updated in the main memory: write through and write back.[ citation needed ]

In the case of write through policy, whenever the value of the cache block changes, it is further modified in the lower-level memory hierarchy as well. [26] This policy ensures that the data is stored safely as it is written throughout the hierarchy.

However, in the case of the write back policy, the changed cache block is updated in the lower-level hierarchy only when the cache block is evicted. A "dirty bit" is attached to each cache block and set whenever the cache block is modified. [27] During eviction, blocks with a set dirty bit are written to the lower-level hierarchy. Under this policy, there is a risk of data loss, as the most recently changed copy of a datum is stored only in the cache, so some corrective techniques must be employed.

In the case of a write where the byte is not present in the cache block, the byte may be brought into the cache as determined by a write allocate or write no-allocate policy. [28] Under a write allocate policy, in the case of a write miss the block is fetched from main memory and placed in the cache before writing. [29] Under a write no-allocate policy, if the block misses in the cache the write is performed in the lower-level memory hierarchy without fetching the block into the cache. [30]

The common combinations of these policies are write back with write allocate, and write through with write no-allocate.
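A minimal sketch of how these two combinations behave (illustrative Python; whole-block writes, the dirty-bit representation, and the helper names are assumptions made for the example):

    # Illustrative sketch: write back with write allocate versus
    # write through with write no-allocate. The cache maps a block
    # address to (data, dirty_bit); memory maps a block address to data.
    def write(cache, memory, block, data, policy):
        if block in cache:                                  # write hit
            if policy == "write-back/write-allocate":
                cache[block] = (data, True)                 # mark dirty; memory updated later
            else:                                           # write-through/write-no-allocate
                cache[block] = (data, False)
                memory[block] = data                        # memory updated immediately
        else:                                               # write miss
            if policy == "write-back/write-allocate":
                cache[block] = (memory.get(block), False)   # allocate: fetch the block first
                cache[block] = (data, True)                 # then write it and set the dirty bit
            else:
                memory[block] = data                        # no-allocate: write around the cache

    def evict(cache, memory, block):
        data, dirty = cache.pop(block)
        if dirty:
            memory[block] = data                            # dirty block written back on eviction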

Shared versus private

Figure: Cache organization with L1 private and L2 and L3 shared

A private cache is assigned to one particular core in a processor and cannot be accessed by any other cores. In some architectures, each core has its own private cache; this creates the risk of duplicate blocks in a system's cache architecture, which results in reduced capacity utilization. However, this type of design choice in a multi-layer cache architecture can also yield lower data-access latency. [28] [31] [32]

A shared cache is a cache which can be accessed by multiple cores. [33] Since it is shared, each block in the cache is unique, and the cache therefore has a higher hit rate, as there will be no duplicate blocks. However, data-access latency can increase as multiple cores try to access the same cache. [34]

In multi-core processors, the design choice to make a cache shared or private impacts the performance of the processor. [35] In practice, the upper-level cache L1 (or sometimes L2) [36] [37] is implemented as private and lower-level caches are implemented as shared. This design provides high access rates for the high-level caches and low miss rates for the lower-level caches. [35]

Recent implementation models

Figure: Cache organization of the Intel Nehalem microarchitecture

Intel Broadwell microarchitecture (2014)

Intel Kaby Lake microarchitecture (2016)

AMD Zen microarchitecture (2017)

AMD Zen 2 microarchitecture (2019)

IBM POWER7 (2010)

See also

Related Research Articles

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).

In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.

<span class="mw-page-title-main">Memory hierarchy</span> Computer memory architecture

In computer organisation, the memory hierarchy separates computer storage into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlling technologies. Memory hierarchy affects performance in computer architectural design, algorithm predictions, and lower level programming constructs involving locality of reference.

In computer science, algorithmic efficiency is a property of an algorithm which relates to the amount of computational resources used by the algorithm. An algorithm must be analyzed to determine its resource usage, and the efficiency of an algorithm can be measured based on the usage of different resources. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process.

<span class="mw-page-title-main">Opteron</span> Server and workstation processor line by AMD

Opteron is AMD's former x86 server and workstation processor line, and was the first processor to support the AMD64 instruction set architecture. It was released on April 22, 2003, with the SledgeHammer core (K8) and was intended to compete in the server and workstation markets, particularly in the same segment as the Intel Xeon processor. Processors based on the AMD K10 microarchitecture were announced on September 10, 2007, featuring a new quad-core configuration. The last released Opteron CPUs are the Piledriver-based Opteron 4300 and 6300 series processors, codenamed "Seoul" and "Abu Dhabi" respectively.

<span class="mw-page-title-main">Xeon</span> Line of Intel server and workstation processors

Xeon is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded markets. It was introduced in June 1998. Xeon processors are based on the same architecture as regular desktop-grade CPUs, but have advanced features such as support for error correction code (ECC) memory, higher core counts, more PCI Express lanes, support for larger amounts of RAM, larger cache memory and extra provision for enterprise-grade reliability, availability and serviceability (RAS) features responsible for handling hardware exceptions through the Machine Check Architecture (MCA). They are often capable of safely continuing execution where a normal processor cannot due to these extra RAS features, depending on the type and severity of the machine-check exception (MCE). Some also support multi-socket systems with two, four, or eight sockets through use of the Ultra Path Interconnect (UPI) bus.

A wait state is a delay experienced by a computer processor when accessing external memory or another device that is slow to respond.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with separate instruction-specific and data-specific caches at level 1. Cache memory is typically implemented with static random-access memory (SRAM), which in modern CPUs is by far the largest part of the chip by area; however, SRAM is not always used for all levels (or even any level), and some or all levels may instead be implemented with eDRAM.

<span class="mw-page-title-main">Microarchitecture</span> Component of computer engineering

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

The AMD Family 10h, or K10, is a microprocessor microarchitecture by AMD based on the K8 microarchitecture. The first third-generation Opteron products for servers were launched on September 10, 2007, with the Phenom processors for desktops following and launching on November 11, 2007 as the immediate successors to the K8 series of processors.

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is an internal memory, usually high-speed, used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor, scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc. When the scratchpad is a hidden portion of the main memory then it is sometimes referred to as bump storage.

The SPARC64 V (Zeus) is a SPARC V9 microprocessor designed by Fujitsu. The SPARC64 V was the basis for a series of successive processors designed for servers, and later, supercomputers.

<span class="mw-page-title-main">Athlon II</span> Family of central processing unit models

Athlon II is a family of AMD multi-core 45 nm central processing units, which is aimed at the budget to mid-range market and is a complementary product lineup to the Phenom II.

The z196 microprocessor is a chip made by IBM for their zEnterprise 196 and zEnterprise 114 mainframe computers, announced on July 22, 2010. The processor was developed over a three-year time span by IBM engineers from Poughkeepsie, New York; Austin, Texas; and Böblingen, Germany at a cost of US$1.5 billion. Manufactured at IBM's Fishkill, New York fabrication plant, the processor began shipping on September 10, 2010. IBM stated that it was the world's fastest microprocessor at the time.

Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache. If the lower level cache contains only blocks that are not present in the higher level cache, then the lower level cache is said to be exclusive of the higher level cache. If the contents of the lower level cache are neither strictly inclusive nor exclusive of the higher level cache, then it is called non-inclusive non-exclusive (NINE) cache.

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.

A victim cache is a small, usually fully associative cache placed in the refill path of a CPU cache that stores all the blocks evicted from that level of cache, originally proposed in 1990. In modern architectures, this function is typically performed by Level 3 or Level 4 caches.

<span class="mw-page-title-main">Power10</span> 2020 family of multi-core microprocessors by IBM

Power10 is a superscalar, multithreading, multi-core microprocessor family, based on the open source Power ISA and announced in August 2020 at the Hot Chips conference. Systems with Power10 CPUs became generally available from September 2021 in the IBM Power10 Enterprise E1080 server.

<span class="mw-page-title-main">Zen 3</span> 2020 AMD 7-nanometer processor microarchitecture

Zen 3 is the codename for a CPU microarchitecture by AMD, released on November 5, 2020. It is the successor to Zen 2 and uses TSMC's 7 nm process for the chiplets and GlobalFoundries's 14 nm process for the I/O die on the server chips and 12 nm for desktop chips. Zen 3 powers Ryzen 5000 mainstream desktop processors and Epyc server processors. Zen 3 is supported on motherboards with 500 series chipsets; 400 series boards also saw support on select B450 / X470 motherboards with certain BIOSes. Zen 3 is the last microarchitecture before AMD switched to DDR5 memory and new sockets, which are AM5 for the desktop "Ryzen" chips alongside SP5 and SP6 for the EPYC server platform. According to AMD, Zen 3 has a 19% higher instructions per cycle (IPC) on average than Zen 2.

References

  1. Hennessy, John L; Patterson, David A; Asanović, Krste; Bakos, Jason D; Colwell, Robert P; Bhattacharjee, Abhishek; Conte, Thomas M; Duato, José; Franklin, Diana; Goldberg, David; Jouppi, Norman P; Li, Sheng; Muralimanohar, Naveen; Peterson, Gregory D; Pinkston, Timothy Mark; Ranganathan, Prakash; Wood, David Allen; Young, Clifford; Zaky, Amr (2011). Computer Architecture: a Quantitative Approach (Sixth ed.). ISBN   978-0128119051. OCLC   983459758.
  2. "Cache: Why Level It" (PDF).
  3. Ronald D. Miller; Lars I. Eriksson; Lee A Fleisher, 2014. Miller's Anesthesia E-Book. Elsevier Health Sciences. p. 75. ISBN   978-0-323-28011-2.
  4. Albert Y. Zomaya, 2006. Handbook of Nature-Inspired and Innovative Computing: Integrating Classical Models with Emerging Technologies. Springer Science & Business Media. p. 298. ISBN   978-0-387-40532-2.
  5. Richard C. Dorf, 2018. Sensors, Nanoscience, Biomedical Engineering, and Instruments: Sensors Nanoscience Biomedical Engineering. CRC Press. p. 4. ISBN   978-1-4200-0316-1.
  6. David A. Patterson; John L. Hennessy, 2004. Computer Organization and Design: The Hardware/Software Interface, Third Edition. Elsevier. p. 552. ISBN   978-0-08-050257-1.
  7. "Sir Maurice Vincent Wilkes | British computer scientist". Encyclopædia Britannica. Retrieved 2016-12-11.
  8. John L. Hennessy (Stanford University) and David A. Patterson (University of California, Berkeley). "Memory Hierarchy Design - Part 6. The Intel Core i7, fallacies, and pitfalls". EDN. Retrieved 2022-10-13.
  9. Shane Cook, 2012. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes. pp. 107–109. ISBN   978-0-12-415988-4.
  10. Bruce Hellingsworth; Patrick Hall; Howard Anderson; 2001. Higher National Computing. Routledge. pp. 30–31. ISBN   978-0-7506-5230-8.
  11. Reeta Sahoo, Gagan Sahoo. Infomatic Practices. Saraswati House Pvt Ltd. pp. 1–. ISBN   978-93-5199-433-6.
  12. Phillip A. Laplante; Seppo J. Ovaska; 2011. Real-Time Systems Design and Analysis: Tools for the Practitioner. John Wiley & Sons. pp. 94–95. ISBN   978-1-118-13659-1.
  13. Hennessy and Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann. ISBN 9780123704900.
  14. Cetin Kaya Koc, 2008. Cryptographic Engineering. Springer Science & Business Media. pp. 479–480. ISBN   978-0-387-71817-0.
  15. David A. Patterson; John L. Hennessy; 2008. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann. pp. 489–492. ISBN   978-0-08-092281-2.
  16. Harvey G. Cragon, 2000. Computer Architecture and Implementation. Cambridge University Press. pp. 95–97. ISBN   978-0-521-65168-4.
  17. Baker Mohammad, 2013. Embedded Memory Design for Multi-Core and Systems on Chip. Springer Science & Business Media. pp. 11–14. ISBN   978-1-4614-8881-1.
  18. Gayde, William. "How CPUs are Designed and Built". Techspot. Retrieved 17 August 2019.
  19. Vojin G. Oklobdzija, 2017. Digital Design and Fabrication. CRC Press. p. 4. ISBN   978-0-8493-8604-6.
  20. "Memory Hierarchy".
  21. Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Chapman and Hall. Chapter 5: Introduction to Memory Hierarchy Organization. ISBN 9781482211184.
  22. Yan Solihin, 2015. Fundamentals of Parallel Multicore Architecture. CRC Press. p. 150. ISBN   978-1-4822-1119-1.
  23. Steve Heath, 2002. Embedded Systems Design. Elsevier. p. 106. ISBN   978-0-08-047756-5.
  24. Alan Clements, 2013. Computer Organization & Architecture: Themes and Variations. Cengage Learning. p. 588. ISBN   1-285-41542-6.
  25. "Performance Evaluation of Exclusive Cache Hierarchies" (PDF). Archived from the original (PDF) on 2012-08-13. Retrieved 2016-10-19.
  26. David A. Patterson; John L. Hennessy; 2017. Computer Organization and Design RISC-V Edition: The Hardware Software Interface. Elsevier Science. pp. 386–387. ISBN   978-0-12-812276-1.
  27. Stefan Goedecker; Adolfy Hoisie; 2001. Performance Optimization of Numerically Intensive Codes. SIAM. p. 11. ISBN   978-0-89871-484-5.
  28. Solihin, Yan (2009). Fundamentals of Parallel Computer Architecture. Solihin Publishing. Chapter 6: Introduction to Memory Hierarchy Organization. ISBN 9780984163007.
  29. Harvey G. Cragon, 1996. Memory Systems and Pipelined Processors. Jones & Bartlett Learning. p. 47. ISBN   978-0-86720-474-2.
  30. David A. Patterson; John L. Hennessy; 2007. Computer Organization and Design, Revised Printing, Third Edition: The Hardware/Software Interface. Elsevier. p. 484. ISBN   978-0-08-055033-6.
  31. "Software Techniques for Shared-Cache Multi-Core Systems". 2018-05-24.
  32. "An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors" (PDF). Archived from the original (PDF) on 2016-10-19.
  33. Akanksha Jain; Calvin Lin; 2019. Cache Replacement Policies. Morgan & Claypool Publishers. p. 45. ISBN   978-1-68173-577-1.
  34. David Culler; Jaswinder Pal Singh; Anoop Gupta; 1999. Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional Publishing. p. 436. ISBN   978-1-55860-343-1.
  35. Stephen W. Keckler; Kunle Olukotun; H. Peter Hofstee; 2009. Multicore Processors and Systems. Springer Science & Business Media. p. 182. ISBN 978-1-4419-0263-4.
  36. "Intel Broadwell Microarchitecture".
  37. "Intel Kaby Lake Microarchitecture".
  38. "The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms" (PDF). Archived from the original (PDF) on 2014-08-11.
  39. "IBM Power7".