Memory hierarchy

Diagram of the computer memory hierarchy

In computer architecture, the memory hierarchy separates computer storage into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlling technologies. [1] Memory hierarchy affects performance in computer architectural design, algorithm predictions, and lower-level programming constructs involving locality of reference.

Designing for high performance requires considering the restrictions of the memory hierarchy, i.e. the size and capabilities of each component. Each of the various components can be viewed as part of a hierarchy of memories (m1, m2, ..., mn) in which each member mi is typically smaller and faster than the next highest member mi+1 of the hierarchy. To limit waiting by higher levels, a lower level responds by filling a buffer and then signaling that the transfer can begin.

There are four major storage levels. [1]

This is a general way of structuring the memory hierarchy; many other structures are useful. For example, a paging algorithm may be considered as a level for virtual memory when designing a computer architecture, and one can include a level of nearline storage between online and offline storage.

Properties of the technologies in the memory hierarchy

Examples

Memory hierarchy of an AMD Bulldozer server

The number of levels in the memory hierarchy and the performance at each level have increased over time. The types of memory and storage components have also changed over time. [6]

Cache, memory, and external storage hierarchy of a 2020s computer system (AMD Zen 4)

| Level | Size | Throughput | Latency | Notes |
|---|---|---|---|---|
| Register file | 18,432 bits | Up to 256 GB/s (512 bits/cycle) | 0.25 ns (1 cycle) [7] | All CPU-related conversions assume a 4.0 GHz clock, here and below. Full utilization of throughput is impossible on real workloads. Size is per core. |
| CPU cache: L1 data | 32 KiB | Up to 64 GB/s (64 bytes/4 cycles) | 1 ns (4 cycles) [7] | Hardware prefetching is required for maximum throughput. Size and throughput are per core. Code cache has the same size but is not manipulable as data. |
| CPU cache: L2 | 1 MB | Up to 18.3 GB/s (64 bytes/14 cycles) | 3.5 ns (14 cycles) [7] | Size and throughput are per core. |
| CPU cache: L3 | 16–32 MB | Up to 5.45 GB/s (64 bytes/47 cycles) | 11.75 ns (47 cycles) [7] | Size is shared among 8 cores. Throughput is per core. |
| Main memory (primary) | 64 GiB | ~60 GB/s | 82.5 ns | Size is shared among all cores. Latency depends on the memory clock and memory timings. In this case, a result from a pair of 32 GB DDR5 DIMMs set to 6000 MT/s via the factory EXPO profile is used. [8] Systems with multiple CPU sockets have an additional NUMA delay when a CPU tries to access memory under the control of another NUMA node. |
| Mass storage (secondary): solid-state drive | 2 TB | 2000 MB/s | 0.2 ms | Figures for an M.2 NVMe SSD from 2017, the Samsung 960 Pro. [9] |
| Mass storage (secondary): hard disk drive | 18 TB | 500 MB/s | 4.16 ms | Per-drive figures for Exos 2X18 (ST18000NM0092), an enterprise-grade 3.5 inch SATA HDD. [10] |
| Nearline (tertiary): spun-down HDDs (MAID) | Petabytes | | 25 s | Per-drive figures for Exos 2X18 (ST18000NM0092), from user manual entry for "start/stop times". [11] In a typical MAID setup, hundreds of spun-down HDDs may be used for petabytes of storage. |
| Nearline (tertiary): tape library | Exabytes | 160 MB/s [12] | Minutes | |
| Offline storage | Exabytes | Depends on medium | Depends on human operation | |
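The cache figures in the table follow from simple arithmetic at the assumed 4.0 GHz clock. The following Python sketch reproduces that conversion; the function and constant names are illustrative and do not come from any cited source.

```python
CLOCK_HZ = 4.0e9  # the 4.0 GHz clock assumed by the table


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a latency given in clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


def peak_throughput_gb_s(bytes_per_transfer, cycles_per_transfer, clock_hz=CLOCK_HZ):
    """Upper bound on throughput in GB/s, taking 1 GB as 10**9 bytes."""
    return bytes_per_transfer / cycles_per_transfer * clock_hz / 1e9


if __name__ == "__main__":
    # Cycle counts taken from the table rows for the Zen 4 caches.
    for name, cycles in [("L1 data", 4), ("L2", 14), ("L3", 47)]:
        latency = cycles_to_ns(cycles)
        throughput = peak_throughput_gb_s(64, cycles)
        print(f"{name}: {latency:.2f} ns, up to {throughput:.2f} GB/s")
```

These are upper bounds only; as the table notes, real workloads cannot fully utilize the listed throughput.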

Some CPUs include additional levels of cache between L3 and memory. For example, the Haswell microarchitecture includes an L4 cache of 128 MB on mobile units. [13] [14]

The lower levels of the hierarchy from mass storage downwards are also known as tiered storage. The formal distinction between online, nearline, and offline storage is: [15]

For example, always-on spinning disks are online, while spinning disks that spin down, such as a massive array of idle disks (MAID), are nearline. Removable media such as tape cartridges that can be automatically loaded, as in a tape library, are nearline, while cartridges that must be manually loaded are offline.
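The examples above can be read as a two-question decision rule. The following Python sketch encodes that reading; the function name and its two boolean parameters are hypothetical simplifications, not part of any formal definition.

```python
def storage_tier(always_ready, needs_human):
    """Classify storage as "online", "nearline", or "offline".

    always_ready: data is immediately accessible (e.g. always-on spinning disks).
    needs_human:  a person must load the medium before it can be read.
    """
    if needs_human:
        return "offline"
    if always_ready:
        return "online"
    # Available without human intervention, but only after a delay
    # (e.g. disk spin-up or automatic loading in a tape library).
    return "nearline"


if __name__ == "__main__":
    print(storage_tier(always_ready=True, needs_human=False))   # always-on disk
    print(storage_tier(always_ready=False, needs_human=False))  # MAID, tape library
    print(storage_tier(always_ready=False, needs_human=True))   # shelved cartridge
```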

Programming

Most modern CPUs are so fast that, for most program workloads, the bottleneck is the locality of reference of memory accesses and the efficiency of caching and memory transfer between levels of the hierarchy [citation needed]. As a result, the CPU spends much of its time idle, waiting for memory I/O to complete. This is sometimes called the space cost, as a larger memory object is more likely to overflow a small, fast level and require use of a larger, slower level. The resulting load on memory use is known as pressure (register pressure, cache pressure, and (main) memory pressure, respectively). The terms for data being missing from a higher level and needing to be fetched from a lower level are, respectively: register spilling (due to register pressure: register to cache), cache miss (cache to main memory), and (hard) page fault (real main memory to virtual memory, i.e. mass storage, commonly referred to as disk regardless of the actual mass storage technology used).
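The idea that a larger object overflows a small, fast level can be sketched in Python as a lookup over level capacities. The capacities below are assumptions loosely based on the Zen 4 figures in the Examples section (treating MB as MiB and taking 32 MiB for L3), and the names are illustrative. The model is deliberately crude: real caches are shared with other data, and L3 is shared among cores.

```python
# (name, capacity in bytes), ordered from fastest to slowest.
# Assumed values for illustration only.
LEVELS = [
    ("L1 data cache", 32 * 1024),
    ("L2 cache", 1 * 1024**2),
    ("L3 cache", 32 * 1024**2),
    ("main memory", 64 * 1024**3),
]


def fastest_level_that_fits(working_set_bytes):
    """Return the fastest level whose capacity covers the whole working set."""
    for name, capacity in LEVELS:
        if working_set_bytes <= capacity:
            return name
    # Larger than main memory: part of the data must live on mass storage.
    return "mass storage"


if __name__ == "__main__":
    for size in (16 * 1024, 512 * 1024, 8 * 1024**2, 1024**3, 128 * 1024**3):
        print(f"{size:>15,d} bytes -> {fastest_level_that_fits(size)}")
```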

Modern programming languages mainly assume two levels of memory: main (working) memory and mass storage. The exceptions are the relatively low-level assembly languages and the inline assemblers of higher-level languages such as C. Taking optimal advantage of the memory hierarchy requires the cooperation of programmers, hardware, and compilers (as well as underlying support from the operating system):

Many programmers assume one level of memory. This works fine until the application hits a performance wall. At that point, the programmer needs to change the code's memory access patterns so that it works well with cache resources. A classic illustration of the effect of locality and caching is changing the order in which a three-dimensional array is iterated. Computer Systems: A Programmer's Perspective is a classic textbook that deals with this aspect of systems programming. [16]
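The iteration-order effect can be made concrete without timing anything by looking at the distance between consecutive memory accesses. The following Python sketch assumes a row-major (C-style) layout and uses illustrative names. It computes the element offsets touched by a given loop nesting and the strides between them.

```python
from itertools import product


def row_major_offset(i, j, k, ny, nz):
    """Linear element offset of a[i][j][k] in a row-major (C-style) array."""
    return (i * ny + j) * nz + k


def strides(order, shape):
    """Offset differences between consecutive accesses for a loop nesting.

    order lists the array axes from the outermost to the innermost loop,
    e.g. (0, 1, 2) means: for i, then for j, then for k.
    """
    _, ny, nz = shape
    offsets = []
    for idx in product(*(range(shape[axis]) for axis in order)):
        ijk = [0, 0, 0]
        for axis, value in zip(order, idx):
            ijk[axis] = value
        offsets.append(row_major_offset(ijk[0], ijk[1], ijk[2], ny, nz))
    return [b - a for a, b in zip(offsets, offsets[1:])]


if __name__ == "__main__":
    shape = (4, 8, 16)
    print(set(strides((0, 1, 2), shape)))  # last axis innermost
    print(set(strides((2, 1, 0), shape)))  # first axis innermost
```

With the last axis innermost, every access is one element after the previous one, so each cache line fetched is fully used. With the first axis innermost, most accesses jump by ny * nz elements (128 for this shape), so nearly every access touches a different cache line once the array outgrows the cache.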

References

  1. Toy, Wing; Zee, Benjamin (1986). Computer Hardware/Software Architecture. Prentice Hall. p. 30. ISBN 0-13-163502-6.
  2. Write-combining
  3. "Memory Hierarchy". Unitity Semiconductor Corporation. Archived from the original on 5 August 2009. Retrieved 16 September 2009.
  4. Pádraig Brady. "Multi-Core". Retrieved 16 September 2009.
  5. van der Pas, Ruud (2002). "Memory Hierarchy in Cache-Based Systems" (PDF). Santa Clara, California: Sun Microsystems: 26. 817-0742-10.
  6. "Memory & Storage – Timeline of Computer History – Computer History Museum". www.computerhistory.org.
  7. Fog, Agner. "The microarchitecture of Intel and AMD CPUs" (PDF). Chapters used: 24.16 Cache and memory access (Zen 4).
  8. "AMD Ryzen 7000/9000 DDR5 RAM OC Guide XPM and EXPO Profile Benchmarks".
  9. "Samsung 960 Pro M.2 NVMe SSD Review". storagereview.com. 20 October 2016. Retrieved 2017-04-13.
  10. "Datasheet Exos 2X18" (PDF).
  11. "2X18 SATA Product Manual" (PDF).
  12. "Ultrium – LTO Technology – Ultrium GenerationsLTO". Lto.org. Archived from the original on 2011-07-27. Retrieved 2014-07-31.
  13. Crothers, Brooke. "Dissecting Intel's top graphics in Apple's 15-inch MacBook Pro – CNET". News.cnet.com. Retrieved 2014-07-31.
  14. "SiSoftware Zone". Sisoftware.co.uk. Archived from the original on 2014-09-13. Retrieved 2014-07-31.
  15. Pearson, Tony (2010). "Correct use of the term Nearline". IBM Developerworks, Inside System Storage. Archived from the original on 2018-11-27. Retrieved 2015-08-16.
  16. "A Programmer's Perspective: Memory Systems".