Cache prefetching

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before they are actually needed (hence the term 'prefetch'). [1] [2] Most modern computer processors have fast, local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, cache memories are typically much faster to access than main memory, so prefetching data and then accessing it from the cache is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

Data vs. instruction cache prefetching

Cache prefetching can either fetch data or instructions into cache.

Hardware vs. software cache prefetching

Cache prefetching can be accomplished either by hardware or by software. [3]

Methods of hardware prefetching

Stream buffers

A typical stream buffer setup as originally proposed by Norman Jouppi in 1990.

Strided prefetching

This type of prefetching monitors the delta between the addresses of the memory accesses and looks for patterns within it.

Regular strides

In this pattern, consecutive memory accesses are made to blocks that are a fixed stride apart. [3] [12] In this case, the prefetcher calculates the stride and uses it to compute the memory address to prefetch. For example, if the stride is 4 and the current access is to address A, the address to be prefetched would be A + 4.
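
As a rough illustration of this idea (a hypothetical sketch, not one of the cited designs), a per-stream stride prefetcher can be expressed as follows; the structure, field names, and the issue_prefetch() hook are assumptions made for the example:

/* Hypothetical sketch of a simple stride prefetcher. issue_prefetch() stands
 * in for whatever mechanism actually brings a block into the cache. */
void issue_prefetch(unsigned long addr);   /* assumed hardware/runtime hook */

typedef struct {
    unsigned long last_addr;   /* address of the previous access in this stream */
    long          stride;      /* last observed delta between accesses */
    int           trained;     /* set once the same stride has been seen twice */
} stride_entry;

void on_access(stride_entry *e, unsigned long addr)
{
    long delta = (long)(addr - e->last_addr);
    if (delta != 0 && delta == e->stride) {
        e->trained = 1;
        issue_prefetch(addr + delta);      /* e.g. stride 4: prefetch A + 4 */
    } else {
        e->trained = 0;
        e->stride = delta;                 /* retrain on the newly observed delta */
    }
    e->last_addr = addr;
}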

Irregular spatial strides

In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetcher designs [9] [13] [14] exploit this property to predict and prefetch future accesses.
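
One way to exploit such patterns, shown here only as a loosely simplified sketch rather than any of the cited designs, is to remember which delta tends to follow the previously observed delta; the table size, the hashing, and the issue_prefetch() hook are assumptions of this example:

/* Hypothetical sketch: a tiny "delta history" table that predicts the next
 * address delta from the most recent one. */
void issue_prefetch(unsigned long addr);          /* assumed prefetch hook */

#define TABLE_SIZE 256
static long next_delta[TABLE_SIZE];               /* last delta -> delta that followed it */

static unsigned delta_index(long d) { return (unsigned)d & (TABLE_SIZE - 1); }

void on_access_irregular(unsigned long addr, unsigned long *prev_addr, long *prev_delta)
{
    long d = (long)(addr - *prev_addr);
    next_delta[delta_index(*prev_delta)] = d;     /* learn: *prev_delta was followed by d */
    long predicted = next_delta[delta_index(d)];  /* delta that tends to follow d */
    if (predicted != 0)
        issue_prefetch(addr + predicted);
    *prev_delta = d;
    *prev_addr  = addr;
}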

Irregular temporal prefetching

This class of prefetchers looks for memory access streams that repeat over time. [15] [16] For example, in this stream of memory accesses: N, A, B, C, E, G, H, A, B, C, I, J, K, A, B, C, L, M, N, O, A, B, C, ...; the stream A, B, C repeats over time. Other design variations have tried to provide more efficient, higher-performance implementations. [17] [18]
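
A toy version of this idea (a hypothetical simplification of the Markov-style correlation prefetchers cited above, not their actual organization) records which miss address followed each miss address last time, and replays that successor when the address recurs:

/* Hypothetical sketch of temporal (correlation-based) prefetching. Real designs
 * use set-associative correlation tables and store several successors per entry. */
void issue_prefetch(unsigned long addr);              /* assumed prefetch hook */

#define ENTRIES 1024
static unsigned long successor[ENTRIES];              /* miss address -> address that followed it */

static unsigned addr_index(unsigned long a) { return (unsigned)(a >> 6) & (ENTRIES - 1); }

void on_miss(unsigned long addr, unsigned long *prev_miss)
{
    successor[addr_index(*prev_miss)] = addr;         /* learn: *prev_miss was followed by addr */
    unsigned long next = successor[addr_index(addr)]; /* was addr seen before, and what came next? */
    if (next != 0)
        issue_prefetch(next);                         /* replay A -> B -> C when A recurs */
    *prev_miss = addr;
}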

Collaborative prefetching

Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further shape the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend on the application and the architecture used to execute it. [19] Recent research [20] [21] has focused on building collaborative mechanisms that synergistically combine multiple prefetching schemes for better prefetching coverage and accuracy.

Methods of software prefetching

Compiler directed prefetching

Compiler directed prefetching is widely used within loops with a large number of iterations. In this technique, the compiler predicts future cache misses and inserts a prefetch instruction based on the miss penalty and execution time of the instructions.

These prefetches are non-blocking memory operations, i.e. these memory accesses do not interfere with actual memory accesses. They do not change the state of the processor or cause page faults.

One main advantage of software prefetching is that it reduces the number of compulsory cache misses. [3]

The following example shows how a prefetch instruction would be added into code to improve cache performance.

Consider a for loop as shown below:

for (int i = 0; i < 1024; i++) {
    array1[i] = 2 * array1[i];
}

At each iteration, the ith element of the array "array1" is accessed. Therefore, the system can prefetch the elements that are going to be accessed in future iterations by inserting a "prefetch" instruction as shown below:

for (int i = 0; i < 1024; i++) {
    prefetch(array1[i + k]);
    array1[i] = 2 * array1[i];
}

Here, the prefetch stride k depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the for loop. For instance, if one iteration of the loop takes 7 cycles to execute and the cache miss penalty is 49 cycles, then k should be 49 / 7 = 7, which means that the system should prefetch 7 elements ahead. With the first iteration, i will be 0, so the system prefetches the 7th element. Now, with this arrangement, the first 7 accesses (i = 0 to 6) will still be misses (under the simplifying assumption that each element of array1 is in a separate cache line of its own).
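
For illustration, the same loop can be written with a real compiler intrinsic, GCC/Clang's __builtin_prefetch, using the prefetch distance k = 7 computed above; the function name, the bounds check, and the array length parameter are assumptions of this sketch, not code a compiler is guaranteed to emit:

#include <stddef.h>

void scale_array(int *array1, size_t n)
{
    const size_t k = 7;    /* miss penalty (49 cycles) / cycles per iteration (7) */
    for (size_t i = 0; i < n; i++) {
        if (i + k < n)
            __builtin_prefetch(&array1[i + k]);   /* non-blocking hint; does not fault */
        array1[i] = 2 * array1[i];
    }
}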

Comparison of hardware and software prefetching

Metrics of cache prefetching

There are three main metrics by which to judge cache prefetching. [3]

Coverage

Coverage is the fraction of total misses that are eliminated because of prefetching, i.e.

Coverage = (cache misses eliminated by prefetching) / (total cache misses without prefetching),

where the total cache misses without prefetching are the misses eliminated by prefetching plus the misses that remain despite prefetching.

Accuracy

Accuracy is the fraction of total prefetches that were useful, i.e. the ratio of the number of prefetched memory addresses that were actually referenced by the program to the total number of prefetches issued:

Prefetch accuracy = (prefetched blocks actually referenced by the program) / (total prefetches issued).

While it appears that having perfect accuracy might imply that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses observed without any prefetching, this is a non-zero number of misses.
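
As a minimal sketch of how these two metrics might be computed from simulation counters (the counter names are hypothetical):

/* Hypothetical counters from a cache simulation run with and without prefetching. */
double prefetch_coverage(long misses_eliminated, long total_misses_without_prefetching)
{
    return (double)misses_eliminated / (double)total_misses_without_prefetching;
}

double prefetch_accuracy(long prefetches_referenced, long total_prefetches_issued)
{
    return (double)prefetches_referenced / (double)total_prefetches_issued;
}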

Timeliness

The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced. An example to further explain timeliness is as follows:

Consider a for loop where each iteration takes 3 cycles to execute and the 'prefetch' operation takes 12 cycles. This implies that for the prefetched data to be useful, the system must start the prefetch 12 / 3 = 4 iterations prior to its usage to maintain timeliness.
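
Continuing that example as a hypothetical sketch (the array name, loop bound, and generic prefetch() call are assumptions, as in the earlier loop example):

/* Each iteration takes about 3 cycles and a prefetch takes about 12 cycles to
 * complete, so data must be requested at least 12 / 3 = 4 iterations early. */
for (int i = 0; i < n; i++) {
    prefetch(array1[i + 4]);   /* issued 4 iterations ahead so the data arrives in time */
    array1[i] = 2 * array1[i];
}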

See also

Related Research Articles

Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

In a computer operating system that uses paging for virtual memory management, page replacement algorithms decide which memory pages to page out, sometimes called swap out, or write to disk, when a page of memory needs to be allocated. Page replacement happens when a requested page is not in memory and a free page cannot be used to satisfy the allocation, either because there are none, or because the number of free pages is lower than some threshold.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with separate instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), which in modern CPUs is by far the largest part of the chip by area; however, SRAM is not always used for all levels (or even any level), and some or all of the outer levels may instead be implemented with eDRAM.

Cache only memory architecture (COMA) is a computer memory organization for use in multiprocessors in which the local memories at each node are used as cache. This is in contrast to using the local memories as actual main memory, as in NUMA organizations.

In computing, cache replacement policies are optimizing instructions or algorithms which a computer program or hardware-maintained structure can utilize to manage a cache of information. Caching improves performance by keeping recent or often-used data items in memory locations which are faster, or computationally cheaper to access, than normal memory stores. When the cache is full, the algorithm must choose which items to discard to make room for new data.

In computer engineering, out-of-order execution is a paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

In computer science and engineering, transactional memory attempts to simplify concurrent programming by allowing a group of load and store instructions to execute in an atomic way. It is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. Transactional memory systems provide high-level abstraction as an alternative to low-level thread synchronization. This abstraction allows for coordination between concurrent reads and writes of shared data in parallel systems.

The International Symposium on Computer Architecture (ISCA) is an annual academic conference on computer architecture, generally viewed as the top-tier in the field. Association for Computing Machinery's Special Interest Group on Computer Architecture and Institute of Electrical and Electronics Engineers Computer Society are technical sponsors.

In computer architecture, memory-level parallelism (MLP) is the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.

Runahead is a technique that allows a computer processor to speculatively pre-process instructions during cache miss cycles. The pre-processed instructions are used to generate instruction and data stream prefetches by executing instructions leading to cache misses before they would normally occur, effectively hiding memory latency. In runahead, the processor uses the idle execution resources to calculate instruction and data stream addresses using the available information that is independent of a cache miss. Once the processor has resolved the initial cache miss, all runahead results are discarded, and the processor resumes execution as normal. The primary use case of the technique is to mitigate the effects of the memory wall. The technique may also be used for other purposes, such as pre-computing branch outcomes to achieve highly accurate branch prediction.

The Alpha 21464 is an unfinished microprocessor that implements the Alpha instruction set architecture (ISA) developed by Digital Equipment Corporation and later by Compaq after it acquired Digital. The microprocessor was also known as EV8. Slated for a 2004 release, it was canceled on 25 June 2001 when Compaq announced that Alpha would be phased out in favor of Itanium by 2004. When it was canceled, the Alpha 21464 was at a late stage of development but had not been taped out.

Gather/scatter is a type of memory addressing that at once collects (gathers) from, or stores (scatters) data to, multiple, arbitrary indices. Examples of its use include sparse linear algebra operations, sorting algorithms, fast Fourier transforms, and some computational graph theory problems. It is the vector equivalent of register indirect addressing, with gather involving indexed reads, and scatter, indexed writes. Vector processors have hardware support for gather and scatter operations, as do many input/output systems, allowing large data sets to be transferred to main memory more rapidly.

Electronic systems’ power consumption has been a challenge for hardware and software designers, as well as for users, especially in portable devices such as cell phones and laptop computers. Power consumption has also been an issue for many industries that use computer systems heavily, such as Internet service providers running servers or companies with many employees using computers and other computational devices. Many different approaches have been devised by researchers to estimate power consumption efficiently. This survey focuses on the different methods by which power consumption can be estimated or measured in real time.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

Approximate computing is an emerging paradigm for energy-efficient and/or high-performance design. It includes a plethora of computation techniques that return a possibly inaccurate result rather than a guaranteed accurate result, and that can be used for applications where an approximate result is sufficient for its purpose. One example of such situation is for a search engine where no exact answer may exist for a certain search query and hence, many answers may be acceptable. Similarly, occasional dropping of some frames in a video application can go undetected due to perceptual limitations of humans. Approximate computing is based on the observation that in many scenarios, although performing exact computation requires large amount of resources, allowing bounded approximation can provide disproportionate gains in performance and energy, while still achieving acceptable result accuracy. For example, in k-means clustering algorithm, allowing only 5% loss in classification accuracy can provide 50 times energy saving compared to the fully accurate classification.

An AI accelerator, deep learning processor, or neural processing unit is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and machine vision. Typical applications include algorithms for robotics, Internet of Things, and other data-intensive or sensor-driven tasks. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. As of 2024, a typical AI integrated circuit chip contains tens of billions of MOSFET transistors.

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.


Babak Falsafi is a computer scientist specializing in computer architecture and digital platform design. He is the founding director of EcoCloud at EPFL, an industrial/academic consortium investigating efficient and intelligent data-centric technologies. He is a professor in the School of Computer and Communication Sciences at EPFL. Prior to that he was a professor of electrical and computer engineering at Carnegie Mellon University, and an assistant professor of electrical and computer engineering at Purdue University. He holds a bachelor's degree in computer science and a bachelor's degree in electrical and computer engineering, both with distinction from SUNY Buffalo, and a master's degree and PhD in computer science from the University of Wisconsin–Madison.

Norman Paul Jouppi is an American electrical engineer and computer scientist.

Trevor Mudge is a computer scientist, academic and researcher. He is the Bredt Family Chair of Computer Science and Engineering, and Professor of Electrical Engineering and Computer Science at the University of Michigan.

References

  1. Smith, Alan Jay (1982-09-01). "Cache Memories". ACM Comput. Surv. 14 (3): 473–530. doi:10.1145/356887.356892. ISSN 0360-0300. S2CID 6023466.
  2. Li, Chunlin; Song, Mingyang; Du, Shaofeng; Wang, Xiaohai; Zhang, Min; Luo, Youlong (2020-09-01). "Adaptive priority-based cache replacement and prediction-based cache prefetching in edge computing environment". Journal of Network and Computer Applications. 165: 102715. doi:10.1016/j.jnca.2020.102715. S2CID   219506427.
  3. Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Boca Raton, Florida: CRC Press, Taylor & Francis Group. p. 163. ISBN 978-1482211184.
  4. Baer, Jean-Loup; Chen, Tien-Fu (1991-01-01). An Effective On-chip Preloading Scheme to Reduce Data Access Penalty. 1991 ACM/IEEE Conference on Supercomputing. Albuquerque, New Mexico, USA: Association for Computing Machinery. pp. 176–186. CiteSeerX   10.1.1.642.703 . doi:10.1145/125826.125932. ISBN   978-0897914598.
  5. Porterfield, Allan Kennedy (1989-01-01). Software methods for improvement of cache performance on supercomputer applications (Thesis). Rice University. hdl:1911/19069.
  6. Jouppi, Norman P. (1990). "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers". Proceedings of the 17th Annual International Symposium on Computer Architecture – ISCA 1990. New York, New York, USA: Association for Computing Machinery Press. pp. 364–373. CiteSeerX 10.1.1.37.6114. doi:10.1145/325164.325162. ISBN 0-89791-366-3.
  7. Chen, Tien-Fu; Baer, Jean-Loup (1995-05-01). "Effective hardware-based data prefetching for high-performance processors". IEEE Transactions on Computers. 44 (5): 609–623. doi:10.1109/12.381947. ISSN   0018-9340. S2CID   1450745.
  8. Palacharla, S.; Kessler, R. E. (1994-01-01). Evaluating Stream Buffers As a Secondary Cache Replacement. 21st Annual International Symposium on Computer Architecture. Chicago, Illinois, USA: IEEE Computer Society Press. pp. 24–33. CiteSeerX   10.1.1.92.3031 . doi:10.1109/ISCA.1994.288164. ISBN   978-0818655104.
  9. Grannaes, Marius; Jahre, Magnus; Natvig, Lasse (2011). "Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables". Journal of Instruction-Level Parallelism (13): 1–16. CiteSeerX 10.1.1.229.3483.
  10. Ishii, Yasuo; Inaba, Mary; Hiraki, Kei (2009-06-08). "Access map pattern matching for data cache prefetch". Proceedings of the 23rd International Conference on Supercomputing. ICS 2009. New York, New York, USA: Association for Computing Machinery. pp. 499–500. doi:10.1145/1542275.1542349. ISBN   978-1-60558-498-0. S2CID   37841036.
  11. Srinath, Santhosh; Mutlu, Onur; Kim, Hyesoon; Patt, Yale N. (February 2007). Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. 2007 IEEE 13th International Symposium on High Performance Computer Architecture. pp. 63–74. doi:10.1109/HPCA.2007.346185. ISBN   978-1-4244-0804-7. S2CID   6909725.
  12. Kondguli, Sushant; Huang, Michael (November 2017). T2: A Highly Accurate and Energy Efficient Stride Prefetcher. 2017 IEEE International Conference on Computer Design (ICCD). pp. 373–376. doi:10.1109/ICCD.2017.64. ISBN   978-1-5386-2254-4. S2CID   11055312.
  13. Shevgoor, Manjunath; Koladiya, Sahil; Balasubramonian, Rajeev; Wilkerson, Chris; Pugsley, Seth H.; Chishti, Zeshan (December 2015). Efficiently prefetching complex address patterns. 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 141–152. doi:10.1145/2830772.2830793. ISBN   9781450340342. S2CID   15294463.
  14. Kim, Jinchun; Pugsley, Seth H.; Gratz, Paul V.; Reddy, A.L. Narasimha; Wilkerson, Chris; Chishti, Zeshan (October 2016). Path confidence based lookahead prefetching. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1–12. doi:10.1109/MICRO.2016.7783763. ISBN   978-1-5090-3508-3. S2CID   1097472.
  15. Joseph, Doug; Grunwald, Dirk (1997-05-01). "Prefetching using Markov predictors". Proceedings of the 24th Annual International Symposium on Computer Architecture – ISCA 1997. New York, New York, USA: Association for Computing Machinery. pp. 252–263. doi:10.1145/264107.264207. ISBN 978-0-89791-901-2. S2CID 434419.
  16. Collins, J.; Sair, S.; Calder, B.; Tullsen, D.M. (November 2002). Pointer cache assisted prefetching. 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings. pp. 62–73. doi:10.1109/MICRO.2002.1176239. ISBN   0-7695-1859-1. S2CID   5608519.
  17. Jain, Akanksha; Lin, Calvin (2013-12-07). "Linearizing irregular memory accesses for improved correlated prefetching". Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-46. New York, New York, USA: Association for Computing Machinery. pp. 247–259. doi:10.1145/2540708.2540730. ISBN   978-1-4503-2638-4. S2CID   196831.
  18. "Making Temporal Prefetchers Practical: The MISB Prefetcher – Research Articles – Arm Research – Arm Community". community.arm.com. 24 June 2019. Retrieved 2022-03-16.
  19. Kim, Jinchun; Teran, Elvira; Gratz, Paul V.; Jiménez, Daniel A.; Pugsley, Seth H.; Wilkerson, Chris (2017-05-12). "Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy". ACM SIGPLAN Notices. 52 (4): 737–749. doi: 10.1145/3093336.3037701 . ISSN   0362-1340.
  20. Kondguli, Sushant; Huang, Michael (2018-06-02). "Division of labor: a more effective approach to prefetching". Proceedings of the 45th Annual International Symposium on Computer Architecture. ISCA '18. Los Angeles, California: IEEE Press. pp. 83–95. doi:10.1109/ISCA.2018.00018. ISBN   978-1-5386-5984-7. S2CID   50777324.
  21. Pakalapati, Samuel; Panda, Biswabandan (May 2020). Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). pp. 118–131. doi:10.1109/ISCA45697.2020.00021. ISBN   978-1-7281-4661-4. S2CID   218683672.
  22. Callahan, David; Kennedy, Ken; Porterfield, Allan (1991-01-01). Software Prefetching. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. Santa Clara, California, USA: Association for Computing Machinery. pp. 40–52. doi:10.1145/106972.106979. ISBN   978-0897913805.
  23. Lee, Jaekyu; Kim, Hyesoon; Vuduc, Richard (2012). "When Prefetching Works, When It Doesn't, and Why" (PDF). ACM Trans. Archit. Code Optim. 9: 1–29. doi:10.1145/2133382.2133384.