Memory access pattern

Last updated

In computing, a memory access pattern or IO access pattern is the pattern with which a system or program reads and writes memory on secondary storage. These patterns differ in the level of locality of reference and drastically affect cache performance, [1] and also have implications for the approach to parallelism [2] [3] and distribution of workload in shared memory systems. [4] Further, cache coherency issues can affect multiprocessor performance, [5] which means that certain memory access patterns place a ceiling on parallelism (which manycore approaches seek to break). [6]

Contents

Computer memory is usually described as "random access", but traversals by software will still exhibit patterns that can be exploited for efficiency. Various tools exist to help system designers [7] and programmers understand, analyse and improve the memory access pattern, including VTune and Vectorization Advisor, [8] [9] [10] [11] [12] including tools to address GPU memory access patterns. [13]

Memory access patterns also have implications for security, [14] [15] which motivates some to try and disguise a program's activity for privacy reasons. [16] [17]

Examples

Sequential and Linear patterns are incorrectly drawn as counterparts to each other by some publications; while real-world workloads contain almost innumerable patterns. Random vs sequential access.svg
Sequential and Linear patterns are incorrectly drawn as counterparts to each other by some publications; while real-world workloads contain almost innumerable patterns.

Sequential

The simplest extreme is the sequential access pattern, where data is read, processed, and written out with straightforward incremented/decremented addressing. These access patterns are highly amenable to prefetching.

Strided

Strided or simple 2D, 3D access patterns (e.g., stepping through multi-dimensional arrays) are similarly easy to predict, and are found in implementations of linear algebra algorithms and image processing. Loop tiling is an effective approach. [19] Some systems with DMA provided a strided mode for transferring data between subtile of larger 2D arrays and scratchpad memory. [20]

Linear

A linear access pattern is closely related to "strided", where a memory address may be computed from a linear combination of some index. Stepping through indices sequentially with a linear pattern yields strided access. A linear access pattern for writes (with any access pattern for non-overlapping reads) may guarantee that an algorithm can be parallelised, which is exploited in systems supporting compute kernels.

Nearest neighbor

Nearest neighbor memory access patterns appear in simulation, and are related to sequential or strided patterns. An algorithm may traverse a data structure using information from the nearest neighbors of a data element (in one or more dimensions) to perform a calculation. These are common in physics simulations operating on grids. [21] Nearest neighbor can also refer to inter-node communication in a cluster; physics simulations which rely on such local access patterns can be parallelized with the data partitioned into cluster nodes, with purely nearest-neighbor communication between them, which may have advantages for latency and communication bandwidth. This use case maps well onto torus network topology. [22]

2D spatially coherent

In 3D rendering, access patterns for texture mapping and rasterization of small primitives (with arbitrary distortions of complex surfaces) are far from linear, but can still exhibit spatial locality (e.g., in screen space or texture space). This can be turned into good memory locality via some combination of morton order [23] and tiling for texture maps and frame buffer data (mapping spatial regions onto cache lines), or by sorting primitives via tile based deferred rendering. [24] It can also be advantageous to store matrices in morton order in linear algebra libraries. [25]

Scatter

A scatter memory access pattern combines sequential reads with indexed/random addressing for writes. [26] Compared to gather, It may place less load on a cache hierarchy since a processing element may dispatch writes in a "fire and forget" manner (bypassing a cache altogether), whilst using predictable prefetching (or even DMA) for its source data.

However, it may be harder to parallelise since there is no guarantee the writes do not interact, [27] and many systems are still designed assuming that a hardware cache will coalesce many small writes into larger ones.

In the past, forward texture mapping attempted to handle the randomness with "writes", whilst sequentially reading source texture information.

The PlayStation 2 console used conventional inverse texture mapping, but handled any scatter/gather processing "on-chip" using EDRAM, whilst 3D model (and a lot of texture data) from main memory was fed sequentially by DMA. This is why it lacked support for indexed primitives, and sometimes needed to manage textures "up front" in the display list.

Gather

In a gather memory access pattern, reads are randomly addressed or indexed, whilst the writes are sequential (or linear). [26] An example is found in inverse texture mapping, where data can be written out linearly across scan lines, whilst random access texture addresses are calculated per pixel.

Compared to scatter, the disadvantage is that caching (and bypassing latencies) is now essential for efficient reads of small elements, however it is easier to parallelise since the writes are guaranteed to not overlap. As such the gather approach is more common for gpgpu programming, [27] where the massive threading (enabled by parallelism) is used to hide read latencies. [27]

Combined gather and scatter

An algorithm may gather data from one source, perform some computation in local or on chip memory, and scatter results elsewhere. This is essentially the full operation of a GPU pipeline when performing 3D rendering- gathering indexed vertices and textures, and scattering shaded pixels in screen space. Rasterization of opaque primitives using a depth buffer is "commutative", allowing reordering, which facilitates parallel execution. In the general case synchronisation primitives would be needed.

Random

At the opposite extreme is a truly random memory access pattern. A few multiprocessor systems are specialised to deal with these. [28] The PGAS approach may help by sorting operations by data on the fly (useful when the problem *is* figuring out the locality of unsorted data). [21] Data structures which rely heavily on pointer chasing can often produce poor locality of reference, although sorting can sometimes help. Given a truly random memory access pattern, it may be possible to break it down (including scatter or gather stages, or other intermediate sorting) which may improve the locality overall; this is often a prerequisite for parallelizing.

Approaches

Data-oriented design

Data-oriented design is an approach intended to maximise the locality of reference, by organising data according to how it is traversed in various stages of a program, contrasting with the more common object oriented approach (i.e., organising such that data layout explicitly mirrors the access pattern). [1]

Contrast with locality of reference

Locality of reference refers to a property exhibited by memory access patterns. A programmer will change the memory access pattern (by reworking algorithms) to improve the locality of reference, [29] and/or to increase potential for parallelism. [26] A programmer or system designer may create frameworks or abstractions (e.g., C++ templates or higher-order functions) that encapsulate a specific memory access pattern. [30] [31]

Different considerations for memory access patterns appear in parallelism beyond locality of reference, namely the separation of reads and writes. E.g.: even if the reads and writes are "perfectly" local, it can be impossible to parallelise due to dependencies; separating the reads and writes into separate areas yields a different memory access pattern, maybe initially appear worse in pure locality terms, but desirable to leverage modern parallel hardware. [26]

Locality of reference may also refer to individual variables (e.g., the ability of a compiler to cache them in registers), whilst the term memory access pattern only refers to data held in an indexable memory (especially main memory).

See also

Related Research Articles

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003.

In computing, cache replacement policies are optimizing instructions or algorithms which a computer program or hardware-maintained structure can utilize to manage a cache of information. Caching improves performance by keeping recent or often-used data items in memory locations which are faster, or computationally cheaper to access, than normal memory stores. When the cache is full, the algorithm must choose which items to discard to make room for new data.

In computer science, a parallel random-access machine is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM). In the same way that the RAM is used by sequential-algorithm designers to model algorithmic performance, the PRAM is used by parallel-algorithm designers to model parallel algorithmic performance. Similar to the way in which the RAM model neglects practical issues, such as access time to cache memory versus main memory, the PRAM model neglects such issues as synchronization and communication, but provides any (problem-size-dependent) number of processors. Algorithm cost, for instance, is estimated using two parameters O(time) and O(time × processor_number).

<span class="mw-page-title-main">Linearizability</span> Property of some operation(s) in concurrent programming

In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events, that may be extended by adding response events such that:

  1. The extended list can be re-expressed as a sequential history.
  2. That sequential history is a subset of the original unextended list.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

In computing, external memory algorithms or out-of-core algorithms are algorithms that are designed to process data that are too large to fit into a computer's main memory at once. Such algorithms must be optimized to efficiently fetch and access data stored in slow bulk memory such as hard drives or tape drives, or when memory is on a computer network. External memory algorithms are analyzed in the external memory model.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

A write buffer is a type of data buffer that can be used to hold data being written from the cache to main memory or to the next cache in the memory hierarchy to improve performance and reduce latency. It is used in certain CPU cache architectures like Intel's x86 and AMD64. In multi-core systems, write buffers destroy sequential consistency. Some software disciplines, like C11's data-race-freedom, are sufficient to regain a sequentially consistent view of memory.

<span class="mw-page-title-main">CUDA</span> Parallel computing platform and programming model

In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA API and its runtime: The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism in C and also to specify GPU device specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

<span class="mw-page-title-main">Data parallelism</span> Parallelization across multiple processors in parallel computing environments

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

<span class="mw-page-title-main">Larrabee (microarchitecture)</span> Canceled Intel GPGPU chip

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in the state of Washington. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

Gather/scatter is a type of memory addressing that at once collects (gathers) from, or stores (scatters) data to, multiple, arbitrary indices. Examples of its use include sparse linear algebra operations, sorting algorithms, fast Fourier transforms, and some computational graph theory problems. It is the vector equivalent of register indirect addressing, with gather involving indexed reads, and scatter, indexed writes. Vector processors have hardware support for gather and scatter operations, as do many input/output systems, allowing large data sets to be transferred to main memory more rapidly.

<span class="mw-page-title-main">Log-structured merge-tree</span> Data structure

In computer science, the log-structured merge-tree is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

An Oblivious RAM (ORAM) simulator is a compiler that transforms an algorithm in such a way that the resulting algorithm preserves the input-output behavior of the original algorithm but the distribution of the memory access patterns of the transformed algorithm is independent of the memory access pattern of the original algorithm.

In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler. They may reduce cache pollution, reduce bandwidth requirement, bypass latencies, by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed. Most modern computer processors have fast and local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

<span class="mw-page-title-main">Concurrent hash table</span>

A concurrent hash table or concurrent hash map is an implementation of hash tables allowing concurrent access by multiple threads using a hash function.

References

  1. 1 2 "Introduction to Data-Oriented Design" (PDF). Archived from the original (PDF) on 2019-11-16.
  2. Jang, Byunghyun; Schaa, Dana; Mistry, Perhaad & Kaeli, David (2010-05-27). "Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures". IEEE Transactions on Parallel and Distributed Systems. 22 (1). New York: IEEE: 105–118. doi:10.1109/TPDS.2010.107. eISSN   1558-2183. ISSN   1045-9219. S2CID   15997131. NLM unique id 101212014.
  3. Jeffers, James; Reinders, James; Sodani, Avinash (2016-05-31). Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition (2nd ed.). Morgan Kaufmann. ISBN   9780128091951.
  4. Jana, Siddhartha; Schuchart, Joseph; Chapman, Barbara (2014-10-06). "Analysis of Energy and Performance of PGAS-based Data Access Patterns" (PDF). Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. PGAS '14. New York, NY, USA: Association for Computing Machinery. pp. 1–10. doi:10.1145/2676870.2676882. ISBN   978-1-4503-3247-7.
  5. Marandola, Jussara; Louise, Stéphane; Cudennec, Loïc; Acquaviva, Jean-Thomas; Bader, David (2012-10-11). "Enhancing Cache Coherent Architectures with access patterns for embedded manycore systems". 2012 International Symposium on System on Chip (SoC) (PDF). IEEE. pp. 1–7. doi:10.1109/ISSoC.2012.6376369. ISBN   978-1-4673-2896-8.
  6. "intel terascale" (PDF).
  7. Brown, Mary; Jenevein, Roy M.; Ullah, Nasr (29 November 1998). Memory Access Pattern Analysis. WWC '98: Proceedings of the Workload Characterization: Methodology and Case Studies (published 1998-11-29). p. 105. ISBN   9780769504506.
  8. Ostadzadeh, S. Arash; Meeuws, Roel J.; Galuzzi, Carlo; Bertels, Koen (2010). "QUAD – A Memory Access Pattern Analyser" (PDF). In Sirisuk, Phaophak; Morgan, Fearghal; El-Ghazawi, Tarek; Amano, Hideharu (eds.). Reconfigurable Computing: Architectures, Tools and Applications. Lecture Notes in Computer Science. Vol. 5992. Berlin, Heidelberg: Springer. pp. 269–281. doi:10.1007/978-3-642-12133-3_25. ISBN   978-3-642-12133-3.
  9. Che, Shuai; Sheaffer, Jeremy W.; Skadron, Kevin (2011-11-12). "Dymaxion: Optimizing memory access patterns for heterogeneous systems" (PDF). Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. SC '11. New York, NY, USA: Association for Computing Machinery. pp. 1–11. doi:10.1145/2063384.2063401. ISBN   978-1-4503-0771-0.
  10. Harrison, Luddy (1996-01-01). "Examination of a memory access classification scheme for pointer-intensive and numeric programs". Proceedings of the 10th international conference on Supercomputing - ICS '96. New York, NY, USA: Association for Computing Machinery. pp. 133–140. doi:10.1145/237578.237595. ISBN   978-0-89791-803-9.
  11. Matsubara, Yuki; Sato, Yukinori (2014). "Online Memory Access Pattern Analysis on an Application Profiling Tool". 2014 Second International Symposium on Computing and Networking. pp. 602–604. doi:10.1109/CANDAR.2014.86. ISBN   978-1-4799-4152-0. S2CID   16476418.
  12. "Putting Your Data and Code in Order: Data and layout".
  13. Kim, Yooseong; Shrivastava, Aviral (2011-06-05). "CuMAPz: A tool to analyze memory access patterns in CUDA". Proceedings of the 48th Design Automation Conference. DAC '11. New York, NY, USA: Association for Computing Machinery. pp. 128–133. doi:10.1145/2024724.2024754. ISBN   978-1-4503-0636-2.
  14. Kim, Yooseong; Shrivastava, Aviral (2011-06-05). "CuMAPz: A tool to analyze memory access patterns in CUDA". Proceedings of the 48th Design Automation Conference. DAC '11. New York, NY, USA: Association for Computing Machinery. pp. 128–133. doi:10.1145/2024724.2024754. ISBN   978-1-4503-0636-2.
  15. Canteaut, Anne; Lauradoux, Cédric; Seznec, André (2006). Understanding cache attacks (report thesis). INRIA. ISSN   0249-6399.
  16. Hardesty, Larry (2013-07-02). "Protecting data in the cloud". MIT News.
  17. Rossi, Ben (2013-09-24). "Boosting cloud security with oblivious RAM". Information Age.
  18. Chuck Paridon. "Storage Performance Benchmarking Guidelines - Part I: Workload Design" (PDF). In practice, IO access patterns are as numerous as the stars
  19. Kennedy, Ken; McKinley, Kathryn S. (1992-08-01). "Optimizing for parallelism and data locality" (PDF). Proceedings of the 6th international conference on Supercomputing - ICS '92. New York, NY, USA: Association for Computing Machinery. pp. 323–334. doi:10.1145/143369.143427. ISBN   978-0-89791-485-7.
  20. Saidi, Selma; Tendulkar, P.; Lepley, Thierry; Maler, O. (2012). "Optimal 2D Data Partitioning for DMA Transfers on MPSoCs" (PDF). 2012 15th Euromicro Conference on Digital System Design. IEEE. pp. 584–591. doi:10.1109/DSD.2012.99. ISBN   978-0-7695-4798-5.
  21. 1 2 CITRIS and the Banatao Institute (2013-09-05). Partitioned Global Address Space Programming - Kathy Yelick . Retrieved 2024-11-02 via YouTube. covers cases where PGAS is a win, where data may not be already sorted, e.g., dealing with complex graphs - see "science across the irregularity spectrum".
  22. Weinberg, Jonathan; McCracken, Michael O.; Snavely, Allan; Strohmaier, Erich (12–18 November 2005). "Quantifying Locality in the Memory Access Patterns of HPC Applications" (PDF). ACM/IEEE SC 2005 Conference (SC'05). Seattle, WA, USA: IEEE. p. 50. doi:10.1109/SC.2005.59. ISBN   1-59593-061-2. Archived from the original (PDF) on 2016-08-03. mentions nearest neighbor access patterns in clusters
  23. Hakura, Ziyad S.; Gupta, Anoop (1997-05-01). "The design and analysis of a cache architecture for texture mapping" (PDF). Proceedings of the 24th annual international symposium on Computer architecture. ISCA '97. New York, NY, USA: Association for Computing Machinery. pp. 108–120. doi:10.1145/264107.264152. ISBN   978-0-89791-901-2.
  24. Nocentino, Anthony E.; Rhodes, Philip J. (2010-04-15). "Optimizing memory access on GPUs using morton order indexing" (PDF). Proceedings of the 48th Annual Southeast Regional Conference. ACMSE '10. New York, NY, USA: Association for Computing Machinery. pp. 1–4. doi:10.1145/1900008.1900035. ISBN   978-1-4503-0064-3. Archived from the original (PDF) on 2022-12-08.
  25. Wise, David S.; Frens, Jeremy D. (1999). "Morton-order Matrices Deserve Compilers ' Support Technical Report 533". S2CID   17192354.{{cite journal}}: Cite journal requires |journal= (help)
  26. 1 2 3 4 Harris, Mark (April 2005). "GPU Gems 2". 31.1.3 Stream Communication: Gather vs. Scatter. Archived from the original on 2016-06-14. Retrieved 2016-06-13.
  27. 1 2 3 GPU gems. Elsevier. 2011-01-13. ISBN   9780123849892.deals with "scatter memory access patterns" and "gather memory access patterns" in the text
  28. Wichmann, Nathan (2005). Cray and HPCC : Benchmark Developments and Results from the Past Year (PDF). CUG 2005 Proceedings. see global random access results for Cray X1. vector architecture for hiding latencies, not so sensitive to cache coherency
  29. "optimize-data-structures-and-memory-access-patterns-to-improve-data-locality".
  30. "Template-based Memory Access Engine for Accelerators in SoCs" (PDF).
  31. "Multi-Target Vectorization With MTPS C++ Generic Library" (PDF).a C++ template library for producing optimised memory access patterns