In computing, a memory access pattern or IO access pattern is the pattern with which a system or program reads and writes memory on secondary storage. These patterns differ in the level of locality of reference and drastically affect cache performance, [1] and also have implications for the approach to parallelism [2] [3] and distribution of workload in shared memory systems. [4] Further, cache coherency issues can affect multiprocessor performance, [5] which means that certain memory access patterns place a ceiling on parallelism (which manycore approaches seek to break). [6]
Computer memory is usually described as "random access", but traversals by software will still exhibit patterns that can be exploited for efficiency. Various tools exist to help system designers [7] and programmers understand, analyse and improve the memory access pattern, including VTune and Vectorization Advisor, [8] [9] [10] [11] [12] including tools to address GPU memory access patterns [13]
Memory access patterns also have implications for security, [14] [15] which motivates some to try and disguise a program's activity for privacy reasons. [16] [17]
The simplest extreme is the sequential access pattern, where data is read, processed, and written out with straightforward incremented/decremented addressing. These access patterns are highly amenable to prefetching.
Strided or simple 2D, 3D access patterns (e.g., stepping through multi-dimensional arrays) are similarly easy to predict, and are found in implementations of linear algebra algorithms and image processing. Loop tiling is an effective approach. [19] Some systems with DMA provided a strided mode for transferring data between subtile of larger 2D arrays and scratchpad memory. [20]
A linear access pattern is closely related to "strided", where a memory address may be computed from a linear combination of some index. Stepping through indices sequentially with a linear pattern yields strided access. A linear access pattern for writes (with any access pattern for non-overlapping reads) may guarantee that an algorithm can be parallelised, which is exploited in systems supporting compute kernels.
Nearest neighbor memory access patterns appear in simulation, and are related to sequential or strided patterns. An algorithm may traverse a data structure using information from the nearest neighbors of a data element (in one or more dimensions) to perform a calculation. These are common in physics simulations operating on grids. [21] Nearest neighbor can also refer to inter-node communication in a cluster; physics simulations which rely on such local access patterns can be parallelized with the data partitioned into cluster nodes, with purely nearest-neighbor communication between them, which may have advantages for latency and communication bandwidth. This use case maps well onto torus network topology. [22]
In 3D rendering, access patterns for texture mapping and rasterization of small primitives (with arbitrary distortions of complex surfaces) are far from linear, but can still exhibit spatial locality (e.g., in screen space or texture space). This can be turned into good memory locality via some combination of morton order [23] and tiling for texture maps and frame buffer data (mapping spatial regions onto cache lines), or by sorting primitives via tile based deferred rendering. [24] It can also be advantageous to store matrices in morton order in linear algebra libraries. [25]
A scatter memory access pattern combines sequential reads with indexed/random addressing for writes. [26] Compared to gather, It may place less load on a cache hierarchy since a processing element may dispatch writes in a "fire and forget" manner (bypassing a cache altogether), whilst using predictable prefetching (or even DMA) for its source data.
However, it may be harder to parallelise since there is no guarantee the writes do not interact, [27] and many systems are still designed assuming that a hardware cache will coalesce many small writes into larger ones.
In the past, forward texture mapping attempted to handle the randomness with "writes", whilst sequentially reading source texture information.
The PlayStation 2 console used conventional inverse texture mapping, but handled any scatter/gather processing "on-chip" using EDRAM, whilst 3D model (and a lot of texture data) from main memory was fed sequentially by DMA. This is why it lacked support for indexed primitives, and sometimes needed to manage textures "up front" in the display list.
In a gather memory access pattern, reads are randomly addressed or indexed, whilst the writes are sequential (or linear). [26] An example is found in inverse texture mapping, where data can be written out linearly across scan lines, whilst random access texture addresses are calculated per pixel.
Compared to scatter, the disadvantage is that caching (and bypassing latencies) is now essential for efficient reads of small elements, however it is easier to parallelise since the writes are guaranteed to not overlap. As such the gather approach is more common for gpgpu programming, [27] where the massive threading (enabled by parallelism) is used to hide read latencies. [27]
An algorithm may gather data from one source, perform some computation in local or on chip memory, and scatter results elsewhere. This is essentially the full operation of a GPU pipeline when performing 3D rendering- gathering indexed vertices and textures, and scattering shaded pixels in screen space. Rasterization of opaque primitives using a depth buffer is "commutative", allowing reordering, which facilitates parallel execution. In the general case synchronisation primitives would be needed.
At the opposite extreme is a truly random memory access pattern. A few multiprocessor systems are specialised to deal with these. [28] The PGAS approach may help by sorting operations by data on the fly (useful when the problem *is* figuring out the locality of unsorted data). [21] Data structures which rely heavily on pointer chasing can often produce poor locality of reference, although sorting can sometimes help. Given a truly random memory access pattern, it may be possible to break it down (including scatter or gather stages, or other intermediate sorting) which may improve the locality overall; this is often a prerequisite for parallelizing.
Data-oriented design is an approach intended to maximise the locality of reference, by organising data according to how it is traversed in various stages of a program, contrasting with the more common object oriented approach (i.e., organising such that data layout explicitly mirrors the access pattern). [29]
Locality of reference refers to a property exhibited by memory access patterns. A programmer will change the memory access pattern (by reworking algorithms) to improve the locality of reference, [30] and/or to increase potential for parallelism. [26] A programmer or system designer may create frameworks or abstractions (e.g., C++ templates or higher-order functions) that encapsulate a specific memory access pattern. [31] [32]
Different considerations for memory access patterns appear in parallelism beyond locality of reference, namely the separation of reads and writes. E.g.: even if the reads and writes are "perfectly" local, it can be impossible to parallelise due to dependencies; separating the reads and writes into separate areas yields a different memory access pattern, maybe initially appear worse in pure locality terms, but desirable to leverage modern parallel hardware. [26]
Locality of reference may also refer to individual variables (e.g., the ability of a compiler to cache them in registers), whilst the term memory access pattern only refers to data held in an indexable memory (especially main memory).
Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.
In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.
Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).
In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
Texture mapping is a method for mapping a texture on a computer-generated graphic. Texture here can be high frequency detail, surface texture, or color.
In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.
Memory-mapped I/O (MMIO) and port-mapped I/O (PMIO) are two complementary methods of performing input/output (I/O) between the central processing unit (CPU) and peripheral devices in a computer. An alternative approach is using dedicated I/O processors, commonly known as channels on mainframe computers, which execute their own instructions.
In computer data storage, data striping is the technique of segmenting logically sequential data, such as a file, so that consecutive segments are stored on different physical storage devices.
A page table is the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses. Virtual addresses are used by the program executed by the accessing process, while physical addresses are used by the hardware, or more specifically, by the random-access memory (RAM) subsystem. The page table is a key component of virtual address translation that is necessary to access data in memory.
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.
In computer science, a parallel random-access machine is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM). In the same way that the RAM is used by sequential-algorithm designers to model algorithmic performance, the PRAM is used by parallel-algorithm designers to model parallel algorithmic performance. Similar to the way in which the RAM model neglects practical issues, such as access time to cache memory versus main memory, the PRAM model neglects such issues as synchronization and communication, but provides any (problem-size-dependent) number of processors. Algorithm cost, for instance, is estimated using two parameters O(time) and O(time × processor_number).
The Write Anywhere File Layout (WAFL) is a proprietary file system that supports large, high-performance RAID arrays, quick restarts without lengthy consistency checks in the event of a crash or power failure, and growing the filesystems size quickly. It was designed by NetApp for use in its storage appliances like NetApp FAS, AFF, Cloud Volumes ONTAP and ONTAP Select.
General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
In mathematical analysis and computer science, functions which are Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It is named in France after Henri Lebesgue, who studied it in 1904, and named in the United States after Guy Macdonald Morton, who first applied the order to file sequencing in 1966. The z-value of a point in multidimensions is simply calculated by interleaving the binary representations of its coordinate values. Once the data are sorted into this ordering, any one-dimensional data structure can be used, such as simple one dimensional arrays, binary search trees, B-trees, skip lists or hash tables. The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree.
In computer science, stream processing is a programming paradigm which views data streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.
In computing, interleaved memory is a design which compensates for the relatively slow speed of dynamic random-access memory (DRAM) or core memory, by spreading memory addresses evenly across memory banks. That way, contiguous memory reads and writes use each memory bank in turn, resulting in higher memory throughput due to reduced waiting for memory banks to become ready for the operations.
This glossary of computer hardware terms is a list of definitions of terms and concepts related to computer hardware, i.e. the physical and structural components of computers, architectural issues, and peripheral devices.
The PlayStation 2 technical specifications describe the various components of the PlayStation 2 (PS2) video game console.
In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler. They may reduce cache pollution, reduce bandwidth requirement, bypass latencies, by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.
{{cite journal}}
: Cite journal requires |journal=
(help)In practice, IO access patterns are as numerous as the stars