Scratchpad memory

Last updated March 02, 2024

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is an internal memory, usually high-speed, used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor (or CPU), scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc. When the scratchpad is a hidden portion of the main memory then it is sometimes referred to as bump storage.

In some systems^{[lower-alpha 1]} it can be considered similar to the L1 cache in that it is the next closest memory to the ALU after the processor registers, with explicit instructions to move data to and from main memory, often using DMA-based data transfer.^[1] In contrast to a system that uses caches, a system with scratchpads is a system with non-uniform memory access (NUMA) latencies, because the memory access latencies to the different scratchpads and the main memory vary. Another difference from a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in the main memory.

Scratchpads are employed for simplification of caching logic, and to guarantee a unit can work without main memory contention in a system employing multiple processors, especially in multiprocessor system-on-chip for embedded systems. They are mostly suited for storing temporary results (as it would be found in the CPU stack) that typically wouldn't need to always be committing to the main memory; however when fed by DMA, they can also be used in place of a cache for mirroring the state of slower main memory. The same issues of locality of reference apply in relation to efficiency of use; although some systems allow strided DMA to access rectangular data sets. Another difference is that scratchpads are explicitly manipulated by applications. They may be useful for realtime applications, where predictable timing is hindered by cache behavior.

Scratchpads are not used in mainstream desktop processors where generality is required for legacy software to run from generation to generation, in which the available on-chip memory size may change. They are better implemented in embedded systems, special-purpose processors and game consoles, where chips are often manufactured as MPSoC, and where software is often tuned to one hardware configuration.

Examples of use

Fairchild F8 of 1975 contained 64 bytes of scratchpad.
The TI-99/4A has 256 bytes of scratchpad memory on the 16-bit bus containing the processor registers of the TMS9900 ^[2]
Cyrix 6x86 is the only x86-compatible desktop processor to incorporate a dedicated scratchpad.
SuperH, used in Sega's consoles, could lock cachelines to an address outside of main memory for use as a scratchpad.
Sony's PS1's R3000 had a scratchpad instead of an L1 cache. It was possible to place the CPU stack here, an example of the temporary workspace usage.
Adapteva's Epiphany parallel coprocessor features local-stores for each core, connected by a network on a chip, with DMA possible between them and off-chip links (possibly to DRAM). The architecture is similar to Sony's Cell, except all cores can directly address each other's scratchpads, generating network messages from standard load/store instructions.
Sony's PS2 Emotion Engine includes a 16 KB scratchpad, to and from which DMA transfers could be issued to its GS, and main memory.
Cell's SPEs are restricted purely to working in their "local-store", relying on DMA for transfers from/to main memory and between local stores, much like a scratchpad. In this regard, additional benefit is derived from the lack of hardware to check and update coherence between multiple caches: the design takes advantage of the assumption that each processor's workspace is separate and private. It is expected this benefit will become more noticeable as the number of processors scales into the "many-core" future. Yet because of the elimination of some hardware logics, the data and instructions of applications on SPEs must be managed through software if the whole task on SPE can not fit in local store.^[3]^[4]^[5]
Many other processors allow L1 cache lines to be locked.
Most digital signal processors use a scratchpad. Many past 3D accelerators and game consoles (including the PS2) have used DSPs for vertex transformations. This differs from the stream-based approach of modern GPUs which have more in common with a CPU cache's functions.
NVIDIA's 8800 GPU running under CUDA provides 16 KB of scratchpad (NVIDIA calls it Shared Memory) per thread-bundle when being used for GPGPU tasks. Scratchpad also was used in later Fermi GPU (GeForce 400 series).^[6]
Ageia's PhysX chip includes a scratchpad RAM in a manner similar to the Cell; the theory of this specific physics processing unit is that a cache hierarchy is of less use than software managed physics and collision calculations. These memories are also banked and a switch manages transfers between them.
Intel's Knights Landing processor has a 16 GB MCDRAM that can be configured as either a cache, scratchpad memory, or divided into some cache and some scratchpad memory.
Movidius Myriad 2, a vision processing unit, organized as a multicore architecture with a large multiported shared scratchpad.
Graphcore has designed an AI accelerator based on scratchpad memories^[7]

Alternatives

Cache control vs scratchpads

Some architectures such as PowerPC attempt to avoid the need for cacheline locking or scratchpads through the use of cache control instructions. Marking an area of memory with "Data Cache Block: Zero" (allocating a line but setting its contents to zero instead of loading from main memory) and discarding it after use ('Data Cache Block: Invalidate', signaling that main memory didn't receive any updated data) the cache is made to behave as a scratchpad. Generality is maintained in that these are hints and the underlying hardware will function correctly regardless of actual cache size.

Shared L2 vs Cell local stores

Regarding interprocessor communication in a multicore setup, there are similarities between the Cell's inter-localstore DMA and a shared L2 cache setup as in the Intel Core 2 Duo or the Xbox 360's custom powerPC: the L2 cache allows processors to share results without those results having to be committed to main memory. This can be an advantage where the working set for an algorithm encompasses the entirety of the L2 cache. However, when a program is written to take advantage of inter-localstore DMA, the Cell has the benefit of each-other-Local-Store serving the purpose of BOTH the private workspace for a single processor AND the point of sharing between processors; i.e., the other Local Stores are on a similar footing viewed from one processor as the shared L2 cache in a conventional chip. The tradeoff is that of memory wasted in buffering and programming complexity for synchronization, though this would be similar to precached pages in a conventional chip. Domains where using this capability is effective include:

Pipeline processing (where one achieves the same effect as increasing the L1 cache's size by splitting one job into smaller chunks)
Extending the working set, e.g., a sweet spot for a merge sort where the data fits within 8×256 KB
Shared code uploading, like loading a piece of code to one SPU, then copy it from there to the others to avoid hitting the main memory again

It would be possible for a conventional processor to gain similar advantages with cache-control instructions, for example, allowing the prefetching to the L1 bypassing the L2, or an eviction hint that signaled a transfer from L1 to L2 but not committing to main memory; however, at present no systems offer this capability in a usable form and such instructions in effect should mirror explicit transfer of data among cache areas used by each core.

Notes

↑ Some older systems used a hidden part of main storage, referred to as bump storage, as scratchpad. In other systems, e.g., UNIVAC 1107, all addressable registers were held in scratchpad.

Related Research Articles

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).

Opteron is AMD's x86 former server and workstation processor line, and was the first processor which supported the AMD64 instruction set architecture. It was released on April 22, 2003, with the SledgeHammer core (K8) and was intended to compete in the server and workstation markets, particularly in the same segment as the Intel Xeon processor. Processors based on the AMD K10 microarchitecture were announced on September 10, 2007, featuring a new quad-core configuration. The last released Opteron CPUs are the Piledriver-based Opteron 4300 and 6300 series processors, codenamed "Seoul" and "Abu Dhabi" respectively.

Cell is a 64-bit multi-core microprocessor microarchitecture that combines a general-purpose PowerPC core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

The Blackfin is a family of 16-/32-bit microprocessors developed, manufactured and marketed by Analog Devices. The processors have built-in, fixed-point digital signal processor (DSP) functionality performed by 16-bit multiply–accumulates (MACs), accompanied on-chip by a microcontroller. It was designed for a unified low-power processor architecture that can run operating systems while simultaneously handling complex numeric tasks such as real-time H.264 video encoding.

The Emotion Engine is a central processing unit developed and manufactured by Sony Computer Entertainment and Toshiba for use in the PlayStation 2 video game console. It was also used in early PlayStation 3 models sold in Japan and North America to provide PlayStation 2 game support. Mass production of the Emotion Engine began in 1999 and ended in late 2012 with the discontinuation of the PlayStation 2.

A physics processing unit (PPU) is a dedicated microprocessor designed to handle the calculations of physics, especially in the physics engine of video games. It is an example of hardware acceleration.

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in Whatcom County, Washington, near the town of Bellingham. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

<span class="mw-page-title-main">TILE64</span>

TILE64 is a VLIW ISA multicore processor manufactured by Tilera. It consists of a mesh network of 64 "tiles", where each tile houses a general purpose processor, cache, and a non-blocking router, which the tile uses to communicate with the other tiles on the processor.

The Vortex86 is a computing system-on-a-chip (SoC) based on a core compatible with the x86 microprocessor family. It is produced by DM&P Electronics, but originated with Rise Technology.

QorIQ is a brand of ARM-based and Power ISA–based communications microprocessors from NXP Semiconductors. It is the evolutionary step from the PowerQUICC platform, and initial products were built around one or more e500mc cores and came in five different product platforms, P1, P2, P3, P4, and P5, segmented by performance and functionality. The platform keeps software compatibility with older PowerPC products such as the PowerQUICC platform. In 2012 Freescale announced ARM-based QorIQ offerings beginning in 2013.

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.

This glossary of computer hardware terms is a list of definitions of terms and concepts related to computer hardware, i.e. the physical and structural components of computers, architectural issues, and peripheral devices.

<span class="mw-page-title-main">TILEPro64</span>

TILEPro64 is a VLIW ISA multicore processor manufactured by Tilera. It consists of a cache-coherent mesh network of 64 "tiles", where each tile houses a general purpose processor, cache, and a non-blocking router, which the tile uses to communicate with the other tiles on the processor.

Project Denver is the codename of a central processing unit designed by Nvidia that implements the ARMv8-A 64/32-bit instruction sets using a combination of simple hardware decoder and software-based binary translation where "Denver's binary translation layer runs in software, at a lower level than the operating system, and stores commonly accessed, already optimized code sequences in a 128 MB cache stored in main memory". Denver is a very wide in-order superscalar pipeline. Its design makes it suitable for integration with other SIPs cores into one die constituting a system on a chip (SoC).

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and GeForce 500 series. All desktop Fermi GPUs were manufactured in 40nm, mobile Fermi GPUs in 40nm and 28nm. Fermi is the oldest microarchitecture from NVIDIA that received support for Microsoft's rendering API Direct3D 12 feature_level 11.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

Cache hierarchy, or multi-level cache, is a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores.

References

↑ Steinke, Stefan; Lars Wehmeyer; Bo-Sik Lee; Peter Marwedel (2002). "Assigning Program and Data Objects to Scratchpad for Energy Reduction" (PDF). University of Dortmund. Retrieved 3 October 2013.: "3.2 Scratchpad model .. The scratchpad memory uses software to control the location assignment of data."
↑ "The TI-99/4A internal architecture". www.unige.ch. Retrieved 2023-03-08.
↑ J. Lu, K. Bai, A. Shrivastava, "SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs)", Design Automation Conference (DAC), June 2–6, 2013
↑ K. Bai, A. Shrivastava, "Automatic and Efficient Heap Data Management for Limited Local Memory Multicore Architectures", Design Automation and Test in Europe (DATE), 2013
↑ K. Bai, J. Lu, A. Shrivastava, B. Holton, "CMSM: An Efficient and Effective Code Management for Software Managed Multicores", CODES+ISSS, 2013
↑ Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved 3 October 2013.
↑ Jia, Zhe; Tillman, Blake; Maggioni, Marco; Scarpazza, Daniele P. (December 7, 2019). Dissecting the Graphcore IPU Architecture via Microbenchmarking (PDF) (Technical report). Citadel Enterprise Americas, LLC. arXiv: 1912.03413 .

External links

Rajeshwari Banakar, Scratchpad Memory : A Design Alternative for Cache. On-chip memory in Embedded Systems // CODES'02. May 6–8, 2002

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Some older systems used a hidden part of main storage, referred to as bump storage, as scratchpad. In other systems, e.g., UNIVAC 1107, all addressable registers were held in scratchpad.

[2] Steinke, Stefan; Lars Wehmeyer; Bo-Sik Lee; Peter Marwedel (2002). "Assigning Program and Data Objects to Scratchpad for Energy Reduction" (PDF). University of Dortmund. Retrieved 3 October 2013.: "3.2 Scratchpad model .. The scratchpad memory uses software to control the location assignment of data."

[3] "The TI-99/4A internal architecture". www.unige.ch. Retrieved 2023-03-08.

[4] J. Lu, K. Bai, A. Shrivastava, "SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs)", Design Automation Conference (DAC), June 2–6, 2013

[5] K. Bai, A. Shrivastava, "Automatic and Efficient Heap Data Management for Limited Local Memory Multicore Architectures", Design Automation and Test in Europe (DATE), 2013

[6] K. Bai, J. Lu, A. Shrivastava, B. Holton, "CMSM: An Efficient and Effective Code Management for Software Managed Multicores", CODES+ISSS, 2013

[7] Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved 3 October 2013.

[8] Jia, Zhe; Tillman, Blake; Maggioni, Marco; Scarpazza, Daniele P. (December 7, 2019). Dissecting the Graphcore IPU Architecture via Microbenchmarking (PDF) (Technical report). Citadel Enterprise Americas, LLC. arXiv: 1912.03413 .

[lower-alpha 1]

[1]

[2]

[3]

[4]

[5]

[6]

[7]