Computational RAM

Computational RAM (C-RAM) is random-access memory with processing elements integrated on the same chip. This enables C-RAM to be used as a SIMD computer. It can also be used to make more efficient use of the memory bandwidth within a memory chip. The general technique of doing computations in memory is called Processing-In-Memory (PIM).
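As a rough illustration of the SIMD idea, the sketch below models a memory array with one processing element per column, all executing the same operation in lockstep. The class `ToyCRAM` and its methods are hypothetical names invented for this example; they do not correspond to any real C-RAM interface.

```python
from typing import Callable, List


class ToyCRAM:
    """Toy model (hypothetical): a memory array with one processing element per column."""

    def __init__(self, rows: int, cols: int) -> None:
        self.cols = cols
        self.cells: List[List[int]] = [[0] * cols for _ in range(rows)]

    def write_row(self, row: int, values: List[int]) -> None:
        if len(values) != self.cols:
            raise ValueError("row width mismatch")
        self.cells[row] = list(values)

    def simd_op(self, op: Callable[[int, int], int],
                src_a: int, src_b: int, dst: int) -> None:
        # One broadcast instruction: every column applies the same operation
        # to its own pair of cells, so no operand leaves the array.
        self.cells[dst] = [op(a, b)
                           for a, b in zip(self.cells[src_a], self.cells[src_b])]


def demo_simd_add() -> List[int]:
    ram = ToyCRAM(rows=3, cols=4)
    ram.write_row(0, [1, 2, 3, 4])
    ram.write_row(1, [10, 20, 30, 40])
    ram.simd_op(lambda a, b: a + b, 0, 1, 2)  # row 2 = row 0 + row 1
    return ram.cells[2]
```

In this model a whole row is processed by a single instruction, which is the sense in which the on-chip bandwidth is used without moving data to an external processor.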

Overview

The most influential implementations of computational RAM came from The Berkeley IRAM Project. Vector IRAM (V-IRAM) combines DRAM with a vector processor integrated on the same chip. [1]

Reconfigurable Architecture DRAM (RADram) is DRAM with reconfigurable computing FPGA logic elements integrated on the same chip. [2] SimpleScalar simulations show that RADram (in a system with a conventional processor) can give orders of magnitude better performance on some problems than traditional DRAM (in a system with the same processor).

Some embarrassingly parallel computational problems are already limited by the von Neumann bottleneck between the CPU and the DRAM. Some researchers expect that, for the same total cost, a machine built from computational RAM will run orders of magnitude faster than a traditional general-purpose computer on these kinds of problems. [3]
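The bottleneck argument can be sketched with a simple bound: a kernel takes at least as long as the slower of moving its data and executing its operations. The function names and all numbers below are illustrative assumptions, not measurements of any real system.

```python
def transfer_time_s(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Time needed just to move the data across the memory interface."""
    return bytes_moved / bandwidth_bytes_per_s


def compute_time_s(operations: float, ops_per_s: float) -> float:
    """Time needed just to execute the operations."""
    return operations / ops_per_s


def kernel_time_s(bytes_moved: float, operations: float,
                  bandwidth_bytes_per_s: float, ops_per_s: float) -> float:
    # Lower bound on run time: the slower of the two resources dominates.
    return max(transfer_time_s(bytes_moved, bandwidth_bytes_per_s),
               compute_time_s(operations, ops_per_s))


def is_memory_bound(bytes_moved: float, operations: float,
                    bandwidth_bytes_per_s: float, ops_per_s: float) -> bool:
    return (transfer_time_s(bytes_moved, bandwidth_bytes_per_s)
            > compute_time_s(operations, ops_per_s))
```

With the assumed values of 8e9 bytes, 1e9 operations, 2e10 bytes/s and 1e11 operations/s, the kernel is memory-bound, and raising only the bandwidth shortens the bound proportionally until computation becomes the limit.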

As of 2011, the "DRAM process" (few layers; optimized for high capacitance) and the "CPU process" (optimized for high frequency; typically twice as many BEOL layers as DRAM) differ substantially. Because each additional layer reduces yield and increases manufacturing cost, chips built on a CPU process are relatively expensive per square millimeter compared to DRAM. The two processes are distinct enough that there are three approaches to computational RAM.

Some CPUs designed to be built on a DRAM process technology (rather than a "CPU" or "logic" process technology specifically optimized for CPUs) include The Berkeley IRAM Project, TOMI Technology [4] [5] and the AT&T DSP1.

Because a memory bus to off-chip memory has many times the capacitance of an on-chip memory bus, a system with separate DRAM and CPU chips can have several times the energy consumption of an IRAM system with the same computer performance. [1]
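A minimal sketch of the capacitance argument, using the standard approximation that each signal transition on a wire dissipates about one half of C times V squared. The capacitance and voltage values are assumptions chosen only for illustration, and the model covers bus switching energy alone, not total system energy.

```python
def switching_energy_joules(capacitance_farads: float, voltage_volts: float,
                            transitions: int) -> float:
    """Approximate dynamic energy: 0.5 * C * V^2 dissipated per transition."""
    return 0.5 * capacitance_farads * voltage_volts ** 2 * transitions


def energy_ratio(off_chip_capacitance: float, on_chip_capacitance: float,
                 voltage_volts: float, transitions: int) -> float:
    # At equal voltage and switching activity the ratio reduces to the
    # ratio of the two capacitances.
    off_chip = switching_energy_joules(off_chip_capacitance, voltage_volts, transitions)
    on_chip = switching_energy_joules(on_chip_capacitance, voltage_volts, transitions)
    return off_chip / on_chip
```

For example, with an assumed 10 pF off-chip line and 1 pF on-chip line at the same voltage, `energy_ratio(10e-12, 1e-12, 1.2, 1000)` is about 10.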

Because computational DRAM is expected to run hotter than traditional DRAM, and increased chip temperatures result in faster charge leakage from the DRAM storage cells, computational DRAM is expected to require more frequent DRAM refresh. [2]
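To show why more frequent refresh matters, the sketch below estimates the fraction of time a DRAM device is unavailable because it is refreshing. The parameter values (8192 refresh commands per retention window, 350 ns per command, and windows of 64 ms and 32 ms) are illustrative assumptions in the range commonly quoted for commodity DRAM, not figures for any computational DRAM design.

```python
def refresh_overhead(refresh_commands: int, command_time_s: float,
                     retention_window_s: float) -> float:
    """Fraction of each retention window spent executing refresh commands."""
    return refresh_commands * command_time_s / retention_window_s


def overhead_at_normal_and_hot() -> tuple:
    # Assumed: a hotter chip leaks faster, so the retention window is halved.
    normal = refresh_overhead(8192, 350e-9, 64e-3)
    hot = refresh_overhead(8192, 350e-9, 32e-3)
    return normal, hot
```

Under these assumptions, halving the retention window doubles the share of time lost to refresh, which is time the memory cannot spend serving reads, writes, or in-memory computation.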

Processor-in-/near-memory

A processor-in-/near-memory (PINM) is a computer processor (CPU) tightly coupled to memory, generally on the same silicon chip.

The chief goal of merging the processing and memory components in this way is to reduce memory latency and increase bandwidth. In addition, reducing the distance that data needs to be moved reduces the power requirements of a system. [6] Much of the complexity (and hence power consumption) in current processors stems from strategies for avoiding memory stalls.

Examples

In the 1980s, a tiny CPU that executed FORTH was fabricated into a DRAM chip to speed up PUSH and POP operations. Because FORTH is a stack-oriented programming language, this improved its efficiency.

The transputer also had a large on-chip memory for a device made in the early 1980s, making it essentially a processor-in-memory.

Notable PIM projects include the Berkeley IRAM project (IRAM) at the University of California, Berkeley [7] and the University of Notre Dame PIM [8] effort.
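To see why a stack-oriented language leans so heavily on PUSH and POP, the sketch below evaluates a tiny FORTH-like subset and counts stack accesses. The function `run_forth` and the supported words are assumptions made for this example; they are not the instruction set of the chip described above.

```python
from typing import List, Tuple


def run_forth(program: str) -> Tuple[List[int], int]:
    """Evaluate integers and the words + - *; return (final stack, stack accesses)."""
    stack: List[int] = []
    accesses = 0
    for word in program.split():
        if word in ("+", "-", "*"):
            b = stack.pop()          # POP second operand
            a = stack.pop()          # POP first operand
            if word == "+":
                result = a + b
            elif word == "-":
                result = a - b
            else:
                result = a * b
            stack.append(result)     # PUSH result
            accesses += 3
        else:
            stack.append(int(word))  # PUSH literal
            accesses += 1
    return stack, accesses
```

In this model every arithmetic word costs three stack accesses, so for a program such as `2 3 + 4 *` nearly all the memory traffic is stack traffic, which is the kind of access the in-DRAM CPU was meant to speed up.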

DRAM-based PIM Taxonomy

DRAM-based near-memory and in-memory designs can be categorized into four groups:

References

  1. Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, et al. "Scalable Processors in the Billion-Transistor Era: IRAM". IEEE Computer (magazine). 1997. Says "Vector IRAM ... can operate as a parallel built-in self-test engine for the memory array, significantly reducing the DRAM testing time and the associated cost."
  2. Mark Oskin, Frederic T. Chong, and Timothy Sherwood. "Active Pages: A Computation Model for Intelligent Memory". 1998.
  3. Daniel J. Bernstein. "Historical notes on mesh routing in NFS". 2002. "programming a computational RAM"
  4. "TOMI the milliwatt microprocessor". [permanent dead link]
  5. Yong-Bin Kim and Tom W. Chen. "Assessing Merged DRAM/Logic Technology". 1998. Archived from the original (PDF) on 2011-07-25. Retrieved 2011-11-27.
  6. "GYRFALCON STARTS SHIPPING AI CHIP". electronics-lab. 2018-10-10. Retrieved 5 December 2018.
  7. IRAM
  8. "PIM". Archived from the original on 2015-11-09. Retrieved 2015-05-26.
  9. Hadi Asghari-Moghaddam, et al., "Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems".
  10. Liu Ke, et al., "RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing".
  11. Dongping Zhang, et al., "TOP-PIM: Throughput-oriented programmable processing in memory".
  12. Sukhan Lee, et al., "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product".
  13. Shuangchen Li, et al., "DRISA: A DRAM-based reconfigurable in-situ accelerator".
  14. Marzieh Lenjani, et al., "Fulcrum: a Simplified Control and Access Mechanism toward Flexible and Practical In-situ Accelerators".
