Computational RAM

Computational RAM (C-RAM) is random-access memory with processing elements integrated on the same chip. This allows C-RAM to be used as a SIMD computer and to make more efficient use of the memory bandwidth available within a memory chip. The general technique of doing computations in memory is called Processing-In-Memory (PIM).
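As a rough illustration of the SIMD idea, the toy model below gives every memory row its own processing element and applies one broadcast operation to all rows in a single step. It is compared with a host loop that moves every word across a bus. The names `ToyCRAM`, `broadcast` and `host_loop` are hypothetical and do not describe any real C-RAM interface; this is only a sketch of the concept.

```python
from typing import Callable, List, Tuple


class ToyCRAM:
    """Toy model (hypothetical): one processing element (PE) per memory row."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = list(rows)

    def broadcast(self, op: Callable[[int], int]) -> None:
        # SIMD step: every PE applies the same operation to its own row.
        # In this model no data leaves the memory chip.
        self.rows = [op(value) for value in self.rows]


def host_loop(rows: List[int], op: Callable[[int], int]) -> Tuple[List[int], int]:
    """Conventional model: the CPU reads each word over the bus and writes it back."""
    bus_transfers = 0
    result = []
    for value in rows:
        bus_transfers += 1          # read word from memory
        result.append(op(value))
        bus_transfers += 1          # write result back to memory
    return result, bus_transfers


if __name__ == "__main__":
    cram = ToyCRAM([1, 2, 3, 4])
    cram.broadcast(lambda v: v * 2)
    print(cram.rows)
    print(host_loop([1, 2, 3, 4], lambda v: v * 2))
```

In the toy model the host loop needs two bus transfers per word, while the in-memory version needs none, which is the bandwidth saving the paragraph above refers to.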

Overview

The most influential implementations of computational RAM came from The Berkeley IRAM Project. Vector IRAM (V-IRAM) combines DRAM with a vector processor integrated on the same chip. [1]

Reconfigurable Architecture DRAM (RADram) is DRAM with reconfigurable computing FPGA logic elements integrated on the same chip. [2] SimpleScalar simulations show that RADram (in a system with a conventional processor) can give orders of magnitude better performance on some problems than traditional DRAM (in a system with the same processor).

Some embarrassingly parallel computational problems are already limited by the von Neumann bottleneck between the CPU and the DRAM. Some researchers expect that, for the same total cost, a machine built from computational RAM will run orders of magnitude faster than a traditional general-purpose computer on these kinds of problems. [3]
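One simple way to see when a problem is limited by the memory link rather than by the processor is to compare the time needed to move the data with the time needed to compute on it. The sketch below assumes a simplified model in which the runtime is the larger of the two; the function name and all numbers are hypothetical and are not taken from the cited sources.

```python
def estimated_runtime_s(bytes_moved: float,
                        operations: float,
                        bandwidth_bytes_per_s: float,
                        peak_ops_per_s: float) -> float:
    """Assumed model: runtime is bounded by the slower of data movement and compute."""
    memory_time = bytes_moved / bandwidth_bytes_per_s
    compute_time = operations / peak_ops_per_s
    return max(memory_time, compute_time)


def is_memory_bound(bytes_moved: float,
                    operations: float,
                    bandwidth_bytes_per_s: float,
                    peak_ops_per_s: float) -> bool:
    """True when moving the data takes longer than computing on it."""
    return bytes_moved / bandwidth_bytes_per_s > operations / peak_ops_per_s


if __name__ == "__main__":
    # Hypothetical workload: 8 GB streamed once, one operation per 8-byte word.
    args = (8e9, 1e9, 1e10, 1e11)
    print(estimated_runtime_s(*args), is_memory_bound(*args))
```

Under this model, raising the effective bandwidth (for example by computing inside the memory chip) shortens the runtime of a memory-bound problem, while a faster processor alone does not.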

As of 2011, the "DRAM process" (few layers; optimized for high capacitance) and the "CPU process" (optimized for high frequency; typically twice as many BEOL layers as DRAM) are distinct. Because each additional layer reduces yield and increases manufacturing cost, chips built on a CPU process are relatively expensive per square millimeter compared to DRAM. The two processes differ enough that there are three approaches to computational RAM:

Some CPUs designed to be built on a DRAM process technology (rather than a "CPU" or "logic" process technology specifically optimized for CPUs) include The Berkeley IRAM Project, TOMI Technology [4] [5] and the AT&T DSP1.

Because a memory bus to off-chip memory has many times the capacitance of an on-chip memory bus, a system with separate DRAM and CPU chips can have several times the energy consumption of an IRAM system with the same computer performance. [1]
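The link between bus capacitance and energy can be sketched with the standard expression for the energy dissipated when a capacitance C is charged to a voltage V, E = ½CV². At the same voltage, the energy per transition scales directly with capacitance. The capacitance and voltage values below are hypothetical placeholders, not measurements from the cited paper.

```python
def switching_energy_joules(capacitance_farads: float, voltage_volts: float) -> float:
    """Energy dissipated per signal transition: E = 1/2 * C * V^2."""
    return 0.5 * capacitance_farads * voltage_volts ** 2


def bus_energy_ratio(off_chip_c: float, on_chip_c: float, voltage_volts: float) -> float:
    """Ratio of off-chip to on-chip switching energy at the same voltage."""
    return (switching_energy_joules(off_chip_c, voltage_volts)
            / switching_energy_joules(on_chip_c, voltage_volts))


if __name__ == "__main__":
    # Hypothetical values: 10 pF off-chip line, 0.5 pF on-chip line, 1.8 V swing.
    print(bus_energy_ratio(10e-12, 0.5e-12, 1.8))
```

This covers only the energy spent driving the bus. Total system energy also includes the processor and the memory arrays, which is one reason the overall saving can be smaller than the capacitance ratio alone would suggest.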

Because computational DRAM is expected to run hotter than traditional DRAM, and increased chip temperatures result in faster charge leakage from the DRAM storage cells, computational DRAM is expected to require more frequent DRAM refresh. [2]
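The effect of temperature on refresh can be illustrated with a simple assumed model in which cell retention time halves for every fixed temperature step above a reference point. The 64 ms baseline, the 85 °C reference and the 10 °C halving step are illustrative assumptions, not figures from the cited source.

```python
def refresh_interval_ms(temp_c: float,
                        base_interval_ms: float = 64.0,
                        base_temp_c: float = 85.0,
                        halving_step_c: float = 10.0) -> float:
    """Assumed model: the allowed refresh interval halves for every
    `halving_step_c` degrees above `base_temp_c`."""
    if temp_c <= base_temp_c:
        return base_interval_ms
    steps_above_base = (temp_c - base_temp_c) / halving_step_c
    return base_interval_ms / 2 ** steps_above_base


if __name__ == "__main__":
    for temperature in (75, 85, 95, 105):
        print(temperature, refresh_interval_ms(temperature))
```

Under these assumptions a chip that runs 10 °C hotter must refresh every row twice as often, which takes memory cycles away from useful work.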

Processor-in-/near-memory

A processor-in-/near-memory (PINM) is a computer processor (CPU) tightly coupled to memory, generally on the same silicon chip.

The chief goal of merging the processing and memory components in this way is to reduce memory latency and increase bandwidth. Alternatively, reducing the distance that data needs to be moved reduces the power requirements of a system. [6] Much of the complexity (and hence power consumption) in current processors stems from strategies for avoiding memory stalls.

Examples

In the 1980s, a tiny CPU that executed FORTH was fabricated into a DRAM chip to speed up PUSH and POP operations. Because FORTH is a stack-oriented programming language, this improved its efficiency.

The transputer, made in the early 1980s, also had a large on-chip memory for its time, making it essentially a processor-in-memory.

Notable PIM projects include the Berkeley IRAM project (IRAM) at the University of California, Berkeley [7] and the University of Notre Dame PIM [8] effort.

DRAM-based PIM Taxonomy

DRAM-based near-memory and in-memory designs can be categorized into four groups:

References

  1. Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, et al. "Scalable Processors in the Billion-Transistor Era: IRAM". IEEE Computer (magazine). 1997. Says: "Vector IRAM ... can operate as a parallel built-in self-test engine for the memory array, significantly reducing the DRAM testing time and the associated cost."
  2. Mark Oskin, Frederic T. Chong, and Timothy Sherwood. "Active Pages: A Computation Model for Intelligent Memory". 1998. Archived 2017-09-22 at the Wayback Machine.
  3. Daniel J. Bernstein. "Historical notes on mesh routing in NFS". 2002. "programming a computational RAM"
  4. "TOMI the milliwatt microprocessor" (permanent dead link).
  5. Yong-Bin Kim and Tom W. Chen. "Assessing Merged DRAM/Logic Technology" (PDF). 1998. Archived from the original on 2011-07-25. Retrieved 2011-11-27.
  6. "GYRFALCON STARTS SHIPPING AI CHIP". electronics-lab. 2018-10-10. Retrieved 5 December 2018.
  7. IRAM
  8. "PIM". Archived from the original on 2015-11-09. Retrieved 2015-05-26.
  9. Hadi Asghari-Moghaddam, et al. "Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems".
  10. Liu Ke, et al. "RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing".
  11. Dongping Zhang, et al. "TOP-PIM: Throughput-oriented programmable processing in memory".
  12. Sukhan Lee, et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product".
  13. Shuangchen Li, et al. "DRISA: A DRAM-based reconfigurable in-situ accelerator".
  14. Marzieh Lenjani, et al. "Fulcrum: a Simplified Control and Access Mechanism toward Flexible and Practical In-situ Accelerators".
