Power consumption in relation to the physical size of electronic hardware has increased as components have become smaller and more densely packed. Coupled with high operating frequencies, this has led to unacceptable levels of power dissipation. Memory accounts for a high proportion of the power consumed, and this contribution can be reduced by optimizing data organization, that is, the way data is stored. [1]
Power optimization in high-memory-density electronic systems has become one of the major challenges for devices such as mobile phones, embedded systems, and wireless devices. As the number of cores on a single chip grows, the power consumed by the device also increases. Studies of power consumption distribution in smartphones and data centers have shown that the memory subsystem consumes around 40% of the total power. In server systems, studies reveal that the memory consumes around 1.5 times the core power consumption. [2]
System-level buses, such as off-chip buses or long on-chip buses between IP blocks, are often major sources of energy consumption because of their large load capacitance. Experimental results have shown that the bus activity for memory accesses can be reduced to 50% by organizing the data. Consider the case of compiling the following code, written in the C programming language:
```c
int A[4][4], B[4][4];

for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        B[i][j] = A[j][i];
```
Most existing C compilers place a multidimensional array in row-major form, that is row by row; this is shown in the "unoptimized" column of the adjoining table. As a result, no two consecutive memory accesses while running this code touch adjacent locations: A is traversed column by column, and the accesses alternate between A and B, which occupy separate regions of memory. It is, however, possible to change the way the elements are placed in memory so as to maximize the number of sequential accesses. This can be achieved by ordering the data as shown in the "optimized" column of the table (a code sketch of such an interleaved layout follows the table). Such redistribution of data by the compiler can significantly reduce the energy consumed by memory accesses. [3]
| Unoptimized | Optimized |
|---|---|
| A[0][0] | A[0][0] |
| A[0][1] | B[0][0] |
| A[0][2] | A[1][0] |
| A[0][3] | B[0][1] |
| A[1][0] | A[2][0] |
| A[1][1] | B[0][2] |
| A[1][2] | A[3][0] |
| ⋮ | B[0][3] |
| ⋮ | A[0][1] |
| B[0][0] | B[1][0] |
| B[0][1] | A[1][1] |
| B[0][2] | B[1][1] |
| B[0][3] | ⋮ |
| B[1][0] | ⋮ |
| ⋮ | ⋮ |
| ⋮ | A[3][3] |
| B[3][3] | B[3][3] |
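As a rough illustration (not from the source: the buffer AB, the function name, and the indexing scheme are assumptions), the sketch below performs the same copy over a single flat buffer in which the elements of A and B are interleaved in access order, as in the "optimized" column, so that the two accesses of each loop iteration touch adjacent memory locations:

```c
#include <stddef.h>

#define N 4

/* One flat buffer holding the elements of A and B interleaved in the
   order the loop nest touches them: A[j][i] immediately followed by
   B[i][j] for each iteration (i, j). */
int AB[2 * N * N];

void transpose_interleaved(void)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            size_t k = 2 * (size_t)(N * i + j); /* pair index for (i, j) */
            /* AB[k] takes the role of A[j][i]; AB[k + 1] that of B[i][j],
               so the two accesses of each iteration hit adjacent words. */
            AB[k + 1] = AB[k];
        }
    }
}
```

In practice, as the surrounding text describes, this reordering would be performed by the compiler's data-layout pass rather than by hand.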
This method involves source code transformations that either modify the data structures present in the source code, introduce new data structures, or modify the access mode and access paths, with the aim of lowering power consumption. Several techniques are used to perform such transformations.
The basic idea is to modify the ordering of local array declarations so that the arrays accessed most frequently are placed at the top of the stack, where the corresponding memory locations can be addressed directly. To achieve this, the array declarations are reorganized to place the most frequently accessed arrays first, which requires either a static estimation or a dynamic analysis of the access frequency of the local arrays.
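A minimal before/after sketch of this idea, assuming the access frequencies are known from profiling (function and array names are illustrative, and the resulting stack layout is compiler-dependent):

```c
/* Before (illustrative):
 *     int coeffs[8];      // accessed a few times
 *     int samples[256];   // accessed in the hot loop below
 *
 * After: the most frequently accessed array is declared first so the
 * compiler can place it at the top of the stack frame and address it
 * with a zero (or minimal) offset. Actual placement is compiler-dependent. */
int process_block(void)
{
    int samples[256];   /* most frequently accessed: declared first */
    int coeffs[8];      /* rarely accessed: declared after */
    int acc = 0;

    for (int i = 0; i < 8; i++)
        coeffs[i] = i;

    for (int pass = 0; pass < 100; pass++)       /* hot loop */
        for (int i = 0; i < 256; i++) {
            samples[i] = coeffs[i % 8] * i;
            acc += samples[i];
        }
    return acc;
}
```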
In a program, local variables are stored on the stack and global variables are stored in data memory. This method converts local arrays into global arrays so that they are stored in data memory instead of on the stack. The location of a global array can be determined at compile time, whereas the location of a local array is known only when the subprogram is called, since it depends on the value of the stack pointer. As a consequence, global arrays are accessed with an offset addressing mode with a constant offset of 0, while local arrays, except the first, are accessed with a non-zero constant offset, and this achieves an energy reduction.
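A hedged before/after sketch of this transformation (function and array names are illustrative):

```c
#define LEN 128

/* Before: buf lives on the stack; its address depends on the stack
   pointer at call time, so accesses use base-plus-offset addressing
   relative to the current frame. */
int sum_local(void)
{
    int buf[LEN];
    int s = 0;
    for (int i = 0; i < LEN; i++) {
        buf[i] = i;
        s += buf[i];
    }
    return s;
}

/* After: the array is moved to data memory; its address is fixed at
   compile/link time, so it can be addressed with a constant (zero)
   offset from a known base. */
static int buf_global[LEN];

int sum_global(void)
{
    int s = 0;
    for (int i = 0; i < LEN; i++) {
        buf_global[i] = i;
        s += buf_global[i];
    }
    return s;
}
```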
In this method, the elements accessed most frequently are identified via profiling or static analysis. A copy of these elements is then stored in a temporary array that can be accessed without any data cache miss. This results in a significant reduction in system energy, but it can also reduce performance. [1]
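A minimal sketch of this technique, assuming the hot indices have already been identified by profiling (the sizes and index set below are illustrative):

```c
#define N   4096
#define HOT 4

int big[N];                                     /* large array in main memory */
static const int hot_idx[HOT] = { 3, 17, 42, 1023 };  /* indices found by profiling */

void update_hot(void)
{
    int tmp[HOT];                               /* small, cache-resident copy */

    for (int k = 0; k < HOT; k++)               /* gather the hot elements */
        tmp[k] = big[hot_idx[k]];

    for (int pass = 0; pass < 1000; pass++)     /* repeated work touches only tmp[] */
        for (int k = 0; k < HOT; k++)
            tmp[k] += pass;

    for (int k = 0; k < HOT; k++)               /* write the results back */
        big[hot_idx[k]] = tmp[k];
}
```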
On-chip caches use static RAM, which consumes between 25% and 50% of the total chip power and occupies about 50% of the total chip area. Scratchpad memory occupies less area than an on-chip cache, which typically reduces the energy consumption of the memory unit, because less area implies a smaller total switched capacitance. Current embedded processors, particularly in the area of multimedia applications and graphics controllers, have on-chip scratchpad memories. In cache memory systems, the mapping of program elements onto the cache is done at run time, whereas in scratchpad memory systems this is done either by the user or automatically by the compiler using a suitable algorithm. [4]
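As a hedged example of such user- or compiler-directed placement, the sketch below pins a frequently used coefficient array into a scratchpad region through a named linker section; the section name ".scratchpad", the array, and the function are assumptions that would have to match the target's linker script:

```c
#include <stdint.h>

/* The GCC/Clang section attribute places coeff[] into a named linker
   section; mapping that section onto the on-chip scratchpad is done by
   the linker script, i.e. at compile/link time rather than at run time
   as a cache would. The ".scratchpad" name is an assumption. */
__attribute__((section(".scratchpad")))
static int32_t coeff[256];

int32_t dot(const int32_t *x)
{
    int32_t acc = 0;
    for (int i = 0; i < 256; i++)
        acc += coeff[i] * x[i];   /* coeff[] accesses stay on-chip */
    return acc;
}
```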
In computing, an optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program. Common requirements are to minimize a program's execution time, memory footprint, storage size, and power consumption.
In computer science, an instruction set architecture (ISA), also called computer architecture, is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation.
Static random-access memory is a type of random-access memory (RAM) that uses latching circuitry (flip-flop) to store each bit. SRAM is volatile memory; data is lost when power is removed.
In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as traversing the elements in a one-dimensional array.
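For illustration (not taken from the source), the short loop below exhibits both kinds of locality:

```c
#define LEN 1024

int sum(const int a[LEN])
{
    int acc = 0;                  /* reused every iteration: temporal locality */
    for (int i = 0; i < LEN; i++)
        acc += a[i];              /* consecutive addresses: spatial (sequential) locality */
    return acc;
}
```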
Dynamic random-access memory is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal–oxide–semiconductor (MOS) technology. While most DRAM memory cell designs use a capacitor and transistor, some only use two transistors. In the designs where a capacitor is used, the capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1. The electric charge on the capacitors gradually leaks away; without intervention the data on the capacitor would soon be lost. To prevent this, DRAM requires an external memory refresh circuit which periodically rewrites the data in the capacitors, restoring them to their original charge. This refresh process is the defining characteristic of dynamic random-access memory, in contrast to static random-access memory (SRAM) which does not require data to be refreshed. Unlike flash memory, DRAM is volatile memory, since it loses its data quickly when power is removed. However, DRAM does exhibit limited data remanence.
A system on a chip or system-on-chip is an integrated circuit that integrates most or all components of a computer or other electronic system. These components almost always include on-chip central processing unit (CPU), memory interfaces, input/output devices, input/output interfaces, and secondary storage interfaces, often alongside other components such as radio modems and a graphics processing unit (GPU) – all on a single substrate or microchip. SoCs may contain digital, and also analog, mixed-signal, and often radio frequency signal processing functions.
In computer science, algorithmic efficiency is a property of an algorithm which relates to the amount of computational resources used by the algorithm. An algorithm must be analyzed to determine its resource usage, and the efficiency of an algorithm can be measured based on the usage of different resources. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process.
In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a virtual machine in which the primary interaction is moving short-lived temporary values to and from a push down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automata with additional load/store operations or multiple stacks and hence are Turing-complete.
The Blackfin is a family of 16-/32-bit microprocessors developed, manufactured and marketed by Analog Devices. The processors have built-in, fixed-point digital signal processor (DSP) functionality supplied by 16-bit multiply–accumulates (MACs), accompanied on-chip by a microcontroller. It was designed for a unified low-power processor architecture that can run operating systems while simultaneously handling complex numeric tasks such as real-time H.264 video encoding.
Semiconductor memory is a digital electronic semiconductor device used for digital data storage, such as computer memory. It typically refers to devices in which data is stored within metal–oxide–semiconductor (MOS) memory cells on a silicon integrated circuit memory chip. There are numerous different types using different semiconductor technologies. The two main types of random-access memory (RAM) are static RAM (SRAM), which uses several transistors per memory cell, and dynamic RAM (DRAM), which uses a transistor and a MOS capacitor per cell. Non-volatile memory uses floating-gate memory cells, which consist of a single floating-gate transistor per cell.
The AT&T Hobbit is a microprocessor design that AT&T Corporation developed in the early 1990s. It was based on the company's CRISP design, which in turn grew out of the C Machine design by Bell Labs of the late 1980s. All were optimized for running code compiled from the C programming language.
In computer science, execute in place (XIP) is a method of executing programs directly from long-term storage rather than copying them into RAM. It is an extension of using shared memory to reduce the total amount of memory required.
In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.
Memory refresh is the process of periodically reading information from an area of computer memory and immediately rewriting the read information to the same area without modification, for the purpose of preserving the information. Memory refresh is a background maintenance process required during the operation of semiconductor dynamic random-access memory (DRAM), the most widely used type of computer memory, and in fact is the defining characteristic of this class of memory.
Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is an internal memory, usually high-speed, used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor, scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc. When the scratchpad is a hidden portion of the main memory then it is sometimes referred to as bump storage.
Computational RAM (C-RAM) is random-access memory with processing elements integrated on the same chip. This enables C-RAM to be used as a SIMD computer. It also can be used to more efficiently use memory bandwidth within a memory chip. The general technique of doing computations in memory is called Processing-In-Memory (PIM).
Bus encoding refers to converting/encoding a piece of data into another form before placing it on the bus. While bus encoding can serve various purposes, such as reducing the number of pins, compressing the data to be transmitted, and reducing cross-talk between bit lines, it is also one of the popular techniques used in system design to reduce the dynamic power consumed by the system bus. Bus encoding aims to reduce the Hamming distance between two consecutive values on the bus. Since switching activity is directly proportional to the Hamming distance, bus encoding proves effective in reducing the overall activity factor and thereby the dynamic power consumption of the system.
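One widely used bus-encoding scheme is bus-invert coding; the sketch below (function and variable names are illustrative) transmits the word inverted whenever more than half of the bus lines would otherwise toggle, bounding the number of transitions per transfer to about half the bus width plus the extra invert line:

```c
#include <stdbool.h>
#include <stdint.h>

#define BUS_WIDTH 32

static uint32_t prev_bus;   /* word currently driven on the bus lines */

/* Returns the word to drive on the bus; *invert reports the state of the
   extra control line so the receiver can undo the inversion.
   __builtin_popcount is a GCC/Clang builtin that counts set bits. */
uint32_t bus_invert_encode(uint32_t data, bool *invert)
{
    int toggles = __builtin_popcount(data ^ prev_bus);  /* Hamming distance */
    *invert = toggles > BUS_WIDTH / 2;
    uint32_t encoded = *invert ? ~data : data;
    prev_bus = encoded;
    return encoded;
}
```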
Power consumption is becoming increasingly important for embedded and mobile computing as well as high-performance systems. The off-chip data bus consumes a significant part of system power: it has been observed to account for between 9.8% and 23.2% of the total power consumed, depending on the system. Reducing the power consumption of the off-chip data bus therefore reduces the overall power consumption.
In computing, a memory access pattern or IO access pattern is the pattern with which a system or program reads and writes memory on secondary storage. These patterns differ in the level of locality of reference and drastically affect cache performance, and also have implications for the approach to parallelism and distribution of workload in shared memory systems. Further, cache coherency issues can affect multiprocessor performance, which means that certain memory access patterns place a ceiling on parallelism.