Value cache encoding

Last updated

Power consumption is becoming increasingly important for both embedded, mobile computing and high-performance systems. [1] Off-chip data bus consumes a significant part of system power. It is observed that the off-chip data bus consumes between 9.8% and 23.2% of the total power consumed by the system depending on the system. So, reducing the power consumption of the off-chip data bus would reduce the overall power consumption.

Contents

Introduction

[2] Off-chip buses are associated with higher capacitance values internal node capacitances and hence, therefore techniques for minimizing switching at external address and data buses, even at the expense of a slight increase in switching at internal capacitances are quite useful. Off chip power consumption can be reduced by at least these two techniques: by reducing the number of bus lines activated during data transfer and by reducing the number of bit transitions on the active bus lines. A mix of both of these techniques produce optimal result.
Value cache encoding is a scheme which is used to reduce power consumption in off chip data bus. In this scheme, Cache at both side of data bus is used to reduce the dynamic power dissipation on off-chip data buses. These caches are maintained such that their contents are the same all the time.

Scheme Details

In this protocol, we employ a small cache (called value cache, or VC for short) at each side of the off-chip data bus. These value caches keep track of the data values that have recently been transmitted over the bus. The entries in these caches are constructed in such a way that the contents of both the value caches are the same all the time. When a data value needs to be transmitted over the bus, we first check whether it is in the value cache of the sender (whether it is memory or cache). If it is, we transmit only the index of the data (i.e., its value cache address, or index) instead of the actual data value and, the other side (receiver) can determine the data value by using this index and its value cache.
To transmit the data in the value cache using only 1 bit switching activity, the size of value cache is limited to the width of the data bus. That is, with a 32-bit bus, the VC could have only 32 entries. Since the value caches employed by our power protocol are very small, the width of the index value is much smaller than the width of the actual data value. Consequently, fewer off-chip bus lines need to be activated for transmission
Our approach tries to achieve the first option by exploiting the locality of the data values communicated over the off-chip data bus. However, once the width of the data (that needs to be transmitted) has been reduced, we can also expect a reduction (in general) on the average bit switching activity per transfer. In addition, this switching activity can be further reduced by using well-known bus encoding schemes in conjunction with our strategy

Cache Consistency

The receiver side runs the same placement and replacement policy for the VC as the sender. Thus, the value of the data sent over the bus is copied in the receiver VC at the same index location, as in the sender VC. Engineers use one extra control bit to indicate whether the data being sent over the bus is the verbatim data or an index to the VC. A memory-write activity is handled in a similar fashion.

Example

We assume that initially the values 100 and 200 are not present in the VC. During Transaction #1, A is sent from the memory to the cache. The requested data item, stored at some address (say address X of the memory) has a value 100. The memory controller searches the VC for the value 100 and detects a miss. Therefore, the value 100 is sent over the off-chip data bus. Also, following our power protocol, the value 100 is stored at the same location (say 5) of the VCs of both the source and the destination. For Transaction #2, the memory controller searches for the value 200, cannot find the value in the VC, and repeats the steps as described above. At this point the value caches at the both ends contain data values 100 and 200. In Transaction #3, the memory controller finds that a value 100 has to be sent to serve a read request (of the same memory location as before, or of a different memory location that has the same value). But, note that the value 100 is already present in the VC of the sender in location 5 as a result of Transaction #1. Therefore, instead of sending the value 100, the memory controller just sends the index value 5. The receiver, on the other hand, fetches the value of the actual data (100 in this case) from location 5 of its VC. Finally, in Transaction #4, we want to send the data item D having a value 200 to the memory (i.e., a write request). But, the value 200 is already cached in both the VCs as a result of Transaction #2 from memory to cache. Consequently, the index to the cached copy (present in the VC) of value 200 is used to complete Transaction #4 but in the reverse direction. This last transaction shows that, the data placed in the VCs during a transaction in one direction can be re-used (from the VC) during a transaction in the reverse direction.

Replacement Policy

[3] LRU is used as replacement policy in both caches. It is implemented using reference bit and a n-bit time stamp for each value stored in cache. When a value appears in input, reference bit is set. At regular intervals, the reference bit is shifted right into the high-order bit position of the n-bit timestamp causing all bits in the timestamp also to be shifted right and the lowest-order bit in the timestamp being discarded. For example, timestamp 000 means this value did not appear during the last three time intervals, timestamp 100 means it was just seen in the last interval, and the timestamp 000 with reference bit set means it is encountered in the current time slot.
Operation mentioned above is performed on all entries of both caches with all reference bits reset. Thus, the timestamp keeps the history of value occurrences for the last n time periods.
When an entry is required and a value is to be evicted, the entry that is selected is the one with the smallest timestamp and clear reference bit. The new value is put in with a fresh reference bit and timestamp (all 0's) in this selected entry.

Type of Value Cache

After describing the protocol, we will now see two approaches for maintaining cache :

  1. Both caches can be initialized using fixed set of values depending on frequency of appearance of values in previous run.
  2. A changing set of frequent values can be maintained as the program runs. Thus, the contents of the frequent value tables adapt to changes in the frequent values for different parts of execution.
    The advantage of filling cache with fixed value is that the coders do not have to change the contents of the table dynamically thus reducing the runtime overhead. However, it requires that values be known before hand and different program needs different values. The second method, on the other hand, does not need a priori information of data values and does not distinguish among different programs. With these features, we pay the price of

identifying the frequent values on the fly.

Other Application

The protocol we discussed was applied on bus whose one end is on-chip cache and other end was off-chip memory, it is possible to adapt our strategy to work with an off-chip L2 cache as well. In addition, power protocol can also be used to reduce the switching activity between on-chip L1 cache and on-chip L2 cache (although the results would not be as good as those with the off-chip bus). In fact, our strategy can be used between any two communicating devices in the system (with the VC support). Further, we are not restricted by point-to-point configurations. That is, our approach can be made to work in an environment where multiple devices are communicating over a shared (power-hungry) data bus.
Obviously, in this case, among other things, we would need a coherence mechanism (the discussion of which is beyond the scope of this paper). A drawback of our strategy is the extra space needed by two value caches (one on-chip and the other off-chip). In this paper, we do not present a detailed study of the circuit space implications of our approach. As will be presented in the experimental results section, even a small VC (128 entries) generates reasonably good energy behavior; so, we can expect that the space overhead due to our optimization will not be excessive.

See also

Related Research Articles

The Internet Control Message Protocol (ICMP) is a supporting protocol in the Internet protocol suite. It is used by network devices, including routers, to send error messages and operational information indicating success or failure when communicating with another IP address. For example, an error is indicated when a requested service is not available or that a host or router could not be reached. ICMP differs from transport protocols such as TCP and UDP in that it is not typically used to exchange data between systems, nor is it regularly employed by end-user network applications.

<span class="mw-page-title-main">Peripheral Component Interconnect</span> Local computer bus for attaching hardware devices

Peripheral Component Interconnect (PCI) is a local computer bus for attaching hardware devices in a computer and is part of the PCI Local Bus standard. The PCI bus supports the functions found on a processor bus but in a standardized format that is independent of any given processor's native bus. Devices connected to the PCI bus appear to a bus master to be connected directly to its own bus and are assigned addresses in the processor's address space. It is a parallel bus, synchronous to a single bus clock. Attached devices can take either the form of an integrated circuit fitted onto the motherboard or an expansion card that fits into a slot. The PCI Local Bus was first implemented in IBM PC compatibles, where it displaced the combination of several slow Industry Standard Architecture (ISA) slots and one fast VESA Local Bus (VLB) slot as the bus configuration. It has subsequently been adopted for other computer types. Typical PCI cards used in PCs include: network cards, sound cards, modems, extra ports such as Universal Serial Bus (USB) or serial, TV tuner cards and hard disk drive host adapters. PCI video cards replaced ISA and VLB cards until rising bandwidth needs outgrew the abilities of PCI. The preferred interface for video cards then became Accelerated Graphics Port (AGP), a superset of PCI, before giving way to PCI Express.

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, and file transfer rely on TCP, which is part of the Transport layer of the TCP/IP suite. SSL/TLS often runs on top of TCP.

<span class="mw-page-title-main">Synchronous dynamic random-access memory</span> Type of computer memory

Synchronous dynamic random-access memory is any DRAM where the operation of its external pin interface is coordinated by an externally supplied clock signal.

<span class="mw-page-title-main">I²C</span> Serial communication bus

I2C (Inter-Integrated Circuit; pronounced as “eye-squared-see” or “eye-two-see”), alternatively known as I2C or IIC, is a synchronous, multi-controller/multi-target (historically-termed as master/slave), single-ended, serial communication bus invented in 1982 by Philips Semiconductors. It is widely used for attaching lower-speed peripheral integrated circuits (ICs) to processors and microcontrollers in short-distance, intra-board communication.

<span class="mw-page-title-main">Cache coherence</span> Computer architecture term concerning shared resource data

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

<span class="mw-page-title-main">Memory management unit</span> Hardware translating virtual addresses to physical address

A memory management unit (MMU), sometimes called paged memory management unit (PMMU), is a computer hardware unit that examines all memory references on the memory bus, translating these requests, known as virtual memory addresses, into physical addresses in main memory.

Serial Peripheral Interface (SPI) is a de facto standard for synchronous serial communication, used primarily in embedded systems for short-distance wired communication between integrated circuits.

Bus snooping or bus sniffing is a scheme by which a coherency controller (snooper) in a cache monitors or snoops the bus transactions, and its goal is to maintain a cache coherency in distributed shared memory systems. This scheme was introduced by Ravishankar and Goodman in 1983, under the name "write-once" cache coherency. A cache containing a coherency controller (snooper) is called a snoopy cache.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

The RTP Control Protocol (RTCP) is a binary-encoded out-of-band signaling protocol that functions alongside the Real-time Transport Protocol (RTP). Its basic functionality and packet structure is defined in RFC 3550. RTCP provides statistics and control information for an RTP session. It partners with RTP in the delivery and packaging of multimedia data but does not transport any media data itself.

<span class="mw-page-title-main">RapidIO</span> High-speed interconnect technology

The RapidIO architecture is a high-performance packet-switched electrical connection technology. It supports messaging, read/write and cache coherency semantics. Based on industry-standard electrical specifications such as those for Ethernet, RapidIO can be used as a chip-to-chip, board-to-board, and chassis-to-chassis interconnect.

The Arm Advanced Microcontroller Bus Architecture (AMBA) is an open-standard, on-chip interconnect specification for the connection and management of functional blocks in system-on-a-chip (SoC) designs. It facilitates development of multi-processor designs with large numbers of controllers and components with a bus architecture. Since its inception, the scope of AMBA has, despite its name, gone far beyond microcontroller devices. Today, AMBA is widely used on a range of ASIC and SoC parts including applications processors used in modern portable mobile devices like smartphones. AMBA is a registered trademark of Arm Ltd.

<span class="mw-page-title-main">Low Pin Count</span> Low-bandwidth computer motherboard bus

The Low Pin Count (LPC) bus is a computer bus used on IBM-compatible personal computers to connect low-bandwidth devices to the CPU, such as the BIOS ROM, "legacy" I/O devices, and Trusted Platform Module (TPM). "Legacy" I/O devices usually include serial and parallel ports, PS/2 keyboard, PS/2 mouse, and floppy disk controller.

CANopen is a communication protocol stack and device profile specification for embedded systems used in automation. In terms of the OSI model, CANopen implements the layers above and including the network layer. The CANopen standard consists of an addressing scheme, several small communication protocols and an application layer defined by a device profile. The communication protocols have support for network management, device monitoring and communication between nodes, including a simple transport layer for message segmentation/desegmentation. The lower level protocol implementing the data link and physical layers is usually Controller Area Network (CAN), although devices using some other means of communication can also implement the CANopen device profile.

NACK-Oriented Reliable Multicast (NORM) is a transport layer Internet protocol designed to provide reliable transport in multicast groups in data networks. It is formally defined by the Internet Engineering Task Force (IETF) in Request for Comments (RFC) 5740, which was published in November 2009.

Bus encoding refers to converting/encoding a piece of data to another form before launching on the bus. While bus encoding can be used to serve various purposes like reducing the number of pins, compressing the data to be transmitted, reducing cross-talk between bit lines, etc., it is one of the popular techniques used in system design to reduce dynamic power consumed by the system bus. Bus encoding aims to reduce the Hamming distance between 2 consecutive values on the bus. Since the activity is directly proportional to the Hamming distance, bus encoding proves to be effective in reducing the overall activity factor thereby reducing the dynamic power consumption in the system.

In lower power systems, Hierarchical Value Cache refers to the hierarchical arrangement of Value Caches (VCs) in such a fashion that lower level VCs observe higher hit-rates, but undergo more switching activity on VC hits.

<span class="mw-page-title-main">I3C (bus)</span> Serial bus specification

I3C, also known as SenseWire, is a specification to enable communication between computer chips by defining the electrical connection between the chips and signaling patterns to be used. Short for "Improved Inter Integrated Circuit", the standard defines the electrical connection between the chips to be a two wire, shared (multidrop), serial data bus, one wire (SCL) being used as a clock to define the sampling times, the other wire (SDA) being used as a data line whose voltage can be sampled. The standard defines a signalling protocol in which multiple chips can control communication and thereby act as the bus controller.

References

  1. Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1176262
  2. Dinesh C Suresh; Banit Agrawal; Jun Yang; Walid Najjar (28 June 2005). "A Tunable Bus Encoder for Off-Chip Data Buses" (PDF). Retrieved 2015-04-22.
  3. Jun Yang; Gupta, R. (2001). "FV encoding for low-power data I/O". ISLPED'01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design (IEEE Cat. No.01TH8581). IEEE. pp. 84–87. doi:10.1109/LPE.2001.945379. ISBN   1-58113-371-5. S2CID   9582140.