Value cache encoding

Last updated

Power consumption is becoming increasingly important for both embedded, mobile computing and high-performance systems. [1] Off-chip data bus consumes a significant part of system power. It is observed that the off-chip data bus consumes between 9.8% and 23.2% of the total power consumed by the system depending on the system. So, reducing the power consumption of the off-chip data bus would reduce the overall power consumption.

Contents

Introduction

[2] Off-chip buses are associated with higher capacitance values internal node capacitances and hence, therefore techniques for minimizing switching at external address and data buses, even at the expense of a slight increase in switching at internal capacitances are quite useful. Off chip power consumption can be reduced by at least these two techniques: by reducing the number of bus lines activated during data transfer and by reducing the number of bit transitions on the active bus lines. A mix of both of these techniques produce optimal result.
Value cache encoding is a scheme which is used to reduce power consumption in off chip data bus. In this scheme, Cache at both side of data bus is used to reduce the dynamic power dissipation on off-chip data buses. These caches are maintained such that their contents are the same all the time.

Scheme Details

In this protocol, we employ a small cache (called value cache, or VC for short) at each side of the off-chip data bus. These value caches keep track of the data values that have recently been transmitted over the bus. The entries in these caches are constructed in such a way that the contents of both the value caches are the same all the time. When a data value needs to be transmitted over the bus, we first check whether it is in the value cache of the sender (whether it is memory or cache). If it is, we transmit only the index of the data (i.e., its value cache address, or index) instead of the actual data value and, the other side (receiver) can determine the data value by using this index and its value cache.
To transmit the data in the value cache using only 1 bit switching activity, the size of value cache is limited to the width of the data bus. That is, with a 32-bit bus, the VC could have only 32 entries. Since the value caches employed by our power protocol are very small, the width of the index value is much smaller than the width of the actual data value. Consequently, fewer off-chip bus lines need to be activated for transmission
Our approach tries to achieve the first option by exploiting the locality of the data values communicated over the off-chip data bus. However, once the width of the data (that needs to be transmitted) has been reduced, we can also expect a reduction (in general) on the average bit switching activity per transfer. In addition, this switching activity can be further reduced by using well-known bus encoding schemes in conjunction with our strategy

Cache Consistency

The receiver side runs the same placement and replacement policy for the VC as the sender. Thus, the value of the data sent over the bus is copied in the receiver VC at the same index location, as in the sender VC. We use one extra control bit to indicate whether the data being sent over the bus is the verbatim data or an index to the VC. A memory-write activity is handled in a similar fashion.

Example

We assume that initially the values 100 and 200 are not present in the VC. During Transaction #1, A is sent from the memory to the cache. The requested data item, stored at some address (say address X of the memory) has a value 100. The memory controller searches the VC for the value 100 and detects a miss. Therefore, the value 100 is sent over the off-chip data bus. Also, following our power protocol, the value 100 is stored at the same location (say 5) of the VCs of both the source and the destination. For Transaction #2, the memory controller searches for the value 200, cannot find the value in the VC, and repeats the steps as described above. At this point the value caches at the both ends contain data values 100 and 200. In Transaction #3, the memory controller finds that a value 100 has to be sent to serve a read request (of the same memory location as before, or of a different memory location that has the same value). But, note that the value 100 is already present in the VC of the sender in location 5 as a result of Transaction #1. Therefore, instead of sending the value 100, the memory controller just sends the index value 5. The receiver, on the other hand, fetches the value of the actual data (100 in this case) from location 5 of its VC. Finally, in Transaction #4, we want to send the data item D having a value 200 to the memory (i.e., a write request). But, the value 200 is already cached in both the VCs as a result of Transaction #2 from memory to cache. Consequently, the index to the cached copy (present in the VC) of value 200 is used to complete Transaction #4 but in the reverse direction. This last transaction shows that, the data placed in the VCs during a transaction in one direction can be re-used (from the VC) during a transaction in the reverse direction.

Replacement Policy

[3] LRU is used as replacement policy in both caches. It is implemented using reference bit and a n-bit time stamp for each value stored in cache. When a value appears in input, reference bit is set. At regular intervals, the reference bit is shifted right into the high-order bit position of the n-bit timestamp causing all bits in the timestamp also to be shifted right and the lowest-order bit in the timestamp being discarded. For example, timestamp 000 means this value did not appear during the last three time intervals, timestamp 100 means it was just seen in the last interval, and the timestamp 000 with reference bit set means it is encountered in the current time slot.
Operation mentioned above is performed on all entries of both caches with all reference bits reset. Thus, the timestamp keeps the history of value occurrences for the last n time periods.
When an entry is required and a value is to be evicted, the entry that is selected is the one with the smallest timestamp and clear reference bit. The new value is put in with a fresh reference bit and timestamp (all 0's) in this selected entry.

Type of Value Cache

After describing the protocol, we will now see two approaches for maintaining cache :

  1. Both caches can be initialized using fixed set of values depending on frequency of appearance of values in previous run.
  2. A changing set of frequent values can be maintained as the program runs. Thus, the contents of the frequent value tables adapt to changes in the frequent values for different parts of execution.
    The advantage of filling cache with fixed value is that the coders do not have to change the contents of the table dynamically thus reducing the runtime overhead. However, it requires that values be known before hand and different program needs different values. The second method, on the other hand, does not need a priori information of data values and does not distinguish among different programs. With these features, we pay the price of

identifying the frequent values on the fly.

Other Application

The protocol we discussed was applied on bus whose one end is on-chip cache and other end was off-chip memory, it is possible to adapt our strategy to work with an off-chip L2 cache as well. In addition, power protocol can also be used to reduce the switching activity between on-chip L1 cache and on-chip L2 cache (although the results would not be as good as those with the off-chip bus). In fact, our strategy can be used between any two communicating devices in the system (with the VC support). Further, we are not restricted by point-to-point configurations. That is, our approach can be made to work in an environment where multiple devices are communicating over a shared (power-hungry) data bus.
Obviously, in this case, among other things, we would need a coherence mechanism (the discussion of which is beyond the scope of this paper). A drawback of our strategy is the extra space needed by two value caches (one on-chip and the other off-chip). In this paper, we do not present a detailed study of the circuit space implications of our approach. As will be presented in the experimental results section, even a small VC (128 entries) generates reasonably good energy behavior; so, we can expect that the space overhead due to our optimization will not be excessive.

See also

Related Research Articles

<span class="mw-page-title-main">Bus (computing)</span> System that transfers data between components within a computer

In computer architecture, a bus is a communication system that transfers data between components inside a computer, or between computers. This expression covers all related hardware components and software, including communication protocols.

The Internet Control Message Protocol (ICMP) is a supporting protocol in the Internet protocol suite. It is used by network devices, including routers, to send error messages and operational information indicating success or failure when communicating with another IP address, for example, an error is indicated when a requested service is not available or that a host or router could not be reached. ICMP differs from transport protocols such as TCP and UDP in that it is not typically used to exchange data between systems, nor is it regularly employed by end-user network applications.

<span class="mw-page-title-main">Peripheral Component Interconnect</span> Local computer bus for attaching hardware devices

Peripheral Component Interconnect (PCI) is a local computer bus for attaching hardware devices in a computer and is part of the PCI Local Bus standard. The PCI bus supports the functions found on a processor bus but in a standardized format that is independent of any given processor's native bus. Devices connected to the PCI bus appear to a bus master to be connected directly to its own bus and are assigned addresses in the processor's address space. It is a parallel bus, synchronous to a single bus clock. Attached devices can take either the form of an integrated circuit fitted onto the motherboard or an expansion card that fits into a slot. The PCI Local Bus was first implemented in IBM PC compatibles, where it displaced the combination of several slow Industry Standard Architecture (ISA) slots and one fast VESA Local Bus (VLB) slot as the bus configuration. It has subsequently been adopted for other computer types. Typical PCI cards used in PCs include: network cards, sound cards, modems, extra ports such as Universal Serial Bus (USB) or serial, TV tuner cards and hard disk drive host adapters. PCI video cards replaced ISA and VLB cards until rising bandwidth needs outgrew the abilities of PCI. The preferred interface for video cards then became Accelerated Graphics Port (AGP), a superset of PCI, before giving way to PCI Express.

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, and file transfer rely on TCP, which is part of the Transport Layer of the TCP/IP suite. SSL/TLS often runs on top of TCP.

Time to live (TTL) or hop limit is a mechanism which limits the lifespan or lifetime of data in a computer or network. TTL may be implemented as a counter or timestamp attached to or embedded in the data. Once the prescribed event count or timespan has elapsed, data is discarded or revalidated. In computer networking, TTL prevents a data packet from circulating indefinitely. In computing applications, TTL is commonly used to improve the performance and manage the caching of data.

<span class="mw-page-title-main">Static random-access memory</span> Type of computer memory

Static random-access memory is a type of random-access memory (RAM) that uses latching circuitry (flip-flop) to store each bit. SRAM is volatile memory; data is lost when power is removed.

<span class="mw-page-title-main">Synchronous dynamic random-access memory</span> Type of computer memory

Synchronous dynamic random-access memory is any DRAM where the operation of its external pin interface is coordinated by an externally supplied clock signal.

<span class="mw-page-title-main">Cache coherence</span> Computer architecture term concerning shared resource data

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

The MESI protocol is an Invalidate-based cache coherence protocol, and is one of the most common protocols that support write-back caches. It is also known as the Illinois protocol. Write back caches can save a lot of bandwidth that is generally wasted on a write through cache. There is always a dirty state present in write back caches that indicates that the data in the cache is different from that in main memory. The Illinois Protocol requires a cache to cache transfer on a miss if the block resides in another cache. This protocol reduces the number of main memory transactions with respect to the MSI protocol. This marks a significant improvement in performance.

The Serial Peripheral Interface (SPI) is a synchronous serial communication interface specification used for short-distance communication, primarily in embedded systems. The interface was developed by Motorola in the mid-1980s and has become a de facto standard. Typical applications include Secure Digital cards and liquid crystal displays.

Bus snooping or bus sniffing is a scheme by which a coherency controller (snooper) in a cache monitors or snoops the bus transactions, and its goal is to maintain a cache coherency in distributed shared memory systems. A cache containing a coherency controller (snooper) is called a snoopy cache. This scheme was introduced by Ravishankar and Goodman in 1983.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

The RTP Control Protocol (RTCP) is a sister protocol of the Real-time Transport Protocol (RTP). Its basic functionality and packet structure is defined in RFC 3550. RTCP provides out-of-band statistics and control information for an RTP session. It partners with RTP in the delivery and packaging of multimedia data, but does not transport any media data itself.

<span class="mw-page-title-main">RapidIO</span> Electrical connection technology

The RapidIO architecture is a high-performance packet-switched electrical connection technology. RapidIO supports messaging, read/write and cache coherency semantics. Based on industry-standard electrical specifications such as those for Ethernet, RapidIO can be used as a chip-to-chip, board-to-board, and chassis-to-chassis interconnect.

CANopen is a communication protocol and device profile specification for embedded systems used in automation. In terms of the OSI model, CANopen implements the layers above and including the network layer. The CANopen standard consists of an addressing scheme, several small communication protocols and an application layer defined by a device profile. The communication protocols have support for network management, device monitoring and communication between nodes, including a simple transport layer for message segmentation/desegmentation. The lower level protocol implementing the data link and physical layers is usually Controller Area Network (CAN), although devices using some other means of communication can also implement the CANopen device profile.

RuBee is a two way active wireless protocol designed for harsh environment and high security asset visibility applications. RuBee utilizes longwave signals to send and receive short data packets in a local regional network. The protocol is similar to the IEEE 802 protocols in that RuBee is networked by using on-demand, peer-to-peer, and active radiating transceivers. RuBee is different in that it uses a low frequency (131 kHz) carrier. One result is that RuBee is slow compared to other packet based network data standards (WiFi). 131 kHz as an operating frequency provides RuBee with the advantages of ultra low power consumption, and normal operation near steel and/or water. These features make it easy to deploy sensors, controls, or even actuators and indicators.

Bus encoding refers to converting/encoding a piece of data to another form before launching on the bus. While bus encoding can be used to serve various purposes like reducing the number of pins, compressing the data to be transmitted, reducing cross-talk between bit lines, etc., it is one of the popular techniques used in system design to reduce dynamic power consumed by the system bus. Bus encoding aims to reduce the Hamming distance between 2 consecutive values on the bus. Since the activity is directly proportional to the Hamming distance, bus encoding proves to be effective in reducing the overall activity factor thereby reducing the dynamic power consumption in the system.

In lower power systems, Hierarchical Value Cache refers to the hierarchical arrangement of Value Caches (VCs) in such a fashion that lower level VCs observe higher hit-rates, but undergo more switching activity on VC hits.

<span class="mw-page-title-main">I3C (bus)</span>

I3C is a specification to enable communication between computer chips by defining the electrical connection between the chips and signaling patterns to be used. Short for "Improved Inter Integrated Circuit", the standard defines the electrical connection between the chips to be a two wire, shared (multidrop), serial data bus, one wire (SCL) being used as a clock to define the sampling times, the other wire (SDA) being used as a data line whose voltage can be sampled. The standard defines a signalling protocol in which multiple chips can control communication and thereby act as the bus controller.

References

  1. Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1176262
  2. Dinesh C Suresh; Banit Agrawal; Jun Yang; Walid Najjar (28 June 2005). "A Tunable Bus Encoder for Off-Chip Data Buses" (PDF). Retrieved 2015-04-22.
  3. Jun Yang; Gupta, R. (2001). "FV encoding for low-power data I/O". ISLPED'01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design (IEEE Cat. No.01TH8581). ieeexplore.ieee.org. pp. 84–87. doi:10.1109/LPE.2001.945379. ISBN   1-58113-371-5. S2CID   9582140.