Dragon protocol

Last updated

The Dragon Protocol [1] is an update based cache coherence protocol used in multi-processor systems. Write propagation is performed by directly updating all the cached values across multiple processors. Update based protocols such as the Dragon protocol perform efficiently when a write to a cache block is followed by several reads made by other processors, since the updated cache block is readily available across caches associated with all the processors.

Contents

States

Each cache block resides in one of the four states: exclusive-clean, shared-clean, shared-modified and modify.

For any given pair of caches, the permitted states of a given cache block in conjunction with the states of the other cache's states, are as follows (the states abbreviated in the order above):

 E  Sc  Sm  M 
 E Red x.svgRed x.svgRed x.svgRed x.svg
 Sc Red x.svgGreen check.svgGreen check.svgRed x.svg
 Sm Red x.svgGreen check.svgRed x.svgRed x.svg
 M Red x.svgRed x.svgRed x.svgRed x.svg

Transactions

There are 4 processor transactions and 2 bus transactions.

Processor Read (PrRd): This happens when the processor completes a successful read on a certain cache block placed in its cache.

Processor Write (PrWr): This happens when the processor completes a successful write on a certain cache block placed in its cache. This makes the processor to be the latest to update the cache block.

Processor Read Miss (PrRdMiss): This happens when the processor fails to read a cache block from its cache, and needs to fetch the block from either the memory or another cache.

Processor Write Miss (PrWrMiss): This happens when the processor fails to write to a cache block from its cache, and needs to fetch the block from the memory or another cache and then write to it. This again makes the processor to be the latest to update the cache block.

Bus Read (BusRd): This happens when a processor requests the bus to fetch the latest value of the cache block, whether it be from the main memory or another processor’s cache.

Flush: This happens when a processor places an entire cache block on the bus. This is to reflect the changes made by the processor to the cached block in the main memory.

Bus Update (BusUpd): This happens when a processor modifies a cache block, and other processors need an update in their respective cache blocks. This is unique to only write update protocols. BusUpd takes shorter time when compared to Flush operation, since writes made to caches are faster than to memory. Another point to note is that a cache cannot update its local copy of a cache block and then request the bus to send out a bus update. If this does happen, then it might be possible that two caches independently update their local copy and then request for the bus. They would then see the two writes simultaneously which would not follow Sequential consistency.

A shared line is also required to indicate whether a certain cache block is available in multiple caches. This is required because one of the caches could evict the block without needing to update the other blocks. The shared line helps reduce memory and bus transactions in some cases where the block is available in only one cache and a bus update is hence, not required. Such a dedicated line to detect sharing is seen across write-update protocols, like the Firefly protocol and implemented based on bus standards such as Futurebus (IEEE standard P896.1). [2]

Transitions

Dragon Protocol - Processor Initiated Transactions Dragon Protocol Processor Initiated Transactions.png
Dragon Protocol - Processor Initiated Transactions

Processor-initiated transitions

Based on the current state of the block and the transaction initiated by the processor, the cache block undergoes one of the following state transitions:

Dragon protocol - Bus Initiated Transactions Dragon Protocol Bus Initiated Transactions.png
Dragon protocol - Bus Initiated Transactions

Bus-initiated transitions

Based on the current state of the block and the transaction initiated by the bus, the cache block undergoes one of the following state transitions:

Low Level Design Choices

Elimination of the Shared-modified state

The processor with the cache block in Sm state is responsible for updating main memory when the cache block is replaced. But if the main memory is updated whenever a bus update transaction occurs, there is no need for separate Sm and Sc states, and the protocol can afford a single shared state. However, this causes a lot more memory transactions [3] which can slow down the system, especially when multiple processors are writing to the same cache block.

Broadcasting the replacement of Sc block

The protocol allows the cache blocks in the Sc state to be replaced silently without any bus activity. If a broadcast was made to let other caches know that a Sc block is being replaced, they could test the shared line and move to state E if there were no other sharers. The advantage of having a block in state E is that if the block is later written, it goes to M state and there is no need to generate a bus update transaction. So at the cost of broadcasting the replacements of Sc blocks, we are able to avoid bus update transactions. And since broadcasting replacements is not time critical, if a cache is not required to process the replacement right away, there is no downside. On the other hand, if a cache does not process an update right away, it may lead to out-of-order updates. In such cases a three state update protocol, like the Firefly protocol, may have performance advantages.

Comparisons

Dragon vs Write invalidate protocols

Write Invalidate is another set of cache coherence protocols, where once a cache block is modified, the other values of the same block in other caches are not updated, but invalidated. [4] Write invalidate protocols are more efficient in cases where are there are many subsequent writes to the same cache block, as the invalidation happens once and further bus transactions by other processors are avoided. However, the write update protocol is more efficient in cases where a write to a block is followed by multiple reads to the same block. Since we are updating the other cached values once we write it, they have access to the data immediately. In such a case. the write invalidate protocol is highly disadvantageous because every time a cache block is modified in another cache, the rest of the caches need will encounter a coherence miss, and initiate a bus transaction to read the new value. In contrast, the write update protocol tends to, at times, keep the values of the block updated for longer than necessary, which will lead to an increase in other types of misses, i.e. conflict and capacity misses.

Dragon vs Firefly protocol

In case of Firefly, cache-to-cache transfers of modified blocks are also written back to main memory at the same time. But since accesses made to the main memory are slower by orders of magnitude when compared to caches, it requires added complexity of performing a write back as a separate bus operation. In any case, it results in lower performance. [5] This problem is avoided altogether in case of Dragon protocol, since the shared blocks are not written back to memory at all. However, this comes at the cost of one added state (Shared-modified).

Related Research Articles

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

<span class="mw-page-title-main">Cache coherence</span> Computer architecture term concerning shared resource data

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

The MESI protocol is an Invalidate-based cache coherence protocol, and is one of the most common protocols that support write-back caches. It is also known as the Illinois protocol. Write back caches can save a lot of bandwidth that is generally wasted on a write through cache. There is always a dirty state present in write back caches that indicates that the data in the cache is different from that in main memory. The Illinois Protocol requires a cache to cache transfer on a miss if the block resides in another cache. This protocol reduces the number of main memory transactions with respect to the MSI protocol. This marks a significant improvement in performance.

In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems like distributed shared memory systems or distributed data stores. Consistency is different from coherence, which occurs in systems that are cached or cache-less, and is consistency of data with respect to all processors. Coherence deals with maintaining a global order in which writes to a single location or single variable are seen by all processors. Consistency deals with the ordering of operations to multiple locations with respect to all processors.

Bus snooping or bus sniffing is a scheme by which a coherency controller (snooper) in a cache monitors or snoops the bus transactions, and its goal is to maintain a cache coherency in distributed shared memory systems. This scheme was introduced by Ravishankar and Goodman in 1983, under the name "write-once" cache coherency. A cache containing a coherency controller (snooper) is called a snoopy cache.

In computer science, the test-and-set instruction is an instruction used to write (set) 1 to a memory location and return its old value as a single atomic operation. The caller can then "test" the result to see if the state was changed by the call. If multiple processes may access the same memory location, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process's test-and-set is finished. A central processing unit (CPU) may use a test-and-set instruction offered by another electronic component, such as dual-port RAM; a CPU itself may also offer a test-and-set instruction.

Memory coherence is an issue that affects the design of computer systems in which two or more processors or cores share a common area of memory.

In computer science, distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as a single shared address space. The term "shared" does not mean that there is a single centralized memory, but that the address space is shared—i.e., the same physical address on two processors refers to the same location in memory. Distributed global address space (DGAS), is a similar term for a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's private memory.

Memcached is a general-purpose distributed memory-caching system. It is often used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source must be read. Memcached is free and open-source software, licensed under the Revised BSD license. Memcached runs on Unix-like operating systems and on Microsoft Windows. It depends on the libevent library.

In computing, the MSI protocol - a basic cache-coherence protocol - operates in multiprocessor systems. As with other cache coherency protocols, the letters of the protocol name identify the possible states in which a cache line can be.

The MOSI protocol is an extension of the basic MSI cache coherency protocol. It adds the Owned state, which indicates that the current processor owns this block, and will service requests from other processors for the block.

(For a detailed description see Cache coherency protocols )

In cache coherency protocol literature, Write-Once was the first MESI protocol defined. It has the optimization of executing write-through on the first write and a write-back on all subsequent writes, reducing the overall bus traffic in consecutive writes to the computer memory. It was first described by James R. Goodman in (1983). Cache coherence protocols are an important issue in Symmetric multiprocessing systems, where each CPU maintains a cache of the memory.

The Firefly cache coherence protocol is the schema used in the DEC Firefly multiprocessor workstation, developed by DEC Systems Research Center. This protocol is a 3 State Write Update Cache Coherence Protocol. Unlike the Dragon protocol, the Firefly protocol updates the Main Memory as well as the Local caches on Write Update Bus Transition. Thus the Shared Clean and Shared Modified States present in case of Dragon Protocol, are not distinguished between in case of Firefly Protocol.

The MERSI protocol is a cache coherency and memory coherence protocol used by the PowerPC G4. The protocol consists of five states, Modified (M), Exclusive (E), Read Only or Recent (R), Shared (S) and Invalid (I). The M, E, S and I states are the same as in the MESI protocol. The R state is similar to the E state in that it is constrained to be the only clean, valid, copy of that data in the computer system. Unlike the E state, the processor is required to initially request ownership of the cache line in the R state before the processor may modify the cache line and transition to the M state. In both the MESI and MERSI protocols, the transition from the E to M is silent.

The MESIF protocol is a cache coherency and memory coherence protocol developed by Intel for cache coherent non-uniform memory architectures. The protocol consists of five states, Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F).

Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache. If the lower level cache contains only blocks that are not present in the higher level cache, then the lower level cache is said to be exclusive of the higher level cache. If the contents of the lower level cache are neither strictly inclusive nor exclusive of the higher level cache, then it is called non-inclusive non-exclusive (NINE) cache.

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.

Directory-based coherence is a mechanism to handle Cache coherence problem in Distributed shared memory (DSM) a.k.a. Non-Uniform Memory Access (NUMA). Another popular way is to use a special type of computer bus between all the nodes as a "shared bus". Directory-based coherence uses a special directory to serve instead of the shared bus in the bus-based coherence protocols. Both of these designs use the corresponding medium as a tool to facilitate the communication between different nodes, and to guarantee that the coherence protocol is working properly along all the communicating nodes. In directory based cache coherence, this is done by using this directory to keep track of the status of all cache blocks, the status of each block includes in which cache coherence "state" that block is, and which nodes are sharing that block at that time, which can be used to eliminate the need to broadcast all the signals to all nodes, and only send it to the nodes that are interested in this single block.

Examples of coherency protocols for cache memory are listed here. For simplicity, all "miss" Read and Write status transactions which obviously come from state "I", in the diagrams are not shown. They are shown directly on the new state. Many of the following protocols have only historical value. At the moment the main protocols used are the R-MESI type / MESIF protocols and the HRT-ST-MESI or a subset or an extension of these.

References

  1. Atkinson, Russell R.; McCreight, Edward M. (1987-01-01). "The dragon processor". Proceedings of the second international conference on Architectural support for programming languages and operating systems. ASPLOS II. Los Alamitos, CA, USA: IEEE Computer Society Press. pp. 65–69. doi:10.1145/36206.36185. ISBN   978-0818608056. S2CID   7019821.
  2. Stenström, Per (1990-06-01). "A Survey of Cache Coherence Schemes for Multiprocessors". Computer. 23 (6): 12–24. doi:10.1109/2.55497. ISSN   0018-9162.
  3. Culler, David; Singh, Jaswinder Pal; Gupta, Anoop (1999). Parallel Computer Architecture, 1st Edition | David Culler, Jaswinder Pal Singh, Anoop Gupta |. ISBN   9781558603431 . Retrieved 2016-10-24.{{cite book}}: |website= ignored (help)
  4. Solihin, Yan (2015). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. pp. 205–206, 231–232. ISBN   9781482211184.
  5. Archibald, James; Baer, Jean-Loup (1986-09-01). "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model". ACM Trans. Comput. Syst. 4 (4): 273–298. doi:10.1145/6513.6514. ISSN   0734-2071. S2CID   713808.

See also