Directory-based coherence

Directory-based coherence is a mechanism for handling the cache coherence problem in distributed shared memory (DSM), also known as non-uniform memory access (NUMA), systems. The other popular approach is to connect all nodes with a special shared bus (also called a system bus) and rely on bus-based coherence protocols. [1] Directory-based coherence uses a special directory in place of the shared bus of the bus-based protocols. Both designs use the corresponding medium (directory or bus) to facilitate communication between nodes and to guarantee that the coherence protocol works correctly across all communicating nodes. In directory-based cache coherence, the directory keeps track of the status of every cache block: which coherence "state" the block is in, and which nodes are sharing the block at that time. This eliminates the need to broadcast every signal to all nodes; instead, a message is sent only to the nodes that are interested in that single block.

The directory-based approach has both advantages and disadvantages compared with bus-based designs. Its main advantage is that coherence traffic is sent point-to-point rather than broadcast, so it scales to large node counts; its main costs are the storage required for the directory itself and the extra indirection, and therefore latency, of consulting the directory on a miss.

It follows that bus-based systems are more attractive for relatively small systems, while directory-based systems become crucial as the system scales up and the number of nodes grows. There is thus a trade-off between simplicity and scalability when comparing bus-based and directory-based cache coherence designs. [1]

History

The idea of directory-based cache coherence systems began long ago. A directory-based cache organization was first proposed by C.K. Tang [2] in the mid-1970s. Applying the idea at scale came later, in the late 1980s, when researchers at Stanford University built the first version of such a coherence system, Stanford DASH (Directory Architecture for SHared memory), in a paper [3] that described the system along with the difficulties and improvements associated with such designs. Besides this approach, several attempts were made to provide scalable systems. For instance, the BBN Butterfly [4], introduced in 1985, and the IBM RP3 [5], introduced in 1987, are examples of scalable multiprocessor systems. However, both of these systems have a drawback: the BBN Butterfly does not have caches, and the IBM RP3 does not provide hardware cache coherence, which limits the performance of both designs, especially when employing high-performance processors. [6]

The limitations of these competitors made DASH-style designs the natural choice for cache coherence systems, and for other systems needing scalability across cache-based nodes. In 1984, James Archibald [7] and Jean-Loup Baer of the University of Washington published a paper [8] proposing a variation of the "global directory" approach that is more economical, expandable, and modular in terms of hardware use.

In 1992, Daniel Lenoski of Stanford University published a paper [9] proposing advances in cache coherence protocols for directory-based systems. In a 1997 paper, [10] he introduced the design of the SGI Origin 2000, a family of server computers employing directory-based cache coherence. The subsequent Origin 3000 [11] was introduced in July 2000.

Protocols

Unlike snoopy coherence protocols, in a directory-based coherence approach the information about which caches hold a copy of a block is maintained in a structure called the directory. In a directory-based scheme, participating caches do not broadcast requests to all other caches sharing the block in order to locate cached copies. Instead, a cache queries the directory to find out which caches hold copies of the block and sends messages only to those particular processors, so the traffic saving compared to a snoopy protocol is large. In well-optimized applications, most shared data is read-only, and there is little sharing of data that is frequently read and written; in such applications a directory approach yields a substantial traffic saving over a broadcast/snoopy approach.
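
To make the bookkeeping concrete, the following minimal sketch (in C) shows one plausible shape for a directory entry, assuming a full-bit-vector representation of the sharer set; all names (dir_entry_t, multicast_to_sharers, NUM_NODES) are illustrative, not taken from any particular system:

    #include <stdint.h>

    #define NUM_NODES 64  /* number of processor nodes (illustrative) */

    /* Directory view of a block's coherence state (U / EM / S below). */
    typedef enum {
        DIR_U,   /* Uncached: no cache holds the block               */
        DIR_EM,  /* Exclusive/Modified: exactly one cache owns it    */
        DIR_S    /* Shared: one or more read-only copies exist       */
    } dir_state_t;

    /* One directory entry per memory block: the state plus a full
       bit vector recording which nodes currently cache the block.   */
    typedef struct {
        dir_state_t state;
        uint64_t    sharers;  /* bit i set => node i holds a copy    */
    } dir_entry_t;

    /* Deliver a message only to the nodes that cache the block,
       instead of broadcasting to all NUM_NODES nodes.               */
    static void multicast_to_sharers(const dir_entry_t *e,
                                     void (*send)(int node)) {
        for (int node = 0; node < NUM_NODES; node++)
            if (e->sharers & (1ULL << node))
                send(node);
    }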

[Figure: Directory-based coherence scheme overview diagram showing the various actors and messages.]

As shown in the data flow diagram, the actors involved in a distributed shared memory system implementing a directory-based coherence protocol are the requestor node (the node issuing a read or write request for a block), the directory node (which holds the sharing information for the block), and the owner node (the node currently holding the valid copy of the block).

[Image 1: State transition diagram for the directory-based protocol.]

Requestor and owner nodes maintain their state transitions much as in a snoopy coherence protocol such as MESI. However, unlike a bus-based implementation, where nodes communicate over a common bus, a directory-based implementation uses a message-passing model to exchange the information required to maintain cache coherence.

The directory node acts as a serializing point: all communications are directed through it to maintain correctness.
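
Because all traffic funnels through the directory, the protocol's vocabulary is just a handful of message types. As a sketch, they can be written as two C enumerations whose names mirror the requests (BusRd, BusRdX, BusUpgr) and responses (ReplyD, Reply, Int, Inv) appearing in the table below:

    /* Requests a node sends to the directory, and the messages sent
       in response; the names mirror those in the table below.       */
    typedef enum {
        BUS_RD,    /* read miss: requestor wants a read-only copy    */
        BUS_RDX,   /* write miss: requestor wants an exclusive copy  */
        BUS_UPGR   /* requestor already shares the block, wants to write */
    } dir_request_t;

    typedef enum {
        MSG_REPLY_D,  /* ReplyD: reply carrying the data block       */
        MSG_REPLY,    /* Reply: data-less reply, e.g. upgrade grant  */
        MSG_INT,      /* Int: intervention, owner must supply block  */
        MSG_INV       /* Inv: invalidation, discard the cached copy  */
    } dir_message_t;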

Directory node

A directory node keeps track of the overall state of a cache block in the entire cache system for all processors. It can be in three states:

  • Uncached (U): no processor has the block cached; memory holds the only up-to-date copy.
  • Exclusive/Modified (EM): exactly one processor (the owner) has the block, and its copy may be newer than memory.
  • Shared (S): one or more processors hold read-only copies of the block, and memory is up to date.

The directory's state-transition finite-state machine (see Image 1) is summarized in the table below:

| Initial state | Bus request | Response / action | New state |
|---------------|-------------|-------------------|-----------|
| U | BusRd or BusRdX | Fetch the block from memory, since memory holds the only up-to-date copy; send it to the requestor (ReplyD); the requestor becomes the first sharer. | EM |
| EM | BusRd | Send an intervention (Int) to the owner. | S |
| EM | BusRdX | Send an invalidation (Inv) to the current owner; the requestor becomes the new owner. | EM (unchanged) |
| S | BusRd | Reply to the requestor with the memory block (ReplyD). | S (unchanged) |
| S | BusRdX | Reply to the requestor with the memory block (ReplyD); invalidate (Inv) all sharers. | EM |
| S | BusUpgr | Invalidate (Inv) all sharers; reply to the requestor that it may upgrade (Reply). | EM |
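
Read alongside Image 1, the table can be transcribed almost mechanically into a directory-side request handler. The sketch below does so, reusing the illustrative types from the earlier fragments; the send_* helpers are hypothetical placeholders for the interconnect, not part of any real system's API:

    /* Placeholder interconnect hooks; declarations only, illustrative. */
    void send_data_reply(int node);                    /* ReplyD */
    void send_upgrade_ack(int node);                   /* Reply  */
    void send_intervention(int owner, int requestor);  /* Int    */
    void send_invalidation(int node);                  /* Inv    */
    void invalidate_sharers_except(dir_entry_t *e, int node);
    int  first_sharer(const dir_entry_t *e);           /* owner lookup */

    /* Directory-side handling of one request, transcribing the table. */
    void dir_handle_request(dir_entry_t *e, dir_request_t req,
                            int requestor) {
        switch (e->state) {
        case DIR_U:
            /* BusRd or BusRdX: memory has the only copy; send it and
               make the requestor the first sharer (and owner).        */
            send_data_reply(requestor);                /* ReplyD      */
            e->sharers = 1ULL << requestor;
            e->state   = DIR_EM;                       /* U -> EM     */
            break;

        case DIR_EM: {
            int owner = first_sharer(e);
            if (req == BUS_RD) {
                send_intervention(owner, requestor);   /* Int         */
                e->sharers |= 1ULL << requestor;
                e->state = DIR_S;                      /* EM -> S     */
            } else {                                   /* BusRdX      */
                send_invalidation(owner);              /* Inv         */
                e->sharers = 1ULL << requestor;        /* stays EM    */
            }
            break;
        }

        case DIR_S:
            if (req == BUS_RD) {
                send_data_reply(requestor);            /* ReplyD; stays S */
                e->sharers |= 1ULL << requestor;
            } else {
                /* BusRdX or BusUpgr: all other sharers are invalidated. */
                invalidate_sharers_except(e, requestor);  /* Inv      */
                if (req == BUS_RDX)
                    send_data_reply(requestor);        /* ReplyD      */
                else
                    send_upgrade_ack(requestor);       /* Reply       */
                e->sharers = 1ULL << requestor;
                e->state   = DIR_EM;                   /* S -> EM     */
            }
            break;
        }
    }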

In addition to the cache state, the directory must track which processors hold the data when a block is in the shared state. This is required for sending invalidation and intervention requests to the individual processor caches that hold the block. Two popular implementation approaches are the full bit vector, which keeps one presence bit per processor for each block, and the coarse bit vector, which keeps one bit per group of processors to reduce directory storage at the cost of extra invalidation traffic (see the sketch below).
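
The difference between the two formats is only in how a node index maps to a presence bit; the full-bit-vector case was already shown above. A minimal sketch of the coarse variant, with an arbitrarily chosen group size:

    #include <stdint.h>

    #define NODES_PER_GROUP 4   /* arbitrary group size, illustrative */

    /* Coarse bit vector: one bit per group of NODES_PER_GROUP nodes.
       A set bit means *some* node in the group may hold a copy, so an
       invalidation must go to every node of every marked group.      */
    static inline void mark_sharer_coarse(uint64_t *groups, int node) {
        *groups |= 1ULL << (node / NODES_PER_GROUP);
    }

    static inline int group_may_share(uint64_t groups, int node) {
        return (int)((groups >> (node / NODES_PER_GROUP)) & 1);
    }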

The protocol described above is a basic implementation; race conditions can occur because the directory can be out of sync with the caches and because messages between processors can overlap. More complex implementations, such as the Scalable Coherent Interface, have many more states.

The DASH [3] cache coherence protocol is another protocol that uses a directory-based coherence scheme. DASH uses a clustered approach: processors inside a cluster are kept coherent using a bus-based snooping scheme, while the clusters themselves are connected using a directory approach. Although different protocols implement the tracking of cache blocks differently, the concept of the directory remains the same.

Related Research Articles

<span class="mw-page-title-main">Non-uniform memory access</span> Computer memory design used in multiprocessing

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory. The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

<span class="mw-page-title-main">Scalable Coherent Interface</span> High-speed interconnect standard for shared memory multiprocessing and message passing

The Scalable Coherent Interface or Scalable Coherent Interconnect (SCI) is a high-speed interconnect standard for shared memory multiprocessing and message passing. The goal was to scale well, provide system-wide memory coherence, and offer a simple interface; that is, a standard to replace existing buses in multiprocessor systems with one having no inherent scalability and performance limitations.

<span class="mw-page-title-main">Cache coherence</span> Computer architecture term concerning shared resource data

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

<span class="mw-page-title-main">Distributed memory</span> Multiprocessing memory architecture

In computer science, distributed memory refers to a multiprocessor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data are required, the computational task must communicate with one or more remote processors. In contrast, a shared memory multiprocessor offers a single memory space used by all processors. Processors do not have to be aware where data resides, except that there may be performance penalties, and that race conditions are to be avoided.

The MESI protocol is an Invalidate-based cache coherence protocol, and is one of the most common protocols that support write-back caches. It is also known as the Illinois protocol due to its development at the University of Illinois at Urbana-Champaign. Write back caches can save considerable bandwidth generally wasted on a write through cache. There is always a dirty state present in write back caches that indicates that the data in the cache is different from that in main memory. The Illinois Protocol requires a cache to cache transfer on a miss if the block resides in another cache. This protocol reduces the number of main memory transactions with respect to the MSI protocol. This marks a significant improvement in performance.
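
For reference, the four states that give the protocol its name can be written out as a small C enumeration; a sketch only, with comments paraphrasing the usual meaning of each state:

    /* The four stable states of a cache line under MESI. */
    typedef enum {
        MESI_MODIFIED,   /* only valid copy, dirty: differs from memory */
        MESI_EXCLUSIVE,  /* only cached copy, clean: matches memory     */
        MESI_SHARED,     /* clean copy that other caches may also hold  */
        MESI_INVALID     /* holds no usable data                        */
    } mesi_state_t;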

In computer science, a consistency model specifies a contract between the programmer and a system: the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems such as distributed shared memory systems and distributed data stores. Consistency is distinct from coherence: coherence deals with maintaining a global order in which writes to a single location or variable are seen by all processors, whereas consistency deals with the ordering of operations to multiple locations with respect to all processors.

Bus snooping or bus sniffing is a scheme by which a coherency controller (snooper) in a cache monitors or snoops the bus transactions, and its goal is to maintain a cache coherency in distributed shared memory systems. This scheme was introduced by Ravishankar and Goodman in 1983, under the name "write-once" cache coherency. A cache containing a coherency controller (snooper) is called a snoopy cache.

Memory coherence is an issue that affects the design of computer systems in which two or more processors or cores share a common area of memory.

In computer science, distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as a single shared address space. The term "shared" does not mean that there is a single centralized memory, but that the address space is shared—i.e., the same physical address on two processors refers to the same location in memory. Distributed global address space (DGAS), is a similar term for a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's private memory.

Cache only memory architecture (COMA) is a computer memory organization for use in multiprocessors in which the local memories at each node are used as cache. This is in contrast to using the local memories as actual main memory, as in NUMA organizations.

In computing, the MSI protocol, a basic cache-coherence protocol, operates in multiprocessor systems. As with other cache coherency protocols, the letters of the protocol name identify the possible states in which a cache line can be.

The MOSI protocol is an extension of the basic MSI cache coherency protocol. It adds the Owned state, which indicates that the current processor owns this block, and will service requests from other processors for the block.

Fireplane is a computer internal interconnect created by Sun Microsystems.

The Firefly cache coherence protocol is the schema used in the DEC Firefly multiprocessor workstation, developed by the DEC Systems Research Center. It is a three-state, write-update cache coherence protocol. Unlike the Dragon protocol, the Firefly protocol updates main memory as well as the local caches on a write-update bus transaction. Thus the Shared Clean and Shared Modified states present in the Dragon protocol are not distinguished in the Firefly protocol.

The Dragon Protocol is an update based cache coherence protocol used in multi-processor systems. Write propagation is performed by directly updating all the cached values across multiple processors. Update based protocols such as the Dragon protocol perform efficiently when a write to a cache block is followed by several reads made by other processors, since the updated cache block is readily available across caches associated with all the processors.

<span class="mw-page-title-main">Shared memory</span> Computer memory that can be accessed by multiple processes

In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors.

Stanford DASH was a cache coherent multiprocessor developed in the late 1980s by a group led by Anoop Gupta, John L. Hennessy, Mark Horowitz, and Monica S. Lam at Stanford University. It was based on adding a pair of directory boards designed at Stanford to up to 16 SGI IRIS 4D Power Series machines and then cabling the systems in a mesh topology using a Stanford-modified version of the Torus Routing Chip. The boards designed at Stanford implemented a directory-based cache coherence protocol allowing Stanford DASH to support distributed shared memory for up to 64 processors. Stanford DASH was also notable for both supporting and helping to formalize weak memory consistency models, including release consistency. Because Stanford DASH was the first operational machine to include scalable cache coherence, it influenced subsequent computer science research as well as the commercially available SGI Origin 2000. Stanford DASH is included in the 25th anniversary retrospective of selected papers from the International Symposium on Computer Architecture and several computer science books, has been simulated by the University of Edinburgh, and is used as a case study in contemporary computer science classes.

In computer engineering, directory-based cache coherence is a type of cache coherence mechanism in which directories are used to manage caches in place of bus snooping, since bus snooping methods scale poorly due to their use of broadcasting. Directory-based methods can be used to target both the performance and the scalability of cache coherent systems.

Database scalability is the ability of a database to handle changing demands by adding/removing resources. Databases use a host of techniques to cope.

Examples of coherency protocols for cache memory are listed here. For simplicity, the diagrams omit the "miss" read and write transactions that necessarily come from state "I"; they are shown directly on the new state. Many of the following protocols have only historical value. At the moment the main protocols in use are the R-MESI type / MESIF protocols and the HRT-ST-MESI, or a subset or an extension of these.

References

  1. Solihin, Yan (2009). Fundamentals of Parallel Computer Architecture. pp. 319–360.
  2. Tang, C.K. "Cache system design in the tightly coupled multiprocessor system". AFIPS '76 Proceedings of the June 7–10, 1976, National Computer Conference and Exposition.
  3. 1 2 "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor" (PDF). Computer Systems Laboratory.
  4. Schmidt, G.E. "The Butterfly Parallel Processor". In Proc. Of ICS.
  5. "The IBM research parallel processor prototype PR3: Introduction and architicture". In Proceeding of the 1985 International Conference of Parallel Processing.
  6. "Design of Scalable Shared-Memory Multiprocessors: The DASH approach". Computer System Laboratory, Stanford University.
  7. "James Archibald". ece.byu.edu. Archived from the original on 2017-08-02. Retrieved 2016-11-15.
  8. "An economical solution to the cache coherence problem". ISCA '84 Proceedings of the 11th Annual International Symposium on Computer Architecture.
  9. Lenoski, Daniel; Laudon, James; Gharachorloo, Kourosh; Weber, Wolf-Dietrich; Gupta, Anoop; Hennessy, John; Horowitz, Mark; Lam, Monica S. (1992-03-01). "The Stanford Dash Multiprocessor". Computer. 25 (3): 63–79. doi:10.1109/2.121510. ISSN   0018-9162. S2CID   9731523.
  10. Laudon, James; Lenoski, Daniel (1997-01-01). "The SGI Origin". Proceedings of the 24th annual international symposium on Computer architecture - ISCA '97. ISCA '97. New York, NY, USA: ACM. pp. 241–251. doi:10.1145/264107.264206. ISBN   978-0897919012. S2CID   692050.
  11. Silicon Graphics International Corp. "Support Home Page". support1-sgi.custhelp.com. Archived from the original on 2018-04-13. Retrieved 2016-11-16.
  12. Solihin, Yan (2009). Fundamentals of Parallel Multicore Architecture. pp. 319–361.