MESI protocol


The MESI protocol is an invalidate-based cache coherence protocol and one of the most common protocols that support write-back caches. It is also known as the Illinois protocol because it was developed at the University of Illinois at Urbana-Champaign. [1] Write-back caches can save considerable bandwidth that a write-through cache would spend sending every store to main memory. A write-back cache always has a dirty state, which indicates that the data in the cache differs from the data in main memory. The Illinois protocol requires a cache-to-cache transfer on a miss if the block resides in another cache. Compared with the MSI protocol, it reduces the number of main memory transactions, which improves performance significantly. [2]


States

The letters in the acronym MESI represent four mutually exclusive states that a cache line can be marked with (encoded using two additional bits per line):

Modified (M)
The cache line is present only in the current cache, and is dirty - it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Shared (S) state.
Exclusive (E)
The cache line is present only in the current cache, but is clean - it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
Shared (S)
Indicates that this cache line may be stored in other caches of the machine and is clean - it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid (I)
Indicates that this cache line is invalid (unused).

For any given pair of caches, the permitted states of a given cache line are as follows:

     M    E    S    I
M    no   no   no   yes
E    no   no   no   yes
S    no   no   yes  yes
I    yes  yes  yes  yes

When a block is marked M (Modified) or E (Exclusive) in one cache, the copies of the block in all other caches are marked I (Invalid).
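
The compatibility rule above can be written as a small predicate. The following is a minimal sketch; the function names `compatible` and `permitted_pairs` are illustrative and do not come from any real library.

```python
from itertools import product

STATES = ("M", "E", "S", "I")

def compatible(a: str, b: str) -> bool:
    """Return True if two different caches may hold the same line in states a and b."""
    if a == "I" or b == "I":
        return True               # an Invalid copy never conflicts with anything
    return a == "S" and b == "S"  # otherwise only Shared/Shared may coexist

def permitted_pairs():
    """Enumerate every permitted (state in cache A, state in cache B) combination."""
    return {(a, b) for a, b in product(STATES, repeat=2) if compatible(a, b)}
```

Enumerating all sixteen combinations with `permitted_pairs()` yields the same eight permitted cells as the table.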

Operation

Image 1.1 State diagram for MESI protocol Red: Bus initiated transaction. Black: Processor initiated transactions.

The MESI protocol is defined by a finite-state machine that moves from one state to another in response to two kinds of stimuli.

The first stimulus is a read or write request from the local processor. For example, processor P1 has block X in its cache, and the processor requests to read from or write to that block.

The second stimulus is given through the bus connecting the processors. In particular, the "Bus side requests" come from other processors that don't have the cache block or the updated data in their Cache. The bus requests are monitored with the help of Snoopers, [4] which monitor all the bus transactions.

Following are the different types of Processor requests and Bus side requests:

Processor Requests to Cache include the following operations:

  1. PrRd: The processor requests to read a cache block.
  2. PrWr: The processor requests to write a cache block.

Bus side requests are the following:

  1. BusRd: Snooped request indicating that another processor has requested to read a cache block.
  2. BusRdX: Snooped request indicating that another processor, which does not already have the block, has requested to write to it.
  3. BusUpgr: Snooped request indicating that another processor, which already has the block in its own cache, has requested to write to it.
  4. Flush: Snooped request indicating that another processor is writing an entire cache block back to main memory.
  5. FlushOpt: Snooped request indicating that an entire cache block is posted on the bus in order to supply it to another processor (cache-to-cache transfer).

(Such Cache to Cache transfers can reduce the read miss latency if the latency to bring the block from the main memory is more than from Cache to Cache transfers, which is generally the case in bus based systems.)

Snooping operation: In a snooping system, all caches on a bus monitor all the transactions on that bus. Every cache keeps the sharing status of every block of physical memory it has stored. The state of a block is changed according to the state diagram of the protocol used (refer to the image above for the MESI state diagram). The bus has snoopers on both sides:

  1. On the processor/cache side, each cache has its own snooper.
  2. On the memory side, the snooping function is performed by the memory controller.

Explanation:

Each cache block has its own 4-state finite-state machine (refer to Image 1.1). The state transitions and the responses in a particular state to the different inputs are shown in Table 1.1 and Table 1.2.

Table 1.1 State transitions and responses to processor operations
Each entry below gives the initial state and the operation, followed by the response.
Invalid (I), PrRd:
  • Issue BusRd to the bus.
  • Other caches see BusRd, check whether they have a valid copy, and inform the sending cache.
  • State transition to (S) Shared, if other caches have a valid copy.
  • State transition to (E) Exclusive, if none do (must ensure all others have reported).
  • If other caches have a copy, one of them sends the value; otherwise fetch from main memory.
Invalid (I), PrWr:
  • Issue BusRdX signal on the bus.
  • State transition to (M) Modified in the requestor cache.
  • If other caches have a copy, they send the value; otherwise fetch from main memory.
  • If other caches have a copy, they see the BusRdX signal and invalidate their copies.
  • Write into the cache block modifies the value.
Exclusive (E), PrRd:
  • No bus transactions generated.
  • State remains the same.
  • Read to the block is a cache hit.
Exclusive (E), PrWr:
  • No bus transactions generated.
  • State transition from Exclusive to (M) Modified.
  • Write to the block is a cache hit.
Shared (S), PrRd:
  • No bus transactions generated.
  • State remains the same.
  • Read to the block is a cache hit.
Shared (S), PrWr:
  • Issue BusUpgr signal on the bus.
  • State transition to (M) Modified.
  • Other caches see BusUpgr and mark their copies of the block as (I) Invalid.
Modified (M), PrRd:
  • No bus transactions generated.
  • State remains the same.
  • Read to the block is a cache hit.
Modified (M), PrWr:
  • No bus transactions generated.
  • State remains the same.
  • Write to the block is a cache hit.
Table 1.2 State transitions and responses to bus operations
Each entry below gives the initial state and the snooped operation, followed by the response.
Invalid (I), BusRd:
  • No state change. Signal ignored.
Invalid (I), BusRdX/BusUpgr:
  • No state change. Signal ignored.
Exclusive (E), BusRd:
  • Transition to Shared (since it implies a read taking place in another cache).
  • Put FlushOpt on the bus together with the contents of the block.
Exclusive (E), BusRdX:
  • Transition to Invalid.
  • Put FlushOpt on the bus, together with the data from the now-invalidated block.
Shared (S), BusRd:
  • No state change (another cache performed a read on this block, so it is still shared).
  • May put FlushOpt on the bus together with the contents of the block (which cache in the Shared state does this is a design choice).
Shared (S), BusRdX/BusUpgr:
  • Transition to Invalid (the cache that sent BusRdX/BusUpgr becomes Modified).
  • May put FlushOpt on the bus together with the contents of the block (which cache in the Shared state does this is a design choice).
Modified (M), BusRd:
  • Transition to (S) Shared.
  • Put FlushOpt on the bus with the data. Received by the sender of BusRd and by the memory controller, which writes to main memory.
Modified (M), BusRdX:
  • Transition to (I) Invalid.
  • Put FlushOpt on the bus with the data. Received by the sender of BusRdX and by the memory controller, which writes to main memory.
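
The two tables can be condensed into a pair of transition functions. This is a minimal sketch of the per-block state machine as tabulated above, not a model of any particular hardware; the names `on_processor_request`, `on_bus_request` and `SNOOP` are illustrative, and the `others_have_copy` argument stands in for the shared signal that other caches raise on the bus.

```python
def on_processor_request(state, op, others_have_copy=False):
    """Return (next_state, bus_request) for a PrRd or PrWr in the given state."""
    if op == "PrRd":
        if state == "I":
            # Read miss: Shared if another cache reports a copy, else Exclusive.
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None            # M, E and S: read hit, no bus traffic
    if op == "PrWr":
        if state == "I":
            return "M", "BusRdX"      # write miss: fetch and invalidate others
        if state == "S":
            return "M", "BusUpgr"     # data already present: invalidate only
        return "M", None              # E -> M silently; M stays M
    raise ValueError(f"unknown processor request: {op}")

# (state, snooped request) -> (next_state, response put on the bus)
SNOOP = {
    ("E", "BusRd"):   ("S", "FlushOpt"),
    ("E", "BusRdX"):  ("I", "FlushOpt"),
    ("S", "BusRd"):   ("S", "FlushOpt (optional)"),
    ("S", "BusRdX"):  ("I", "FlushOpt (optional)"),
    ("S", "BusUpgr"): ("I", "FlushOpt (optional)"),
    ("M", "BusRd"):   ("S", "FlushOpt"),
    ("M", "BusRdX"):  ("I", "FlushOpt"),
}

def on_bus_request(state, bus_op):
    """Return (next_state, response) for a snooped bus request."""
    if state == "I":
        return "I", None              # Invalid lines ignore every bus request
    # E/M + BusUpgr is absent on purpose: no other cache can hold the line in S.
    return SNOOP[(state, bus_op)]
```

For example, `on_processor_request("S", "PrWr")` returns `("M", "BusUpgr")`, and `on_bus_request("M", "BusRd")` returns `("S", "FlushOpt")`.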

A write may only be performed freely if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation known as Read For Ownership (RFO).

A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all the other caches in the system) of the corresponding main memory location and insert the data that it holds. This can be done by forcing the read to back off (i.e. retry later), then writing the data to main memory and changing the cache line to the Shared state. It can also be done by sending the data from the Modified cache to the cache performing the read. Note that snooping is only required for read misses: the protocol ensures that a line cannot be Modified in one cache while any other cache can perform a read hit on it.

A cache that holds a line in the Shared state must listen for invalidate or request-for-ownership broadcasts from other caches, and discard the line (by moving it into Invalid state) on a match.

The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another cache discards a Shared line, this cache may become the sole owner of that cache line, but it will not be promoted to Exclusive state. Other caches do not broadcast notices when they discard cache lines, and this cache could not use such notifications without maintaining a count of the number of shared copies.

In that sense the Exclusive state is an opportunistic optimization: If the CPU wants to modify a cache line in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.

Illustration of MESI protocol operations

Consider the following stream of read/write references. All the references are to the same location, and the digit refers to the processor issuing the reference.

The stream is: R1, W1, R3, W3, R1, R3, R2.

Initially it is assumed that all the caches are empty.

Table 1.3 An example of how MESI works. All operations are to the same cache block (for example, "R3" means a read of the block by processor 3).
Step  Local request  P1  P2  P3  Generated bus request  Data supplier
0     Initially      -   -   -   -                      -
1     R1             E   -   -   BusRd                  Mem
2     W1             M   -   -   -                      -
3     R3             S   -   S   BusRd                  P1's Cache
4     W3             I   -   M   BusUpgr                -
5     R1             S   -   S   BusRd                  P3's Cache
6     R3             S   -   S   -                      -
7     R2             S   S   S   BusRd                  P1/P3's Cache

Note: The term snooping used above refers to a protocol for maintaining cache coherency in symmetric multiprocessing environments. All the caches on the bus monitor (snoop) the bus to determine whether they have a copy of the block of data that is requested on the bus.
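
The reference stream can be replayed with a few lines of code. The sketch below is an illustrative simulation of a single cache block under the transitions described in this article; the function name `simulate` and the row layout are assumptions made for the example. A never-loaded line is written "-" and treated like Invalid, as in the table.

```python
VALID = {"M", "E", "S"}

def simulate(stream, n_procs=3):
    """Replay references such as "R1" or "W3" against one cache block.

    Returns one row per reference:
    (reference, state of P1, ..., state of Pn, bus request, data supplier).
    """
    state = {p: "-" for p in range(1, n_procs + 1)}   # "-" means never loaded
    rows = []
    for ref in stream:
        op, p = ref[0], int(ref[1:])
        others = [q for q in state if q != p and state[q] in VALID]
        bus = supplier = None
        if op == "R":
            if state[p] not in VALID:                 # read miss
                bus = "BusRd"
                supplier = "/".join(f"P{q}" for q in others) or "Mem"
                for q in others:
                    state[q] = "S"                    # M, E and S holders become Shared
                state[p] = "S" if others else "E"
        else:                                         # write
            if state[p] not in VALID:                 # write miss
                bus = "BusRdX"
                supplier = "/".join(f"P{q}" for q in others) or "Mem"
            elif state[p] == "S":
                bus = "BusUpgr"                       # data already present
            for q in others:
                state[q] = "I"                        # invalidate all other copies
            state[p] = "M"
        rows.append((ref, *(state[q] for q in sorted(state)), bus, supplier))
    return rows
```

Calling `simulate(["R1", "W1", "R3", "W3", "R1", "R3", "R2"])` produces the states, bus requests and suppliers of steps 1 to 7 in Table 1.3, with `None` where the table shows "-" in the last two columns.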


Read For Ownership

A Read For Ownership (RFO) is an operation in cache coherency protocols that combines a read and an invalidate broadcast. A processor issues it when it tries to write to a cache line that is in the Shared (S) or Invalid (I) state of the MESI protocol. The operation causes all other caches to set the state of that line to I. An RFO transaction is a read with intent to write to that memory address, so it is exclusive: it brings the data into the requesting cache and invalidates the line in all other processor caches that hold it. This is termed "BusRdX" in the tables above; when the requester already holds the line in the S state, the tables use "BusUpgr" instead, since the data is already present.

Memory Barriers

MESI in its naive, straightforward implementation exhibits two particular performance issues. First, when writing to an invalid cache line, there is a long delay while the line is fetched from other CPUs. Second, moving cache lines to the invalid state is time-consuming. To mitigate these delays, CPUs implement store buffers and invalidate queues. [5]

Store Buffer

A store buffer is used when writing to an invalid cache line. Because the write will proceed anyway, the CPU issues a read-invalidate message (so the line is fetched and every other CPU's copy of that memory address is invalidated) and then pushes the write into the store buffer, to be executed when the cache line finally arrives in the cache.

A direct consequence of the store buffer's existence is that when a CPU commits a write, that write is not immediately written in the cache. Therefore, whenever a CPU needs to read a cache line, it first scans its own store buffer for the existence of the same line, as there is a possibility that the same line was written by the same CPU before but hasn't yet been written in the cache (the preceding write is still waiting in the store buffer). Note that while a CPU can read its own previous writes in its store buffer, other CPUs cannot see those writes until they are flushed to the cache - a CPU cannot scan the store buffer of other CPUs.
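
The visibility rule described above can be illustrated with a toy model. This is a deliberately simplified sketch: the class `Core`, its methods, and the shared dictionary standing in for the coherent cache contents are all assumptions made for the example, and real store buffers are hardware structures with many more constraints.

```python
class Core:
    """One CPU with a private store buffer in front of the coherent caches."""

    def __init__(self, coherent):
        self.coherent = coherent     # dict standing in for coherent cache contents
        self.store_buffer = []       # pending (address, value) writes, oldest first

    def store(self, addr, value):
        # The write is parked in the store buffer instead of waiting for the line.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Scan our own store buffer first; the newest matching store wins.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.coherent[addr]

    def drain(self):
        # Apply the buffered writes in order, making them visible to other cores.
        for a, v in self.store_buffer:
            self.coherent[a] = v
        self.store_buffer.clear()
```

With two cores sharing `coherent = {"x": 0}`, a `store("x", 1)` on the first core is returned by its own `load("x")` at once, while the second core keeps reading 0 until the first core calls `drain()`.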

Invalidate Queues

With regard to invalidation messages, CPUs implement invalidate queues, whereby incoming invalidate requests are instantly acknowledged but not immediately acted upon. Instead, invalidation messages simply enter an invalidation queue and their processing occurs as soon as possible (but not necessarily instantly). Consequently, a CPU can be oblivious to the fact that a cache line in its cache is actually invalid, as the invalidation queue contains invalidations that have been received but haven't yet been applied. Note that, unlike the store buffer, the CPU can't scan the invalidation queue, as that CPU and the invalidation queue are physically located on opposite sides of the cache.

As a result, memory barriers are required. A store barrier will flush the store buffer, ensuring all writes have been applied to that CPU's cache. A read barrier will flush the invalidation queue, thus ensuring that all writes by other CPUs become visible to the flushing CPU. Furthermore, memory management units do not scan the store buffer, causing similar problems. This effect is visible even in single threaded processors. [6]
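
The effect of an invalidate queue and a read barrier can be sketched in the same toy style. Everything here is an assumption made for illustration: the class `CachingCore`, its method names, and the dictionary standing in for the up-to-date value held elsewhere in the system.

```python
class CachingCore:
    """One CPU whose incoming invalidations are queued rather than applied at once."""

    def __init__(self, memory):
        self.memory = memory          # dict standing in for the up-to-date value
        self.cache = {}               # locally cached copies: address -> value
        self.invalidate_queue = []    # invalidations acknowledged but not yet applied

    def load(self, addr):
        if addr not in self.cache:    # miss: fetch the current value
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]       # hit: may be stale if an invalidation is queued

    def receive_invalidate(self, addr):
        self.invalidate_queue.append(addr)
        return "ack"                  # acknowledged instantly, processed later

    def read_barrier(self):
        # Flush the invalidation queue so later loads miss and refetch.
        for addr in self.invalidate_queue:
            self.cache.pop(addr, None)
        self.invalidate_queue.clear()
```

After another CPU changes a value and this core has acknowledged the invalidation, `load` still returns the stale copy; only after `read_barrier()` does the next `load` fetch the new value.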

Advantages of MESI over MSI

The most striking difference between MESI and MSI is the extra Exclusive state in the MESI protocol. Consider a processor that reads a block that no other processor holds and then writes to it. Under MSI this takes two bus transactions: a BusRd request to read the block, followed by a BusUpgr request before writing to it. The BusUpgr request in this scenario is useless because no other cache holds the block, but the requesting cache has no way of knowing this. MESI overcomes this limitation with the Exclusive state, which saves a bus request: the block is loaded in state E, and the later write moves it to M without any bus transaction. The saving matters most when a sequential application is running: only one processor works on a piece of data, so all its accesses are exclusive, and MSI performs worse because of the extra bus messages. A highly parallel application with minimal sharing of data benefits from MESI in the same way. Adding the Exclusive state also costs no extra state bits, as 3 states and 4 states are both representable with 2 bits.
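
The saved bus request can be counted directly. The following is a minimal sketch that tracks one cache's state for a block that no other cache holds; the function name `bus_requests` and the string encoding of the operations are illustrative choices, and the MSI write upgrade is modelled as BusUpgr, following the description above.

```python
def bus_requests(protocol, ops):
    """List the bus requests one cache issues for a block no other cache holds.

    protocol is "MSI" or "MESI"; ops is a string of "R" (read) and "W" (write).
    """
    state, requests = "I", []
    for op in ops:
        if op == "R":
            if state == "I":
                requests.append("BusRd")
                # MSI has no Exclusive state, so a private block is loaded as Shared.
                state = "E" if protocol == "MESI" else "S"
        elif op == "W":
            if state == "I":
                requests.append("BusRdX")
            elif state == "S":
                requests.append("BusUpgr")   # needed even though nobody else has it
            state = "M"                      # E -> M and M -> M need no bus request
    return requests
```

For a read followed by a write, `bus_requests("MSI", "RW")` gives `["BusRd", "BusUpgr"]`, while `bus_requests("MESI", "RW")` gives `["BusRd"]`.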

Disadvantage of MESI

When various caches continuously read and write a particular block, the data has to be flushed to the bus every time. Main memory picks up the data on every flush and therefore stays clean, but this is not required for correctness and is additional overhead caused by using MESI. This challenge was overcome by the MOESI protocol. [7]

In the S (Shared) state, multiple snoopers may respond with FlushOpt carrying the same data (see the example above). The F state in MESIF addresses this redundancy.


References

  1. Papamarcos, M. S.; Patel, J. H. (1984). "A low-overhead coherence solution for multiprocessors with private cache memories" (PDF). Proceedings of the 11th annual international symposium on Computer architecture - ISCA '84. p. 348. doi:10.1145/800015.808204. ISBN 0818605383. S2CID 195848872. Retrieved March 19, 2013.
  2. Gómez-Luna, J.; Herruzo, E.; Benavides, J.I. "MESI Cache Coherence Simulator for Teaching Purposes". Clei Electronic Journal. 12 (1, PAPER 5, APRIL 2009). CiteSeerX 10.1.1.590.6891.
  3. Culler, David (1997). Parallel Computer Architecture. Morgan Kaufmann Publishers. Figure 5–15, State transition diagram for the Illinois MESI protocol, p. 286.
  4. Bigelow, Narasiman, Suleman. "An evaluation of Snoopy Based Cache Coherence protocols" (PDF). ECE Department, University of Texas at Austin.
  5. Handy, Jim (1998). The Cache Memory Book. Morgan Kaufmann. ISBN 9780123229809.
  6. Chen, G.; Cohen, E.; Kovalev, M. (2014). "Store Buffer Reduction with MMUs". Verified Software: Theories, Tools and Experiments. Lecture Notes in Computer Science. Vol. 8471. p. 117. doi:10.1007/978-3-319-12154-3_8. ISBN 978-3-319-12153-6.
  7. "Memory System (Memory Coherency and Protocol)" (PDF). AMD64 Technology. September 2006.