Write buffer

Last updated January 27, 2025

A write buffer is a type of data buffer that can be used to hold data being written from the cache to main memory or to the next cache in the memory hierarchy to improve performance and reduce latency. It is used in certain CPU cache architectures like Intel's x86 and AMD64.^[1] In multi-core systems, write buffers destroy sequential consistency. Some software disciplines, like C11's data-race-freedom,^[2] are sufficient to regain a sequentially consistent view of memory.

A variation of write-through caching is called buffered write-through.^{[ citation needed ]}

Use of a write buffer in this manner frees the cache to service read requests while the write is taking place. It is especially useful for very slow main memory in that subsequent reads are able to proceed without waiting for long main memory latency. When the write buffer is full (i.e. all buffer entries are occupied), subsequent writes still have to wait until slots are freed. Subsequent reads could be served from the write buffer. To further mitigate this stall, one optimization called write buffer merge may be implemented. Write buffer merge combines writes that have consecutive destination addresses into one buffer entry. Otherwise, they would occupy separate entries which increases the chance of pipeline stall.

A victim buffer is a type of write buffer that stores dirty evicted lines in write-back caches^{[note 1]} so that they get written back to main memory. Besides reducing pipeline stall by not waiting for dirty lines to write back as a simple write buffer does, a victim buffer may also serve as a temporary backup storage when subsequent cache accesses exhibit locality, requesting those recently evicted lines, which are still in the victim buffer.

The store buffer was invented by IBM during Project ACS between 1964 and 1968,^[3] but it was first implemented in commercial products in the 1990s.

Notes

↑ Write-through caches don't need write the evicted cache lines as they are written to main memory when the cache is written.

Related Research Articles

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).

The MESI protocol is an invalidate-based cache coherence protocol, and is one of the most common protocols that support write-back caches. It is also known as the Illinois protocol due to its development at the University of Illinois at Urbana-Champaign. Write back caches can save considerable bandwidth generally wasted on a write through cache. There is always a dirty state present in write-back caches that indicates that the data in the cache is different from that in the main memory. The Illinois Protocol requires a cache-to-cache transfer on a miss if the block resides in another cache. This protocol reduces the number of main memory transactions with respect to the MSI protocol. This marks a significant improvement in performance.

In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems like distributed shared memory systems or distributed data stores. Consistency is different from coherence, which occurs in systems that are cached or cache-less, and is consistency of data with respect to all processors. Coherence deals with maintaining a global order in which writes to a single location or single variable are seen by all processors. Consistency deals with the ordering of operations to multiple locations with respect to all processors.

Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution and enables more efficient use of multiple execution units. It was developed by Robert Tomasulo at IBM in 1967 and was first implemented in the IBM System/360 Model 91’s floating point unit.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only instruction-level parallelism in an instruction stream, which can be exploited by various and complementary techniques such as superscalar and out-of-order execution for better performance.

In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a virtual machine in which the primary interaction is moving short-lived temporary values to and from a push down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automata with additional load/store operations or multiple stacks and hence are Turing-complete.

A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. It is a part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that uses paged or segmented virtual memory.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

In computer engineering, out-of-order execution is a paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as μarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

A register file is an array of processor registers in a central processing unit (CPU). The instruction set architecture of a CPU will almost always define a set of registers which are used to stage data between memory and the functional units on the chip. The register file is part of the architecture and visible to the programmer, as opposed to the concept of transparent caches. In simpler CPUs, these architectural registers correspond one-for-one to the entries in a physical register file (PRF) within the CPU. More complicated CPUs use register renaming, so that the mapping of which physical entry stores a particular architectural register changes dynamically during execution.

In computing, a page cache, sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD) or a solid-state drive (SSD). The operating system keeps a page cache in otherwise unused portions of the main memory (RAM), resulting in quicker access to the contents of cached pages and overall performance improvements. A page cache is implemented in kernels with the paging memory management, and is mostly transparent to applications.

The Alpha 21064 is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture (ISA). It was introduced as the DECchip 21064 before it was renamed in 1994. The 21064 is also known by its code name, EV4. It was announced in February 1992 with volume availability in September 1992. The 21064 was the first commercial implementation of the Alpha ISA, and the first microprocessor from Digital to be available commercially. It was succeeded by a derivative, the Alpha 21064A in October 1993. This last version was replaced by the Alpha 21164 in 1995.

The R2000 is a 32-bit microprocessor chip set developed by MIPS Computer Systems that implemented the MIPS I instruction set architecture (ISA). Introduced in January 1986, it was, by a few months, the first commercial implementation of the RISC architecture. The R2000 competed with Digital Equipment Corporation (DEC) VAX minicomputers and with Motorola 68000 and Intel Corporation 80386 microprocessors. R2000 users included Ardent Computer, DEC, Silicon Graphics, Northern Telecom and MIPS's own Unix workstations.

Processor consistency is one of the consistency models used in the domain of concurrent computing.

In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler. They may reduce cache pollution, reduce bandwidth requirement, bypass latencies, by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.

Latency oriented processor architecture is the microarchitecture of a microprocessor designed to serve a serial computing thread with a low latency. This is typical of most central processing units (CPU) being developed since the 1970s. These architectures, in general, aim to execute as many instructions as possible belonging to a single serial thread, in a given window of time; however, the time to execute a single instruction completely from fetch to retire stages may vary from a few cycles to even a few hundred cycles in some cases. Latency oriented processor architectures are the opposite of throughput-oriented processors which concern themselves more with the total throughput of the system, rather than the service latencies for all individual threads that they work on.

References

↑ Owens, Scott, Susmit Sarkar, and Peter Sewell. "A better x86 memory model: x86-TSO." Theorem Proving in Higher Order Logics. Springer Berlin Heidelberg, 2009. 391-407.
↑ Oberhauser, Jonas. "A Simpler Reduction Theorem for x86-TSO." Verified Software: Theories, Tools, and Experiments. Springer International Publishing, 2015. 142-164
↑ Cocke, John (2007). "The search for performance in scientific processors". ACM Turing Award Lectures. p. 1987. doi:10.1145/1283920.1283945. ISBN 978-1-4503-1049-9.

This computing article is a stub. You can help Wikipedia by expanding it.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[3] Write-through caches don't need write the evicted cache lines as they are written to main memory when the cache is written.

[1] Owens, Scott, Susmit Sarkar, and Peter Sewell. "A better x86 memory model: x86-TSO." Theorem Proving in Higher Order Logics. Springer Berlin Heidelberg, 2009. 391-407.

[2] Oberhauser, Jonas. "A Simpler Reduction Theorem for x86-TSO." Verified Software: Theories, Tools, and Experiments. Springer International Publishing, 2015. 142-164

[4] Cocke, John (2007). "The search for performance in scientific processors". ACM Turing Award Lectures. p. 1987. doi:10.1145/1283920.1283945. ISBN 978-1-4503-1049-9.

[1]

[2]

[note 1]

[3]