Memory dependence prediction

Memory dependence prediction is a technique, employed by high-performance out-of-order execution microprocessors that execute memory access operations (loads and stores) out of program order, to predict true dependencies between loads and stores at instruction execution time. With the predicted dependence information, the processor can then decide to execute certain loads and stores speculatively out of order, while forcing others to execute in order. Later in the pipeline, memory disambiguation techniques determine whether the loads and stores executed correctly and, if not, recover.
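
As a minimal illustration (a hypothetical snippet, not drawn from the cited literature), whether the load below depends on the store that precedes it cannot, in general, be determined until both addresses are known at run time; this is exactly the decision a memory dependence predictor tries to anticipate:

    # Hypothetical example: a store followed by a load. The load truly
    # depends on the store only when i == j, which may be unknown until
    # both indices are computed at run time.
    def update_and_read(a, i, j, v):
        a[i] = v        # store to the location of a[i]
        return a[j]     # load from the location of a[j]; dependent iff i == j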

By using the memory dependence predictor to keep most dependent loads and stores in order, the processor gains the benefits of aggressive out-of-order load/store execution while avoiding many of the memory dependence violations that occur when loads and stores are incorrectly reordered. This increases performance because it reduces the number of pipeline flushes required to recover from such violations. See the memory disambiguation article for more information on memory dependencies, memory dependence violations, and recovery.

In general, memory dependence prediction predicts whether two memory operations are dependent, that is, whether they interact by accessing the same memory location. Besides using store-to-load (read-after-write, or true) memory dependence prediction for the out-of-order scheduling of loads and stores, other applications of memory dependence prediction have been proposed; see, for example, Moshovos and Sohi. [1]
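
Concretely, two accesses touch the same location when their byte ranges overlap. A minimal sketch of that notion, assuming byte-addressed accesses with explicit sizes (the helper name is illustrative):

    # Two memory accesses are to the same location, and hence potentially
    # dependent, when their byte ranges overlap (illustrative helper).
    def accesses_overlap(addr_a, size_a, addr_b, size_b):
        return addr_a < addr_b + size_b and addr_b < addr_a + size_a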

Memory dependence prediction is an optimization on top of memory dependence speculation. Sequential execution semantics imply that stores and loads appear to execute in the order specified by the program. However, as with out-of-order execution of other instructions, it may be possible to execute two memory operations in an order different from that implied by the program. This is possible when the two operations are independent. In memory dependence speculation, a load may be allowed to execute before a store that precedes it. Speculation succeeds when the load is independent of the store, that is, when the two instructions access different memory locations. Speculation fails when the load is dependent upon the store, that is, when the two accesses overlap in memory. The first modern out-of-order designs did not use memory dependence speculation, as its benefits were limited. As the scope of out-of-order execution grew to a few tens of instructions, naive memory dependence speculation was used. In naive memory dependence speculation, [2] a load is allowed to bypass any preceding store. As with any form of speculation, it is important to weigh the benefits of correct speculation against the penalty paid on incorrect speculation. As the scope of out-of-order execution increases further to several tens of instructions, the performance benefits of naive speculation decrease. To retain the benefits of aggressive memory dependence speculation while avoiding the costs of misspeculation, several predictors have been proposed.
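
The following sketch illustrates naive memory dependence speculation and its recovery path. The structure and names are expository assumptions, not a description of any real pipeline:

    # Illustrative sketch of naive memory dependence speculation; all
    # names and structures are assumptions, not a real pipeline model.
    # Loads execute as soon as possible, bypassing all earlier stores;
    # when a store's address resolves, any younger load to the same
    # address that already executed is a violation and must be replayed.
    memory = {}
    executed_loads = []   # (sequence_number, address) of loads executed speculatively

    def execute_load(seq, addr):
        executed_loads.append((seq, addr))
        return memory.get(addr, 0)            # value read, possibly prematurely

    def resolve_store(seq, addr, value):
        memory[addr] = value
        # Detect younger loads to the same address that bypassed this store.
        violators = [s for s, a in executed_loads if s > seq and a == addr]
        if violators:
            flush_and_replay(min(violators))  # recover from the oldest violator

    def flush_and_replay(from_seq):
        # Placeholder for recovery: squash the violating load and all
        # younger instructions, then re-execute them in order.
        global executed_loads
        executed_loads = [(s, a) for s, a in executed_loads if s < from_seq]

In this sketch every load speculates unconditionally; a predictor replaces that unconditional bypass with a learned decision, as the predictors described next do.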

Selective memory dependence prediction [2] [3] stalls specific loads until it is certain that no violation may occur. It does not explicitly predict dependencies. This predictor may delay loads longer than necessary and hence result in suboptimal performance. In fact, in some cases it performs worse than naively speculating all loads as early as possible, because it is often faster to misspeculate and recover than to wait for all preceding stores to execute.

Exact memory dependence prediction was developed at the University of Wisconsin–Madison. Specifically, dynamic speculation and synchronization [2] [3] delays loads only as long as necessary by predicting the exact store a load should wait for. This predictor predicts exact dependences (store and load instruction pairs). The synonym predictor [1] groups together all dependences that share a common load or store instruction. The store sets [4] predictor represents multiple potential dependences efficiently by grouping together all possible stores a load may depend upon (see the sketch below).

The store barrier [5] predictor treats certain store instructions as barriers: no subsequent load or store operation is allowed to bypass the specific store. The store barrier predictor does not explicitly predict dependencies, and it may unnecessarily delay subsequent, yet independent, loads.

Memory dependence prediction has other applications beyond the scheduling of loads and stores. For example, speculative memory cloaking [1] and speculative memory bypassing [1] use memory dependence prediction to streamline the communication of values through memory.
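
As an illustration of the store-set approach [4], the sketch below maintains a store set identifier table (SSIT) indexed by a hash of the instruction address and a last fetched store table (LFST) indexed by store set identifier. The table size and helper names are assumptions for exposition:

    # Illustrative sketch of a store-set style predictor, after Chrysos
    # and Emer [4]; sizes and names are expository assumptions.
    SSIT_SIZE = 1024      # Store Set ID Table entries, indexed by hashed PC
    ssit = {}             # hashed PC -> store set identifier (SSID)
    lfst = {}             # SSID -> sequence number of the last fetched store
    next_ssid = 0

    def ssit_index(pc):
        return pc % SSIT_SIZE

    def train_on_violation(load_pc, store_pc):
        # A load executed before a store it depended on: place both
        # instructions into the same store set so the load waits next time.
        global next_ssid
        ssid = ssit.get(ssit_index(store_pc)) or ssit.get(ssit_index(load_pc))
        if ssid is None:
            next_ssid += 1
            ssid = next_ssid
        ssit[ssit_index(load_pc)] = ssid
        ssit[ssit_index(store_pc)] = ssid

    def fetch_store(store_pc, seq):
        ssid = ssit.get(ssit_index(store_pc))
        if ssid is not None:
            lfst[ssid] = seq          # this store is now the set's newest member

    def fetch_load(load_pc):
        # Returns the store this load should wait for, or None to issue freely.
        ssid = ssit.get(ssit_index(load_pc))
        return lfst.get(ssid) if ssid is not None else None

A real implementation also clears an LFST entry once the store it names executes, so that loads do not wait on stores that have already completed.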

Analogy to branch prediction

Memory dependence prediction for loads and stores is analogous to branch prediction for conditional branch instructions. In branch prediction, the branch predictor predicts which way the branch will resolve before it is known. The processor can then speculatively fetch and execute instructions down one of the paths of the branch. Later, when the branch instruction executes, it can be determined if the branch instruction was correctly predicted. If not, this is a branch misprediction, and a pipeline flush is necessary to throw away instructions that were speculatively fetched and executed.

Branch prediction can be thought of as a two-step process. First, the predictor determines the direction of the branch (taken or not taken); this is a binary decision. Then, the predictor determines the actual target address. Similarly, memory dependence prediction can be thought of as a two-step process: first, the predictor determines whether the load has a dependence; then, it determines which store that dependence is on.
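
To make the analogy concrete, a hypothetical two-step prediction interface might look as follows; the state and names are illustrative assumptions, not a hardware description:

    # Hypothetical two-step interface mirroring branch prediction's
    # direction/target split; state and names are illustrative only.
    has_dependence = {}   # step 1 state: load PC -> has this load violated before?
    source_store = {}     # step 2 state: load PC -> PC of the predicted producing store

    def predict(load_pc):
        if not has_dependence.get(load_pc, False):  # step 1: binary decision
            return None                             # predicted independent: issue now
        return source_store.get(load_pc)            # step 2: identify the dependence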

Related Research Articles

Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs.

In computer science, predication is an architectural feature that provides an alternative to conditional transfer of control, as implemented by conditional branch machine instructions. Predication works by having conditional (predicated) non-branch instructions associated with a predicate, a Boolean value used by the instruction to control whether the instruction is allowed to modify the architectural state or not. If the predicate specified in the instruction is true, the instruction modifies the architectural state; otherwise, the architectural state is unchanged. For example, a predicated move instruction will only modify the destination if the predicate is true. Thus, instead of using a conditional branch to select an instruction or a sequence of instructions to execute based on the predicate that controls whether the branch occurs, the instructions to be executed are associated with that predicate, so that they will be executed, or not executed, based on whether that predicate is true or false.

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically, ILP refers to the average number of instructions run per step of this parallel execution.

Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

A re-order buffer (ROB) is a hardware unit used in an extension to the Tomasulo algorithm to support out-of-order and speculative instruction execution. The extension forces instructions to be committed in-order.

Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution and enables more efficient use of multiple execution units. It was developed by Robert Tomasulo at IBM in 1967 and was first implemented in the IBM System/360 Model 91’s floating point unit.

In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control hazards.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.

Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture, and HP later asserted that "EPIC" was merely an old term for the Itanium architecture. EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.

In computer engineering, out-of-order execution is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Memory disambiguation is a set of techniques employed by high-performance out-of-order execution microprocessors that execute memory access instructions out of program order. The mechanisms for performing memory disambiguation, implemented using digital logic inside the microprocessor core, detect true dependencies between memory operations at execution time and allow the processor to recover when a dependence has been violated. They also eliminate spurious memory dependencies and allow for greater instruction-level parallelism by allowing safe out-of-order execution of loads and stores.

Runahead is a technique that allows a microprocessor to pre-process instructions during cache miss cycles instead of stalling. The pre-processed instructions are used to generate instruction and data stream prefetches by detecting cache misses before they would otherwise occur by using the idle execution resources to calculate instruction and data stream fetch addresses using the available information that is independent of the cache miss.

The Alpha 21264 is a Digital Equipment Corporation RISC microprocessor launched on 19 October 1998. The 21264 implemented the Alpha instruction set architecture (ISA).

The Power Processing Element (PPE) comprises a Power Processing Unit (PPU) and a 512 KB L2 cache. In most instances the PPU is used in a PPE. The PPU is a 64-bit dual-threaded in-order PowerPC 2.02 microprocessor core designed by IBM for use primarily in the game consoles PlayStation 3 and Xbox 360, but has also found applications in high performance computing in supercomputers such as the record setting IBM Roadrunner.

Processor Consistency is one of the consistency models used in the domain of concurrent computing.

Latency oriented processor architecture is the microarchitecture of a microprocessor designed to serve a serial computing thread with a low latency. This is typical of most central processing units (CPU) being developed since the 1970s. These architectures, in general, aim to execute as many instructions as possible belonging to a single serial thread, in a given window of time; however, the time to execute a single instruction completely from fetch to retire stages may vary from a few cycles to even a few hundred cycles in some cases. Latency oriented processor architectures are the opposite of throughput-oriented processors which concern themselves more with the total throughput of the system, rather than the service latencies for all individual threads that they work on.

In computer architecture, a trace cache or execution trace cache is a specialized instruction cache which stores the dynamic stream of instructions known as trace. It helps in increasing the instruction fetch bandwidth and decreasing power consumption by storing traces of instructions that have already been fetched and decoded. A trace processor is an architecture designed around the trace cache and processes the instructions at trace level granularity. The formal mathematical theory of traces is described by trace monoids.

References

  1. Moshovos, A.; Sohi, G. S. (1997). "Streamlining Inter-Operation Memory Communication via Data Dependence Prediction". Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO '97). pp. 235–245. doi:10.1109/MICRO.1997.645814.
  2. Moshovos, Andreas; Breach, Scott E.; Vijaykumar, T. N.; Sohi, Gurindar S. (1997). "Dynamic Speculation and Synchronization of Data Dependences". Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA '97). pp. 181–193. doi:10.1145/264107.264189. Also available as a technical report, Computer Sciences Department, University of Wisconsin–Madison, March 1996.
  3. Moshovos, Andreas (1998). Memory Dependence Prediction. Ph.D. thesis, Computer Sciences Department, University of Wisconsin–Madison, December 1998.
  4. Chrysos, G. Z.; Emer, J. S. (1998). "Memory Dependence Prediction Using Store Sets". Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA '98). pp. 142–153. doi:10.1109/ISCA.1998.694770.
  5. Hesson, LeBlanc, and Ciavaglia. "Apparatus to dynamically control the out-of-order execution of load-store instructions in a processor capable of dispatching, issuing and executing multiple instructions in a single processor cycle". US Patent 5,615,350, assigned to IBM, March 1997.