Speculative execution

Speculative execution is an optimization technique in which a computer system performs some task that may not be needed. Work is done before it is known whether it will actually be needed, so as to avoid the delay that would be incurred by doing the work only after it is known to be needed. If the work turns out not to have been needed after all, most changes made by it are reverted and the results are ignored.
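
As a software-level illustration (not drawn from the cited sources), the following C++ sketch speculatively starts an expensive computation before a slow decision has determined whether its result is needed; the function names are hypothetical:

    #include <future>
    #include <iostream>

    // Hypothetical helpers: an expensive computation that may or may not
    // be needed, and a slow check that decides whether it is needed.
    int expensive_result() { return 42; }
    bool is_needed()       { return true; }

    int main() {
        // Speculate: start the work before knowing whether it is needed.
        std::future<int> speculative =
            std::async(std::launch::async, expensive_result);

        bool needed = is_needed();   // the slow decision runs concurrently

        if (needed) {
            // Hit: the result is ready (or nearly so) with no added delay.
            std::cout << speculative.get() << '\n';
        } else {
            // Miss: discard the result; only the wasted work is lost.
            speculative.wait();
        }
    }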

The objective is to provide more concurrency if extra resources are available. This approach is employed in a variety of areas, including branch prediction in pipelined processors, value prediction for exploiting value locality, prefetching memory and files, and optimistic concurrency control in database systems. [1][2][3]
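
For example, optimistic concurrency control can be sketched in a few lines: a writer reads shared state without locking, computes a new value, and commits it only if no other writer intervened. This illustrative C++ fragment uses an atomic compare-and-swap as the validation step; it is a minimal sketch, not the database-level protocol of Kung and Robinson:

    #include <atomic>

    std::atomic<int> counter{0};

    void optimistic_increment() {
        for (;;) {
            int observed = counter.load();   // optimistic read, no lock
            int desired  = observed + 1;     // do the work speculatively
            // Commit only if the value is unchanged since the read;
            // otherwise retry, discarding the speculative result.
            if (counter.compare_exchange_weak(observed, desired))
                return;
        }
    }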

Speculative multithreading is a special case of speculative execution.

Overview

Modern pipelined microprocessors use speculative execution to reduce the cost of conditional branch instructions, using schemes that predict the execution path of a program based on the history of branch executions. [2] To improve performance and utilization of computer resources, instructions can be scheduled before it has been determined that they will need to be executed, i.e., ahead of a branch. [4]
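
The performance effect of branch prediction can be observed from ordinary code. In the following illustrative C++ benchmark (timing figures vary by machine), the same data-dependent branch is fast on sorted input, where the predictor learns the pattern and speculation rarely has to be rolled back, and slow on random input:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    // The branch below is hard to predict on random data but trivially
    // predictable once the data is sorted.
    long sum_if_big(const std::vector<int>& v) {
        long sum = 0;
        for (int x : v)
            if (x >= 128)
                sum += x;
        return sum;
    }

    int main() {
        std::vector<int> v(1 << 22);
        std::mt19937 rng(1);
        for (int& x : v) x = rng() & 255;

        auto time_ms = [&] {
            auto t0 = std::chrono::steady_clock::now();
            volatile long s = sum_if_big(v);
            (void)s;
            return std::chrono::duration<double, std::milli>(
                       std::chrono::steady_clock::now() - t0).count();
        };

        double unsorted = time_ms();
        std::sort(v.begin(), v.end());
        double sorted = time_ms();
        std::printf("unsorted: %.1f ms, sorted: %.1f ms\n", unsorted, sorted);
    }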

Variants

Speculative computation was a related earlier concept. [5]

Eager execution

Eager execution is a form of speculative execution where both sides of the conditional branch are executed; however, the results are committed only if the predicate is true. With unlimited resources, eager execution (also known as oracle execution) would in theory provide the same performance as perfect branch prediction. With limited resources, eager execution should be employed carefully, since the number of resources needed grows exponentially with each level of branch executed eagerly. [6]
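
At the source level, eager execution of a single two-way branch can be sketched as follows (hypothetical arm functions; note that n nested branches would require up to 2^n arm evaluations):

    // Both arms run; only the result selected by the predicate is
    // committed, and the other result is simply dropped.
    int f_taken(int x)     { return x * 2; }   // hypothetical "then" arm
    int f_not_taken(int x) { return x + 1; }   // hypothetical "else" arm

    int eager_select(bool predicate, int x) {
        int then_result = f_taken(x);       // executed regardless of predicate
        int else_result = f_not_taken(x);   // executed regardless of predicate
        return predicate ? then_result : else_result;   // commit one
    }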

Predictive execution

Predictive execution is a form of speculative execution where some outcome is predicted and execution proceeds along the predicted path until the actual result is known. If the prediction is correct, the predicted execution is allowed to commit; however, if there is a misprediction, execution has to be rolled back and re-executed. Common forms of this include branch predictors and memory dependence prediction. A generalized form is sometimes referred to as value prediction. [7]
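
A minimal sketch of value prediction, assuming a simple last-value predictor and hypothetical helper functions, illustrates the predict, validate, and re-execute cycle:

    int last_value = 0;   // simple "last value" predictor state

    int compute_from(int v) { return v * v; }   // hypothetical dependent work

    int predicted_execute(int actual_value) {
        int predicted = last_value;               // predict the input
        int result    = compute_from(predicted);  // speculate on the prediction
        if (predicted != actual_value)            // misprediction detected:
            result = compute_from(actual_value);  // squash and re-execute
        last_value = actual_value;                // train the predictor
        return result;   // the committed result is always correct
    }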

Runahead

Runahead is a technique that allows a computer processor to speculatively pre-process instructions during cache miss cycles. The pre-processed instructions are used to generate instruction and data stream prefetches by executing instructions leading to cache misses (typically called long latency loads) before they would normally occur, effectively hiding memory latency. In runahead, the processor uses the idle execution resources to calculate instruction and data stream addresses using the available information that is independent of a cache miss. Once the processor has resolved the initial cache miss, all runahead results are discarded, and the processor resumes execution as normal. The primary use case of the technique is to mitigate the effects of the memory wall. The technique may also be used for other purposes, such as pre-computing branch outcomes to achieve highly accurate branch prediction. [8]
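
Runahead proper is a hardware mechanism, but its core idea, doing independent address work under an outstanding miss so that future misses overlap, can be approximated in software. This illustrative C++ fragment prefetches the next chunk's addresses (using the GCC/Clang __builtin_prefetch intrinsic) before consuming the current chunk:

    #include <cstddef>

    // While the loads of chunk i are being consumed, run ahead through
    // the independent addresses of chunk i+1 and issue prefetches, so
    // their cache misses overlap with the current work.
    long sum_chunks(const int* data, std::size_t n, std::size_t chunk) {
        long sum = 0;
        for (std::size_t i = 0; i < n; i += chunk) {
            // Runahead phase: touch future lines without using results.
            for (std::size_t j = i + chunk; j < i + 2 * chunk && j < n; j += 16)
                __builtin_prefetch(&data[j]);
            // Normal execution: these loads now hit more often in cache.
            for (std::size_t j = i; j < i + chunk && j < n; ++j)
                sum += data[j];
        }
        return sum;
    }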

Lazy execution

Lazy execution is the opposite of eager execution, and does not involve speculation. The incorporation of speculative execution into implementations of the Haskell programming language, a lazy language, is a current research topic. Eager Haskell, a variant of the language, is designed around the idea of speculative execution. A 2003 PhD thesis added a kind of speculative execution, called optimistic evaluation, to GHC, with an abort mechanism to back out of bad choices. [9] It was deemed too complicated. [10]
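
The contrast can be sketched in C++ terms (illustrative only; a real lazy thunk would also memoize its result): a lazy value runs only when demanded, while a speculative value starts running before any demand and is discarded if never used.

    #include <functional>
    #include <future>

    int costly() { return 7; }   // hypothetical expensive thunk body

    int main() {
        // Lazy: defer the work; nothing runs unless the value is forced.
        std::function<int()> lazy = costly;

        // Speculative: start the work immediately, before any demand.
        std::future<int> speculative = std::async(std::launch::async, costly);

        bool demanded = true;   // stand-in for the program's actual demand
        int a = demanded ? lazy() : 0;             // forced here, if at all
        int b = demanded ? speculative.get() : 0;  // already (being) computed
        return a == b ? 0 : 1;
    }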

Security vulnerabilities

Starting in 2017, a series of security vulnerabilities were found in the implementations of speculative execution on common processor architectures, which effectively enabled the leaking of privileged data across protection boundaries.

These include:

- Spectre
- Meltdown
- Foreshadow
- Microarchitectural Data Sampling (MDS)
- Pacman
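
The best known of these, Spectre variant 1 (bounds check bypass), abuses code of roughly the following shape, after the pattern published in the Spectre paper (the array names follow that paper):

    #include <cstddef>
    #include <cstdint>

    std::uint8_t array1[16];
    std::uint8_t array2[256 * 4096];
    volatile std::uint8_t sink;   // keeps the compiler from removing the load

    void victim(std::size_t x, std::size_t array1_size) {
        if (x < array1_size) {   // branch may be mispredicted as taken
            // Executed speculatively even when x is out of bounds: the
            // secret byte array1[x] selects which cache line of array2 is
            // loaded, leaving a footprint that a cache side channel such
            // as Flush+Reload can later recover.
            sink = array2[array1[x] * 4096];
        }
    }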

See also

- Thread (computing)
- Superscalar processor
- Instruction-level parallelism
- Simultaneous multithreading
- Branch predictor
- Barrel processor
- Microarchitecture
- Transactional memory
- Memory dependence prediction
- Thread Level Speculation
- Multithreading (computer architecture)
- Memory-level parallelism
- Hardware scout
- Runahead
- Alpha 21464
- International Symposium on Microarchitecture
- Cache prefetching
- Latency oriented processor architecture
- Trace cache
- Pacman (security vulnerability)

References

  1. Lampson, Butler (2006). "Lazy and Speculative Execution in Computer Systems". In Momenzadeh, Mariam; Shvartsman, Alexander A. (eds.). Principles of Distributed Systems. 10th International Conference on Principles of Distributed Systems. Lecture Notes in Computer Science. Vol. 4305. Bordeaux, France: Springer. pp. 1–2. doi:10.1007/11945529_1. ISBN 978-3-540-49991-6.
  2. Raghavan, Prabhakar; Shachnai, Hadas; Yaniv, Mira (1998). "Dynamic schemes for speculative execution of code". Proceedings of the Sixth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. IEEE. pp. 309–314. doi:10.1109/MASCOT.1998.693711. Retrieved 18 January 2011.
  3. Kung, H. T.; Robinson, John T. (June 1981). "On optimistic methods for concurrency control" (PDF). ACM Transactions on Database Systems. Vol. 6. Archived (PDF) from the original on August 31, 2019.
  4. Krieg-Brückner, Bernd (1992). ESOP '92: 4th European Symposium on Programming, Rennes, France, February 26–28, 1992: Proceedings. Springer. pp. 56–57. ISBN 978-3-540-55253-6. Retrieved 18 January 2011.
  5. Osborne, Randy B. (1990-03-21). "Speculative Computation in Multilisp". Parallel Lisp: Languages and Systems. Lecture Notes in Computer Science. Vol. 441. Digital Equipment Corporation Research Lab. pp. 103–137. doi:10.1007/BFb0024152. ISBN 3-540-52782-6. Archived from the original on 2017-02-07. Retrieved 2018-01-26.
  6. Šilc, Jurij; Robič, Borut; Ungerer, Theo (1999). Processor Architecture: From Dataflow to Superscalar and Beyond. Springer. pp. 148–150. ISBN 978-3-540-64798-0. Retrieved 21 January 2011.
  7. Hill, Mark D.; Jouppi, Norman P.; Sohi, Gurindar S. (2000). Readings in Computer Architecture. Morgan Kaufmann. ISBN 9781558605398. Retrieved 5 January 2018.
  8. Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.
  9. Ennals, Robert; Peyton Jones, Simon (1 August 2003). "Optimistic Evaluation: a fast evaluation strategy for non-strict programs". Retrieved 15 May 2019 via www.microsoft.com.
  10. "[Haskell] Optimistic Evaluation?". 31 August 2006.