Runahead

Last updated February 04, 2024

Runahead is a technique that allows a computer processor to speculatively pre-process instructions during cache miss cycles. The pre-processed instructions are used to generate instruction and data stream prefetches by executing instructions leading to cache misses (typically called long latency loads) before they would normally occur, effectively hiding memory latency. In runahead, the processor uses the idle execution resources to calculate instruction and data stream addresses using the available information that is independent of a cache miss. Once the processor has resolved the initial cache miss, all runahead results are discarded, and the processor resumes execution as normal. The primary use case of the technique is to mitigate the effects of the memory wall. The technique may also be used for other purposes, such as pre-computing branch outcomes to achieve highly accurate branch prediction.^[1]

The principal hardware cost is a means of checkpointing the register file state. Typically, runahead processors will also contain a small additional cache, which allows runahead store operations to execute without modifying actual memory. Certain implementations also use dedicated hardware acceleration units to execute specific slices of pre-processed instructions.^[2]^[3]

Runahead was initially investigated in the context of an in-order microprocessor;^[4] however, this technique has been extended for use with out-of-order microprocessors.^[5]

Triggering

In principle, any event can trigger runahead, though typically the entry condition is a last level data cache miss that makes it to the head of the re-order buffer.^[5] In a normal out-of-order processor, such long latency load instructions block retirement of all younger instructions until the miss is serviced and the load is retired.

When a processor enters runahead mode, it checkpoints all architectural registers and records the address of the load instruction that caused entry into runahead. All instructions in the pipeline are then marked as runahead. Because the value returned from a cache miss cannot be known ahead of time, pre-processed instructions may depend on unknown or invalid data. To track this, every register in the register file carries an "invalid" (INV) bit, which is set when the register holds such data or data dependent on it. Instructions that use or write such invalid data are also marked with an INV bit. If the instruction that initiated runahead was a load, it is issued a bogus result and marked as INV, allowing it to mark its destination register as INV and drain out of the pipeline.
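The entry sequence described above can be sketched as a small software model. This is only an illustration under stated assumptions: the `Core` class, its field names, and the register count are hypothetical, and the pipeline itself is not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional

NUM_ARCH_REGS = 8  # assumed register count, purely illustrative


@dataclass
class Core:
    # Hypothetical core state; names are not taken from any real design.
    regs: list = field(default_factory=lambda: [0] * NUM_ARCH_REGS)
    inv: list = field(default_factory=lambda: [False] * NUM_ARCH_REGS)
    in_runahead: bool = False
    checkpoint: Optional[list] = None
    resume_pc: Optional[int] = None


def enter_runahead(core, load_pc, load_dest):
    """Enter runahead on a long-latency load at load_pc that writes load_dest."""
    core.checkpoint = list(core.regs)  # checkpoint the architectural registers
    core.resume_pc = load_pc           # record the load that caused entry
    core.in_runahead = True
    core.regs[load_dest] = 0           # bogus result; the value carries no meaning
    core.inv[load_dest] = True         # destination register is marked INV
```

Calling `enter_runahead(core, 0x400, 3)` leaves register 3 marked INV while every other register keeps its valid pre-runahead value.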

Pre-processing instructions

In runahead mode, the processor continues to execute instructions after the instruction that initiated runahead. However, runahead is considered a speculative state in which the processor only attempts to generate additional data and instruction cache misses which are effectively prefetches. The designer can opt to allow runahead to skip instructions that are not present in the instruction cache with the understanding that the quality of any prefetches generated will be reduced since the effect of the missing instructions is unknown.

The destination register of any instruction that has one or more source registers marked INV is itself marked INV. This allows the processor to know which register values can (probably) be trusted during runahead mode. Branch instructions that cannot be resolved due to INV source registers are simply assumed to have been predicted correctly. If such a branch was in fact mispredicted, the processor continues executing wrong-path instructions until it reaches a branch-independent point, potentially executing wrong-path loads that pollute the cache with useless data entries. Valid branch instruction outcomes can be saved for later use as highly accurate predictions during normal operation.
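The INV propagation rule and the treatment of unresolvable branches can be sketched as follows. The function names and the branch-if-nonzero semantics are assumptions made for the example, as is the choice to clear a register's INV bit when a valid result overwrites it.

```python
def execute(regs, inv, dest, srcs, op):
    """Execute one register-to-register instruction in runahead mode."""
    if any(inv[s] for s in srcs):
        inv[dest] = True           # result depends on unknown data
        return
    regs[dest] = op(*(regs[s] for s in srcs))
    inv[dest] = False              # assumption: a valid write clears a stale INV bit


def resolve_branch(regs, inv, src, predicted_taken):
    """Return the direction runahead follows for a branch-if-nonzero on src."""
    if inv[src]:
        return predicted_taken     # cannot be resolved: assume the prediction is right
    return regs[src] != 0          # valid source: the real outcome is known
```

A branch whose source is valid yields an outcome that could be recorded for later use; a branch on an INV source simply follows the predictor.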

Since runahead is a speculative state, store instructions cannot be allowed to modify memory. In order to communicate store results to dependent loads, a very small cache accessed only by runahead loads and stores, called a runahead cache, can be used.^[5] This cache is functionally similar to a normal cache, but contains INV bits to track which data is invalid. INV stores set the INV bit of their corresponding target cache line, while valid stores reset the INV bit of the cache line. Any runahead load instruction must check both the real cache and the runahead cache. If the load hits in the runahead cache, it discards the real cache result and uses the runahead cache data, potentially becoming invalid if the cache line was marked with an INV bit. Because the runahead cache is separate from the memory hierarchy, there is no place to evict old data to. Therefore, in case of a cache conflict, the old data is simply dropped from the cache. Note that because of the limited size of the runahead cache, it is not possible to perfectly track INV data during runahead mode (as INV data may be overwritten by valid data in a cache conflict). In practice, this is not crucial since all results computed during runahead mode are discarded.
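A minimal model of a runahead cache might look like the following. It assumes a direct-mapped organisation with one word per line, and it uses a plain dictionary as a stand-in for the real cache and memory; none of these choices come from the source, and `RunaheadCache` and `runahead_load` are hypothetical names.

```python
class RunaheadCache:
    """Tiny direct-mapped cache holding runahead store results (illustrative)."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = [None] * num_lines  # each entry: (addr, value, inv)

    def store(self, addr, value, inv):
        # On a conflict the older line is overwritten: there is nowhere to evict it to.
        self.lines[addr % self.num_lines] = (addr, value, inv)

    def lookup(self, addr):
        line = self.lines[addr % self.num_lines]
        if line is not None and line[0] == addr:
            return line[1], line[2]
        return None


def runahead_load(addr, rcache, memory):
    """Return (value, inv) for a runahead load; real memory is never modified."""
    hit = rcache.lookup(addr)
    if hit is not None:
        return hit                   # runahead cache data overrides the real cache
    return memory.get(addr, 0), False
```

If an INV store to address 8 is followed by a store to a conflicting address such as 12, the INV line is dropped and a later load from 8 sees the real, valid-looking data again. This is the imperfect INV tracking that the paragraph above describes.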

Exiting

As with entering runahead, any event can in principle be cause for exiting runahead. A runahead period initiated by a cache miss, however, is typically exited once that cache miss has been serviced.

When the processor exits runahead, all instructions younger than and including the instruction that initiated runahead are squashed and drained out of the pipeline. The architectural register file is then restored from the checkpoint. A predetermined register aliasing table (RAT) is then copied into both the front- and backend RAT. Finally, the processor is redirected to the address of the instruction that initiated runahead. The processor then resumes execution in normal mode.
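The exit sequence can be sketched in the same style. Here the core is represented by a plain dictionary with hypothetical keys, the RAT copy is omitted for brevity, and the clearing of INV bits on exit is an assumption that follows from discarding all runahead results.

```python
def exit_runahead(state):
    """Leave runahead mode; `state` is a plain dict standing in for the core."""
    state["pipeline"].clear()                    # squash every in-flight instruction
    state["regs"] = list(state["checkpoint"])    # restore the architectural registers
    state["inv"] = [False] * len(state["regs"])  # assumption: INV marks are discarded
    state["pc"] = state["resume_pc"]             # refetch the initiating instruction
    state["in_runahead"] = False
```

After the call, execution restarts at the load that triggered runahead, which should now hit in the cache because its miss has been serviced.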

Register file checkpoint options

The simplest method of checkpointing the architectural register file (ARF) is to perform a full copy of the entire physical register file (PRF) (because the PRF is a superset of the ARF) to a checkpoint register file (CRF) when the processor enters runahead mode. When runahead is exited, the processor can then perform a full copy from the CRF to the PRF. However, there are more efficient options available.

One way to eliminate the copy operations is to write to both the PRF and CRF during normal operation, but only to the PRF in runahead mode. This approach can eliminate the checkpointing overhead that would otherwise be incurred on initiating runahead if the CRF and PRF are written to in parallel, but still requires the processor to restore the PRF when runahead is exited.

Because the only registers that need to be checkpointed are the architectural registers, the CRF only needs to contain as many registers as there are architectural registers, as defined by the instruction set architecture. Since processors typically contain far more physical registers than architectural registers, this significantly shrinks the size of the CRF.
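The parallel-write scheme with an architecturally sized CRF can be modelled roughly as below. This is a simplification: writes are expressed at the architectural level, the `rat` mapping is fixed rather than renamed dynamically, and the class and method names are invented for the example.

```python
class CheckpointedRegs:
    """PRF plus a smaller CRF that shadows architectural state (illustrative)."""

    def __init__(self, num_arch=4, num_phys=8):
        self.prf = [0] * num_phys
        self.crf = [0] * num_arch         # one entry per architectural register
        self.rat = list(range(num_arch))  # assumed fixed arch -> phys mapping
        self.runahead = False

    def write(self, arch, value):
        self.prf[self.rat[arch]] = value
        if not self.runahead:
            self.crf[arch] = value        # shadow write keeps the CRF current

    def enter(self):
        self.runahead = True              # no copy needed: the CRF is already up to date

    def exit(self):
        for arch, phys in enumerate(self.rat):
            self.prf[phys] = self.crf[arch]  # the restore is still required on exit
        self.runahead = False
```

Entering runahead costs nothing in this model because the CRF was kept current all along, while exiting still pays for one restore pass over the architectural registers.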

An even more aggressive approach is to rely only upon the operand forwarding paths of the microarchitecture to provide modified values during runahead mode.^[citation needed] The register file is then "checkpointed" by disabling writes to the register file during runahead.

Optimizations

While runahead is intended to increase processor performance, pre-processing instructions when the processor would otherwise have been idle decreases the processor's energy efficiency due to an increase in dynamic power draw. Additionally, entering and exiting runahead incurs a performance overhead, as register checkpointing and particularly flushing the pipeline may take many cycles to complete. Therefore, it is not wise to initiate runahead at every opportunity.

Some optimizations that improve the energy efficiency of runahead are:

Only entering runahead if the processor is expected to execute long latency loads during runahead, thereby reducing short, unproductive runahead periods.^[6]
Limiting the length of runahead periods to only run as long as they are expected to generate useful results.^[7]
Only pre-processing instructions that eventually lead to load instructions.^[8]
Only using free processor resources to pre-process instructions.^[9]
Buffering micro-operations that were decoded during runahead for reuse in normal mode.^[9]
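As a rough illustration of the first idea in the list, an entry filter could compare the expected benefit of a runahead period against its fixed entry and exit cost. The function below is a hypothetical sketch: its inputs would have to come from predictors that are not modelled here, and it is not the mechanism used in any of the cited papers.

```python
def should_enter_runahead(expected_misses, remaining_latency, overhead_cycles):
    """Hypothetical entry filter: enter only when a runahead period looks worthwhile."""
    if expected_misses < 1:
        return False                            # no long-latency load to prefetch
    return remaining_latency > overhead_cycles  # period must outlast the entry/exit cost
```

With an assumed overhead of 30 cycles, a miss with 200 cycles left to wait would qualify, while one with only 20 cycles left would not.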

Side effects

Runahead has been found to reduce soft error rates in processors as a side effect. While a cache miss is outstanding, the entire state of the waiting processor is vulnerable to soft errors. By continuing execution, runahead unintentionally reduces the amount of time the processor state is vulnerable to soft errors, thereby reducing soft error rates.^[10]

References

[1] Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '21. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.
[2] Hashemi, Milad; Mutlu, Onur; Patt, Yale N. (October 2016). "Continuous runahead: Transparent hardware acceleration for memory intensive workloads". 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1–12. doi:10.1109/MICRO.2016.7783764. ISBN 978-1-5090-3508-3. S2CID 439575.
[3] Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '21. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.
[4] Dundas, James D.; Mudge, Trevor N. (September 1996). "Using stall cycles to improve microprocessor performance". Technical report. Department of Electrical Engineering and Computer Science, University of Michigan.
[5] Mutlu, O.; Stark, J.; Wilkerson, C.; Patt, Y.N. (February 2003). "Runahead execution: An alternative to very large instruction windows for out-of-order processors". The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. pp. 129–140. doi:10.1109/HPCA.2003.1183532. ISBN 0-7695-1871-0. S2CID 9016814.
[6] Van Craeynest, Kenzo; Eyerman, Stijn; Eeckhout, Lieven (2009). Seznec, André; Emer, Joel; O’Boyle, Michael; Martonosi, Margaret (eds.). "MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor". High Performance Embedded Architectures and Compilers. Vol. 5409. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 110–124. doi:10.1007/978-3-540-92990-1_10. ISBN 978-3-540-92989-5. Retrieved 2023-06-02.
[7] Van Craeynest, Kenzo; Eyerman, Stijn; Eeckhout, Lieven (2009). Seznec, André; Emer, Joel; O’Boyle, Michael; Martonosi, Margaret; Ungerer, Theo (eds.). MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor. Lecture Notes in Computer Science. Vol. 5409. Berlin, Heidelberg: Springer. pp. 110–124. doi:10.1007/978-3-540-92990-1_10. ISBN 978-3-540-92990-1.
[8] Hashemi, Milad; Patt, Yale N. (2015-12-05). "Filtered runahead execution with a runahead buffer". Proceedings of the 48th International Symposium on Microarchitecture. MICRO-48. New York, NY, USA: Association for Computing Machinery. pp. 358–369. doi:10.1145/2830772.2830812. ISBN 978-1-4503-4034-2. S2CID 2897777.
[9] Naithani, Ajeya; Feliu, Josué; Adileh, Almutaz; Eeckhout, Lieven (February 2020). "Precise Runahead Execution". 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 397–410. doi:10.1109/HPCA47549.2020.00040. hdl:1854/LU-8668193. ISBN 978-1-7281-6149-5. S2CID 215817567.
[10] Naithani, Ajeya; Eeckhout, Lieven (April 2022). "Reliability-Aware Runahead". 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE. pp. 772–785. doi:10.1109/HPCA53966.2022.00062. ISBN 978-1-6654-2027-3. S2CID 248865294.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
