Speculative multithreading

Thread Level Speculation (TLS), also known as Speculative Multithreading or Speculative Parallelization, [1] is a technique that speculatively executes a section of computer code, anticipated to be needed later, on a separate independent thread in parallel with the normal execution. Such a speculative thread may need to make assumptions about the values of its input variables. If these prove to be invalid, the portions of the speculative thread that rely on those input variables must be squashed and discarded. If the assumptions are correct, the program completes in a shorter time, provided the thread could be scheduled efficiently.

Description

TLS extracts threads from serial code and executes them speculatively, in parallel with a safe thread. A speculative thread must be discarded or re-run if its assumptions about the input state prove to be invalid. TLS is a dynamic (runtime) parallelization technique, so it can uncover parallelism that static (compile-time) parallelization techniques fail to exploit, because thread independence cannot be guaranteed at compile time. For the technique to reduce overall execution time, there must be spare CPU resources on which the speculative threads can run in parallel with the main safe thread. [2]
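
As a concrete illustration of the idea, the following sketch (hypothetical code, not taken from any cited system) runs a function speculatively on a second thread under an assumed input value, then validates the assumption once the safe thread has produced the real input, squashing the speculative result if the assumption was wrong:

```cpp
#include <future>
#include <iostream>

// Hypothetical expensive function whose input is produced by earlier serial code.
long expensive(long x) { return x * x % 1000003; }

int main() {
    long predicted = 42;              // speculative assumption about the input

    // Launch the speculative thread using the predicted input value.
    std::future<long> spec = std::async(std::launch::async, expensive, predicted);

    // Meanwhile the safe thread computes the real input.
    long actual = 40 + 2;             // stands in for the preceding serial work

    long result;
    if (actual == predicted) {
        result = spec.get();          // speculation valid: reuse the result
    } else {
        spec.get();                   // squash: discard the speculative result
        result = expensive(actual);   // re-execute with the correct input
    }
    std::cout << result << '\n';
}
```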

TLS optimistically assumes that a given portion of code (generally a loop) can be safely executed in parallel. To do so, it divides the iteration space into chunks that are executed in parallel by different threads. A hardware or software monitor ensures that sequential semantics are preserved (in other words, that the execution progresses as if the loop were running sequentially). If a dependence violation appears, the speculative framework may choose to stop the entire parallel execution and restart it; to stop and restart the offending thread and all its successors, so that they are fed correct data; or to stop only the offending thread and those of its successors that have consumed incorrect data from it. [3]
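
The following toy program shows one way such a software scheme can be structured (an illustrative reconstruction, not the mechanism of any particular framework): chunks of the iteration space run speculatively with private write buffers and recorded read sets, and an in-order commit phase squashes and serially re-executes any chunk that read a location written by an earlier chunk.

```cpp
#include <cstdio>
#include <map>
#include <set>
#include <thread>
#include <vector>

// Toy software TLS for the loop "for (i) a[i] = a[i] + a[idx[i]];".
// Writes are buffered privately and reads of shared state are tracked;
// chunks then commit in original order.
int main() {
    std::vector<int> a   = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<int> idx = {7, 6, 5, 4, 3, 2, 1, 0};
    const int kChunks = 4, kSize = 2;

    std::vector<std::map<int, int>> wbuf(kChunks);  // private write buffers
    std::vector<std::set<int>>      rset(kChunks);  // read sets

    auto run_chunk = [&](int c) {
        for (int i = c * kSize; i < (c + 1) * kSize; ++i) {
            // Read a[idx[i]], forwarding from this chunk's own buffer if present.
            int j = idx[i], v;
            auto hit = wbuf[c].find(j);
            if (hit != wbuf[c].end()) v = hit->second;
            else { rset[c].insert(j); v = a[j]; }   // speculative read of shared state
            // a[i] is written only by this iteration, so only the a[idx[i]]
            // read needs tracking in this toy kernel.
            wbuf[c][i] = a[i] + v;                  // buffered write
        }
    };

    // Speculative phase: all chunks execute in parallel.
    std::vector<std::thread> ts;
    for (int c = 0; c < kChunks; ++c) ts.emplace_back(run_chunk, c);
    for (auto& t : ts) t.join();

    // Commit phase: in-order validation preserves sequential semantics.
    std::set<int> committed;                        // locations written so far
    for (int c = 0; c < kChunks; ++c) {
        bool violated = false;
        for (int r : rset[c])
            if (committed.count(r)) { violated = true; break; }
        if (violated)                               // squash: re-execute serially
            for (int i = c * kSize; i < (c + 1) * kSize; ++i)
                a[i] = a[i] + a[idx[i]];
        else                                        // commit the buffered writes
            for (auto& [i, v] : wbuf[c]) a[i] = v;
        for (int i = c * kSize; i < (c + 1) * kSize; ++i) committed.insert(i);
    }
    for (int v : a) std::printf("%d ", v);          // matches the serial result
    std::printf("\n");
}
```

This sketch implements the third recovery policy described above: only the offending chunk is re-executed, while chunks that validated successfully keep their speculative results.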

Related Research Articles

Mesa is a programming language developed in the mid 1970s at the Xerox Palo Alto Research Center in Palo Alto, California, United States. The language name was a pun based upon the programming language catchphrases of the time, because Mesa is a "high level" programming language.

In computing, a virtual machine (VM) is the virtualization or emulation of a computer system. Virtual machines are based on computer architectures and provide the functionality of a physical computer. Their implementations may involve specialized hardware, software, or a combination of the two. Virtual machines differ and are organized by their function.


A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

In computing, just-in-time (JIT) compilation is compilation during execution of a program rather than before execution. This may consist of source code translation but is more commonly bytecode translation to machine code, which is then executed directly. A system implementing a JIT compiler typically continuously analyses the code being executed and identifies parts of the code where the speedup gained from compilation or recompilation would outweigh the overhead of compiling that code.


Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically, ILP refers to the average number of instructions run per step of this parallel execution.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003.
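
For example, a minimal lock-free stack push can be written as a compare-and-swap retry loop (a standard textbook sketch; pop is omitted because it raises the ABA problem). The loop is lock-free because some thread's push always succeeds, even though an individual thread may retry when it loses the race:

```cpp
#include <atomic>
#include <thread>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // If another thread changed head since we read it, compare_exchange_weak
    // reloads the current head into n->next and we retry.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}

int main() {
    std::thread t1([] { for (int i = 0; i < 1000; ++i) push(i); });
    std::thread t2([] { for (int i = 0; i < 1000; ++i) push(i); });
    t1.join(); t2.join();
}
```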

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
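
The standard publish/consume pattern below illustrates this in C++ (a minimal sketch with explicit fences; the variable names are illustrative). The release fence keeps the data write from being reordered after the flag store, and the acquire fence keeps the data read from being reordered before the flag load:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                       // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                          // (1) write data
    std::atomic_thread_fence(std::memory_order_release);   // order (1) before (2)
    ready.store(true, std::memory_order_relaxed);          // (2) publish flag
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { }     // (3) wait for flag
    std::atomic_thread_fence(std::memory_order_acquire);   // order (3) before (4)
    assert(payload == 42);                                 // (4) safe to read
}

int main() {
    std::thread a(producer), b(consumer);
    a.join(); b.join();
}
```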

In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.

In computer science and engineering, transactional memory attempts to simplify concurrent programming by allowing a group of load and store instructions to execute in an atomic way. It is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. Transactional memory systems provide high-level abstraction as an alternative to low-level thread synchronization. This abstraction allows for coordination between concurrent reads and writes of shared data in parallel systems.
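
As a sketch, GCC's experimental transactional-memory extension (enabled with -fgnu-tm; availability depends on the toolchain) lets a group of loads and stores commit atomically without explicit locks:

```cpp
// Compile with a toolchain that supports GNU TM, e.g.: g++ -fgnu-tm example.cpp
#include <cstdio>

static int balance_a = 100, balance_b = 0;

void transfer(int amount) {
    // The block commits or aborts as a single atomic transaction; no explicit
    // locks are taken, and conflicting transactions are rolled back and retried.
    __transaction_atomic {
        balance_a -= amount;
        balance_b += amount;
    }
}

int main() {
    transfer(10);
    std::printf("%d %d\n", balance_a, balance_b);
}
```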

In computer science, ahead-of-time compilation is the act of compiling an (often) higher-level programming language into an (often) lower-level language before execution of a program, usually at build time, to reduce the amount of work that must be performed at run time.


In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

In computer architecture, memory-level parallelism (MLP) is the ability to have multiple memory operations pending at the same time, in particular cache misses or translation lookaside buffer (TLB) misses.

Explicit Multi-Threading (XMT) is a computer science paradigm for building and programming parallel computers designed around the parallel random-access machine (PRAM) parallel computational model. A more direct explanation of XMT starts with the rudimentary abstraction that made serial computing simple: that any single instruction available for execution in a serial program executes immediately. A consequence of this abstraction is a step-by-step (inductive) explication of the instruction available next for execution. The rudimentary parallel abstraction behind XMT, dubbed Immediate Concurrent Execution (ICE) in Vishkin (2011), is that indefinitely many instructions available for concurrent execution execute immediately. A consequence of ICE is a step-by-step (inductive) explication of the instructions available next for concurrent execution. Moving beyond the serial von Neumann computer, the aspiration of XMT is that computer science will again be able to augment mathematical induction with a simple one-line computing abstraction.

SequenceL is a general purpose functional programming language and auto-parallelizing compiler and tool set, whose primary design objectives are performance on multi-core processor hardware, ease of programming, platform portability/optimization, and code clarity and readability. Its main advantage is that it can be used to write straightforward code that automatically takes full advantage of all the processing power available, without programmers needing to be concerned with identifying parallelism, specifying vectorization, avoiding race conditions, and other challenges of manual directive-based programming approaches such as OpenMP.


Kathryn S. McKinley is an American computer scientist noted for her research on compilers, runtime systems, and computer architecture. She is also known for her leadership in broadening participation in computing. McKinley was co-chair of CRA-W from 2011 to 2014.

In parallel computing, work stealing is a scheduling strategy for multithreaded computer programs. It solves the problem of executing a dynamically multithreaded computation, one that can "spawn" new threads of execution, on a statically multithreaded computer, with a fixed number of processors. It does so efficiently in terms of execution time, memory usage, and inter-processor communication.
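
The sketch below shows the core data structure in simplified form (mutex-protected for clarity; real implementations use carefully engineered concurrent deques): each worker pushes and pops tasks at the back of its own deque, while idle workers steal from the front of a victim's deque, which reduces contention between owner and thief:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

using Task = std::function<void()>;

struct WorkerQueue {
    std::deque<Task> dq;
    std::mutex m;

    void push(Task t) {                       // owner: push a newly spawned task
        std::lock_guard<std::mutex> g(m);
        dq.push_back(std::move(t));
    }
    std::optional<Task> pop() {               // owner: LIFO pop from the back
        std::lock_guard<std::mutex> g(m);
        if (dq.empty()) return std::nullopt;
        Task t = std::move(dq.back());
        dq.pop_back();
        return t;
    }
    std::optional<Task> steal() {             // thief: FIFO steal from the front
        std::lock_guard<std::mutex> g(m);
        if (dq.empty()) return std::nullopt;
        Task t = std::move(dq.front());
        dq.pop_front();
        return t;
    }
};

// Worker loop: run own tasks first, then try to steal from the others.
void worker_loop(int self, std::vector<WorkerQueue>& queues,
                 std::atomic<bool>& done) {
    while (!done) {
        if (auto t = queues[self].pop()) { (*t)(); continue; }
        for (std::size_t v = 0; v < queues.size(); ++v)
            if (v != static_cast<std::size_t>(self))
                if (auto t = queues[v].steal()) { (*t)(); break; }
    }
}

int main() {
    std::vector<WorkerQueue> queues(2);
    std::atomic<bool> done{false};
    for (int i = 0; i < 8; ++i)               // all tasks start on worker 0
        queues[0].push([i] { std::printf("task %d\n", i); });
    std::thread w0(worker_loop, 0, std::ref(queues), std::ref(done));
    std::thread w1(worker_loop, 1, std::ref(queues), std::ref(done));
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    done = true;
    w0.join(); w1.join();
}
```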

Automatic bug-fixing is the automatic repair of software bugs without the intervention of a human programmer. It is also commonly referred to as automatic patch generation, automatic bug repair, or automatic program repair. The typical goal of such techniques is to automatically generate correct patches to eliminate bugs in software programs without causing software regression.

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed. Most modern computer processors have fast and local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.
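
In software, a compiler intrinsic can issue prefetches explicitly. The sketch below uses the GCC/Clang builtin __builtin_prefetch with an illustrative prefetch distance of 16 elements; the useful distance in practice depends on memory latency and loop cost:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

long sum(const long* data, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Hint: fetch the cache line 16 elements ahead (read, high locality).
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/3);
        s += data[i];
    }
    return s;
}

int main() {
    std::vector<long> v(1 << 20, 1);
    std::printf("%ld\n", sum(v.data(), v.size()));
}
```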

References

  1. Estebanez, Alvaro (2017). "A Survey on Thread-Level Speculation Techniques". ACM Computing Surveys. 49 (2): 1–39. doi:10.1145/2938369. S2CID 423292.
  2. Martínez, José F.; Torrellas, Josep (2002). "Speculative synchronization" (PDF). Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X). ACM. p. 18. doi:10.1145/605397.605400. ISBN 1581135742. S2CID 9189828. Archived from the original (PDF) on 2018-11-18.
  3. García Yaguez, Alvaro (2014). "Squashing Alternatives for Software-based Speculative Parallelization". IEEE Transactions on Computers. 63 (7): 1826–1839. doi:10.1109/TC.2013.46. S2CID 14081801.
