Work stealing

Last updated

In parallel computing, work stealing is a scheduling strategy for multithreaded computer programs. It solves the problem of executing a dynamically multithreaded computation, one that can "spawn" new threads of execution, on a statically multithreaded computer, with a fixed number of processors (or cores). It does so efficiently in terms of execution time, memory usage, and inter-processor communication.

Contents

In a work stealing scheduler, each processor in a computer system has a queue of work items (computational tasks, threads) to perform. Each work item consists of a series of instructions, to be executed sequentially, but in the course of its execution, a work item may also spawn new work items that can feasibly be executed in parallel with its other work. These new items are initially put on the queue of the processor executing the work item. When a processor runs out of work, it looks at the queues of the other processors and "steals" their work items. In effect, work stealing distributes the scheduling work over idle processors, and as long as all processors have work to do, no scheduling overhead occurs. [1]

Work stealing contrasts with work sharing, another popular scheduling approach for dynamic multithreading, where each work item is scheduled onto a processor when it is spawned. Compared to this approach, work stealing reduces the amount of process migration between processors, because no such migration occurs when all processors have work to do. [2]

The idea of work stealing goes back to the implementation of the Multilisp programming language and work on parallel functional programming languages in the 1980s. [2] It is employed in the scheduler for the Cilk programming language, [3] the Java fork/join framework, [4] the .NET Task Parallel Library, [5] and the Rust Tokio runtime. [6] [7]

Execution model

Work stealing is designed for a "strict" fork–join model of parallel computation, which means that a computation can be viewed as a directed acyclic graph with a single source (start of computation) and a single sink (end of computation). Each node in this graph represents either a fork or a join. Forks produce multiple logically parallel computations, variously called "threads" [2] or "strands". [8] Edges represent serial computation. [9] [note 1]

As an example, consider the following trivial fork–join program in Cilk-like syntax:

function f(a, b):     c ← fork g(a)     d ← h(b)     join     return c + d  function g(a):     return a × 2  function h(a):     b ← fork g(a)     c ← a + 1     join     return b + c

The function call f(1, 2) gives rise to the following computation graph:

Fork-join computation.svg

In the graph, when two edges leave a node, the computations represented by the edge labels are logically parallel: they may be performed either in parallel, or sequentially. The computation may only proceed past a join node when the computations represented by its incoming edges are complete. The work of a scheduler, now, is to assign the computations (edges) to processors in a way that makes the entire computation run to completion in the correct order (as constrained by the join nodes), preferably as fast as possible.

Algorithm

The randomized version of the work stealing algorithm presented by Blumofe and Leiserson maintains several threads of execution and schedules these onto processors. Each of the processors has a double-ended queue (deque) of threads. Call the ends of the deque "top" and "bottom".

Each processor that has a current thread to execute, executes the instructions in the thread one by one, until it encounters an instruction that causes one of four "special" behaviors: [2] :10

Initially, a computation consists of a single thread and is assigned to some processor, while the other processors start off idle. Any processor that becomes idle starts the actual process of work stealing, which means the following:

Child stealing vs. continuation stealing

Note that, in the rule for spawn, Blumofe and Leiserson suggest that the "parent" thread execute its new thread, as if performing a function call (in the C-like program f(x); g(y);, the function call to f completes before the call to g is performed). This is called "continuation stealing", because the continuation of the function can be stolen while the spawned thread is executed, and is the scheduling algorithm used in Cilk Plus. [8] It is not the only way to implement work stealing; the alternative strategy is called "child stealing" and is easier to implement as a library, without compiler support. [8] Child stealing is used by Threading Building Blocks, Microsoft's Task Parallel Library and OpenMP, although the latter gives the programmer control over which strategy is used. [8]

Efficiency

Several variants of work stealing have been proposed. The randomized variant due to Blumofe and Leiserson executes a parallel computation in expected time on processors; here, is the work, or the amount of time required to run the computation on a serial computer, and is the span, the amount of time required on an infinitely parallel machine. [note 2] This means that, in expectation, the time required is at most a constant factor times the theoretical minimum. [2] However, the running time (in particular, the number of steals executed) can be exponential in in the worst case. [10] A localized variant, in which a processor attempts to steal back its own work whenever it is free, has also been analyzed theoretically and practically. [11] [12]

Space usage

A computation scheduled by the Blumofe–Leiserson version of work stealing uses stack space, if were the stack usage of the same computation on a single processor, [2] fitting the authors' own earlier definition of space efficiency. [13] This bound requires continuation stealing; in a child stealing scheduler, it does not hold, as can be seen from the following example: [8]

for i = 0 to n:     fork f(i) join

In a child-stealing implementation, all "forked" calls to f are put in a work queue that thus grows to size n, which can be made arbitrarily large.

Multiprogramming variant

The work stealing algorithm as outlined earlier, and its analysis, assume a computing environment where a computation is scheduled onto a set of dedicated processors. In a multiprogramming (multi-tasking) environment, the algorithm must be modified to instead schedule computation tasks onto a pool of worker threads, which in turn are scheduled onto the actual processors by an operating system scheduler. At any given time, the OS scheduler will assign to the work stealing process some number PAP of the P processors in the computer, because other processes may be using the remaining processors. In this setting, work stealing with a pool of P worker threads has the problem that workers acting as thieves may cause livelock: they may block the execution of workers that would actually spawn useful tasks. [14] [15]

A variant of work stealing has been devised for this situation, which executes a computation in expected time

where Pavg is the average number of processors allocated to the computation by the OS scheduler over the computation's running time. [16] The multiprogramming work-scheduler differs from the traditional version in two respects:

Attempts to improve on the multiprogramming work stealer have focused on cache locality issues [12] and improved queue data structures. [17]

Alternatives

Several scheduling algorithms for dynamically multithreaded computations compete with work stealing. Besides the traditional work sharing approach, there is a scheduler called parallel depth-first (PDF) that improves on the space bounds of work stealing, [18] as well giving better performance in some situations where the cores of a chip multiprocessor share a cache. [1]

Notes

  1. In the original presentation, serial computations were represented as nodes as well, and a directed edge represented the relation "is followed by".
  2. See analysis of parallel algorithms for definitions.

Related Research Articles

In computer science, a double-ended queue is an abstract data type that generalizes a queue, for which elements can be added to or removed from either the front (head) or back (tail). It is also often called a head-tail linked list, though properly this refers to a specific data structure implementation of a deque.

<span class="mw-page-title-main">Process (computing)</span> Particular execution of a computer program

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

<span class="mw-page-title-main">Thread (computing)</span> Smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems. In Modern Operating Systems, Tanenbaum shows that many distinct models of process organization are possible. In many cases, a thread is a component of a process. The multiple threads of a given process may be executed concurrently, sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

<span class="mw-page-title-main">Superscalar processor</span> CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

In computing, scheduling is the action of assigning resources to perform tasks. The resources may be processors, network links or expansion cards. The tasks may be threads, processes or data flows.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003.

Cilk, Cilk++, Cilk Plus and OpenCilk are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.

In computer engineering, out-of-order execution is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

<span class="mw-page-title-main">Charles E. Leiserson</span> American computer scientist

Charles Eric Leiserson is a computer scientist, specializing in the theory of parallel computing and distributed computing, and particularly practical applications thereof. As part of this effort, he developed the Cilk multithreaded language. He invented the fat-tree interconnection network, a hardware-universal interconnection network used in many supercomputers, including the Connection Machine CM5, for which he was network architect. He helped pioneer the development of VLSI theory, including the retiming method of digital optimization with James B. Saxe and systolic arrays with H. T. Kung. He conceived of the notion of cache-oblivious algorithms, which are algorithms that have no tuning parameters for cache size or cache-line length, but nevertheless use cache near-optimally. He developed the Cilk language for multithreaded programming, which uses a provably good work-stealing algorithm for scheduling. Leiserson coauthored the standard algorithms textbook Introduction to Algorithms together with Thomas H. Cormen, Ronald L. Rivest, and Clifford Stein.

<span class="mw-page-title-main">Task (computing)</span> Unit of execution or work in software

In computing, a task is a unit of execution or a unit of work. The term is ambiguous; precise alternative terms include process, light-weight process, thread, step, request, or query. In the adjacent diagram, there are queues of incoming work to do and outgoing completed work, and a thread pool of threads to perform this work. Either the work units themselves or the threads that perform the work can be referred to as "tasks", and these can be referred to respectively as requests/responses/threads, incoming tasks/completed tasks/threads, or requests/responses/tasks.

<span class="mw-page-title-main">Binary Modular Dataflow Machine</span>

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

Thread Level Speculation (TLS), also known as Speculative Multithreading, or Speculative Parallelization, is a technique to speculatively execute a section of computer code that is anticipated to be executed later in parallel with the normal execution on a separate independent thread. Such a speculative thread may need to make assumptions about the values of input variables. If these prove to be invalid, then the portions of the speculative thread that rely on these input variables will need to be discarded and squashed. If the assumptions are correct the program can complete in a shorter time provided the thread was able to be scheduled efficiently.

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

<span class="mw-page-title-main">Parallel Extensions</span>

Parallel Extensions was the development name for a managed concurrency library developed by a collaboration between Microsoft Research and the CLR team at Microsoft. The library was released in version 4.0 of the .NET Framework. It is composed of two parts: Parallel LINQ (PLINQ) and Task Parallel Library (TPL). It also consists of a set of coordination data structures (CDS) – sets of data structures used to synchronize and co-ordinate the execution of concurrent tasks.

Uzi Vishkin is a computer scientist at the University of Maryland, College Park, where he is Professor of Electrical and Computer Engineering at the University of Maryland Institute for Advanced Computer Studies (UMIACS). Uzi Vishkin is known for his work in the field of parallel computing. In 1996, he was inducted as a Fellow of the Association for Computing Machinery, with the following citation: "One of the pioneers of parallel algorithms research, Dr. Vishkin's seminal contributions played a leading role in forming and shaping what thinking in parallel has come to mean in the fundamental theory of Computer Science."

Explicit Multi-Threading (XMT) is a computer science paradigm for building and programming parallel computers designed around the parallel random-access machine (PRAM) parallel computational model. A more direct explanation of XMT starts with the rudimentary abstraction that made serial computing simple: that any single instruction available for execution in a serial program executes immediately. A consequence of this abstraction is a step-by-step (inductive) explication of the instruction available next for execution. The rudimentary parallel abstraction behind XMT, dubbed Immediate Concurrent Execution (ICE) in Vishkin (2011), is that indefinitely many instructions available for concurrent execution execute immediately. A consequence of ICE is a step-by-step (inductive) explication of the instructions available next for concurrent execution. Moving beyond the serial von Neumann computer, the aspiration of XMT is that computer science will again be able to augment mathematical induction with a simple one-line computing abstraction.

<span class="mw-page-title-main">Fork–join model</span> Way of setting up and executing parallel computer programs

In parallel computing, the fork–join model is a way of setting up and executing parallel programs, such that execution branches off in parallel at designated points in the program, to "join" (merge) at a subsequent point and resume sequential execution. Parallel sections may fork recursively until a certain task granularity is reached. Fork–join can be considered a parallel design pattern. It was formulated as early as 1963.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

In computer science, the analysis of parallel algorithms is the process of finding the computational complexity of algorithms executed in parallel – the amount of time, storage, or other resources needed to execute them. In many respects, analysis of parallel algorithms is similar to the analysis of sequential algorithms, but is generally more involved because one must reason about the behavior of multiple cooperating threads of execution. One of the primary goals of parallel analysis is to understand how a parallel algorithm's use of resources changes as the number of processors is changed.

References

  1. 1 2 Chen, Shimin; Gibbons, Phillip B.; Kozuch, Michael; Liaskovitis, Vasileios; Ailamaki, Anastassia; Blelloch, Guy E.; Falsafi, Babak; Fix, Limor; Hardavellas, Nikos; Mowry, Todd C.; Wilkerson, Chris (2007). Scheduling threads for constructive cache sharing on CMPs (PDF). Proc. ACM Symp. on Parallel Algorithms and Architectures. pp. 105–115.
  2. 1 2 3 4 5 6 Blumofe, Robert D.; Leiserson, Charles E. (1999). "Scheduling multithreaded computations by work stealing" (PDF). J ACM. 46 (5): 720–748. doi:10.1145/324133.324234. S2CID   5428476.
  3. Blumofe, Robert D.; Joerg, Christopher F.; Kuszmaul, Bradley C.; Leiserson, Charles E.; Randall, Keith H.; Zhou, Yuli (1996). "Cilk: An efficient multithreaded runtime system". Journal of Parallel and Distributed Computing. 37 (1): 55–69. doi: 10.1006/jpdc.1996.0107 .
  4. Doug Lea (2000). A Java fork/join framework (PDF). ACM Conf. on Java.
  5. Leijen, Daan; Schulte, Wolfram; Burckhardt, Sebastian (2009). "The Design of a Task Parallel Library". ACM SIGPLAN Notices. 44 (10): 227. CiteSeerX   10.1.1.146.4197 . doi:10.1145/1639949.1640106.
  6. "What is Tokio? · Tokio". tokio.rs. Retrieved 2020-05-27.
  7. Krill, Paul (2021-01-08). "Tokio Rust runtime reaches 1.0 status". InfoWorld. Retrieved 2021-12-26.
  8. 1 2 3 4 5 Robison, Arch (15 January 2014). A Primer on Scheduling Fork–Join Parallelism with Work Stealing (PDF) (Technical report). ISO/IEC JTC 1/SC 22/WG 21—The C++ Standards Committee. N3872.
  9. Halpern, Pablo (24 September 2012). Strict Fork–Join Parallelism (PDF) (Technical report). ISO/IEC JTC 1/SC 22/WG 21—The C++ Standards Committee. N3409=12-0099.
  10. Leiserson, Charles E.; Schardl, Tao B.; Suksompong, Warut (2016). "Upper Bounds on Number of Steals in Rooted Trees". Theory of Computing Systems. 58 (2): 223–240. arXiv: 1706.08219 . doi:10.1007/s00224-015-9613-9. S2CID   424692.
  11. Suksompong, Warut; Leiserson, Charles E.; Schardl, Tao B. (2016). "On the efficiency of localized work stealing". Information Processing Letters. 116 (2): 100–106. arXiv: 1804.04773 . doi:10.1016/j.ipl.2015.10.002. S2CID   1180480.
  12. 1 2 Acar, Umut A.; Blelloch, Guy E.; Blumofe, Robert D. (2002). "The Data Locality of Work Stealing" (PDF). Theory of Computing Systems. 35 (3): 321–347. CiteSeerX   10.1.1.19.3459 . doi:10.1007/s00224-002-1057-3. S2CID   10235838.
  13. Blumofe, Robert D.; Leiserson, Charles E. (1998). "Space-efficient scheduling of multithreaded computations". SIAM J. Comput. 27 (1): 202–229. CiteSeerX   10.1.1.48.9822 . doi:10.1137/s0097539793259471.
  14. Ding, Xiaoning; Wang, Kaibo; Gibbons, Phillip B.; Zhang, Xiaodong (2012). BWS: Balanced Work Stealing for Time-Sharing Multicores (PDF). EuroSys.
  15. Blumofe, Robert D.; Papadopoulos, Dionisios (1998). The Performance of Work Stealing in Multiprogrammed Environments (Technical report). University of Texas at Austin, Department of Computer Sciences. CiteSeerX   10.1.1.48.2247 .
  16. Arora, Nimar S.; Blumofe, Robert D.; Plaxton, C. Greg (2001). "Thread scheduling for multiprogrammed multiprocessors" (PDF). Theory of Computing Systems. 34 (2): 115–144. doi:10.1007/s002240011004.
  17. Chase, David R.; Lev, Yosef (2005). Dynamic Circular Work-Stealing Deque. ACM Symp. on Parallelism in Algorithms and Architectures. CiteSeerX   10.1.1.170.1097 .
  18. Blelloch, Guy E.; Gibbons, Phillip B.; Matias, Yossi (1999). "Provably efficient scheduling for languages with fine-grained parallelism" (PDF). Journal of the ACM . 46 (2): 281–321. CiteSeerX   10.1.1.48.8238 . doi:10.1145/301970.301974. S2CID   47102937.