Threading Building Blocks

Last updated

Threading Building Blocks
Developer(s) Intel
Stable release
2019 Update 8 / June 6, 2019;10 months ago (2019-06-06) [1]
Repository OOjs UI icon edit-ltr-progressive.svg
Written in C++
Operating system FreeBSD, Linux, Solaris, OS X, Windows, Android
Type library or framework
License dual: commercial / open source (Apache 2.0), plus Freeware [2]

Threading Building Blocks (TBB) is a C++ template library developed by Intel for parallel programming on multi-core processors. Using TBB, a computation is broken down into tasks that can run in parallel. The library manages and schedules threads to execute these tasks.



A TBB program creates, synchronizes and destroys graphs of dependent tasks according to algorithms, i.e. high-level parallel programming paradigms (a.k.a. Algorithmic Skeletons). Tasks are then executed respecting graph dependencies. This approach groups TBB in a family of techniques for parallel programming aiming to decouple the programming from the particulars of the underlying machine.

TBB implements work stealing to balance a parallel workload across available processing cores in order to increase core utilization and therefore scaling. Initially, the workload is evenly divided among the available processor cores. If one core completes its work while other cores still have a significant amount of work in their queue, TBB reassigns some of the work from one of the busy cores to the idle core. This dynamic capability decouples the programmer from the machine, allowing applications written using the library to scale to utilize the available processing cores with no changes to the source code or the executable program file. In a 2008 assessment of the work stealing implementation in TBB, researchers from Princeton University found that it was suboptimal for large numbers of processors cores, causing up to 47% of computing time spent in scheduling overhead when running certain benchmarks on a 32-core system. [3]

TBB, like the STL (and the part of the C++ standard library based on it), uses templates extensively. This has the advantage of low-overhead polymorphism, since templates are a compile-time construct which modern C++ compilers can largely optimize away.

Intel TBB is available commercially as a binary distribution with support, [4] and as open-source software in both source and binary forms. [5]

TBB does not provide guarantees of determinism or freedom from data races. [6]

Library contents

TBB is a collection of components for parallel programming:

Systems supported

The TBB commercial release 3.0 supports Windows (XP or newer), OS X (version 10.5.8 or higher) and Linux using Visual C++ (version 8.0 or higher, on Windows only), Intel C++ Compiler (version 11.1 or higher) or the GNU Compiler Collection (gcc). [7] Additionally, the TBB open source community has contributed patches for Solaris, [8] PowerPC, Xbox 360, QNX Neutrino, and FreeBSD.

See also


  1. "Intel® Threading Building Blocks Github Releases".
  2. "No Cost Options for Intel Parallel Studio XE, Support yourself, Royalty-Free".
  3. Contreras, Gilberto; Martonosi, Margaret (2008). Characterizing and improving the performance of Intel Threading Building Blocks (PDF). IEEE Int'l Symp. on Workload Characterization.
  4. Intel Threading Building Blocks Commercial Version Homepage
  5. Threading Building Blocks Open Source Project Homepage
  6. Bocchino Jr., Robert L.; Adve, Vikram S.; Adve, Sarita V.; Snir, Marc (2009). Parallel Programming Must Be Deterministic by Default. USENIX Workshop on Hot Topics in Parallelism.
  7. "Intel Threading Building Blocks - Release Notes Version 3.0" . Retrieved 2011-08-08.
  8. "Using Intel's Threaded Building Blocks (TBB) With Sun Studio Express" . Retrieved 2008-05-08.

Related Research Articles

Thread (computing) smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

Parallel computing programming paradigm in which many calculations or the execution of processes are carried out simultaneously

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but it's gaining broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

OpenMP open standard for parallelizing

The application programming interface (API) OpenMP supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.

Thread pool

In computer programming, a thread pool is a software design pattern for achieving concurrency of execution in a computer program. Often also called a replicated workers or worker-crew model, a thread pool maintains multiple threads waiting for tasks to be allocated for concurrent execution by the supervising program. By maintaining a pool of threads, the model increases performance and avoids latency in execution due to frequent creation and destruction of threads for short-lived tasks. The number of available threads is tuned to the computing resources available to the program, such as a parallel task queue after completion of execution.

Cilk, Cilk++ and Cilk Plus are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.

In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become false. Monitors also have a mechanism for signaling other threads that their condition has been met. A monitor consists of a mutex (lock) object and condition variables. A condition variable essentially is a container of threads that are waiting for a certain condition. Monitors provide a mechanism for threads to temporarily give up exclusive access in order to wait for some condition to be met, before regaining exclusive access and resuming their task.

In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a sequential language, as an extension to an existing language, or as an entirely new language.

Multi-core processor Microprocessor with more than one processing unit

A multi-core processor is a computer processor integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions, as if the computer had several processors. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core. A multi-core processor implements multiprocessing in a single physical package. Designers may couple cores in a multi-core device tightly or loosely. For example, cores may or may not share caches, and they may implement message passing or shared-memory inter-core communication methods. Common network topologies to interconnect cores include bus, ring, two-dimensional mesh, and crossbar. Homogeneous multi-core systems include only identical cores; heterogeneous multi-core systems have cores that are not identical. Just as with single-processor systems, cores in multi-core systems may implement architectures such as VLIW, superscalar, vector, or multithreading.

Task parallelism is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors. In contrast to data parallelism which involves running the same task on different components of data, task parallelism is distinguished by running many different tasks at the same time on the same data. A common type of task parallelism is pipelining which consists of moving a single set of data through a series of separate tasks where each task can execute independently of the others.

The Sieve C++ Parallel Programming System is a C++ compiler and parallel runtime designed and released by Codeplay that aims to simplify the parallelization of code so that it may run efficiently on multi-processor or multi-core systems. It is an alternative to other well-known parallelisation methods such as OpenMP, the RapidMind Development Platform and Threading Building Blocks (TBB).

Relaxed sequential in computer science is an execution model describing the ability for a parallel program to run sequentially. If a parallel program has a valid sequential execution it is said to follow a relaxed sequential execution model. It does not need to be efficient.

Parallel Extensions

Parallel Extensions was the development name for a managed concurrency library developed by a collaboration between Microsoft Research and the CLR team at Microsoft. The library was released in version 4.0 of the .NET Framework. It is composed of two parts: Parallel LINQ (PLINQ) and Task Parallel Library (TPL). It also consists of a set of coordination data structures (CDS) – sets of data structures used to synchronize and co-ordinate the execution of concurrent tasks.

Grand Central Dispatch, is a technology developed by Apple Inc. to optimize application support for systems with multi-core processors and other symmetric multiprocessing systems. It is an implementation of task parallelism based on the thread pool pattern. The fundamental idea is to move the management of the thread pool out of the hands of the developer, and closer to the operating system. The developer injects "work packages" into the pool oblivious of the pool's architecture. This model improves simplicity, portability and performance.

Intel Parallel Studio XE is a software development product developed by Intel that facilitates native code development on Windows, macOS and Linux in C++ and Fortran for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors.

Concurrent Collections is a programming model for software frameworks to expose parallelism in applications. The Concurrent Collections conception originated from tagged stream processing development with HP TStreams.

Intel Array Building Blocks was a C++ library developed by Intel Corporation for exploiting data parallel portions of programs to take advantage of multi-core processors, graphics processing units and Intel Many Integrated Core Architecture processors. ArBB provides a generalized vector parallel programming solution designed to avoid direct dependencies on particular low-level parallelism mechanisms or hardware architectures. ArBB is oriented to applications that require data-intensive mathematical computations. By default, ArBB programs cannot create data races or deadlocks.

Intel Parallel Building Blocks (PBB) was a collection of three programming solutions designed for multithreaded parallel computing. PBB consisted of Cilk Plus, Threading Building Blocks (TBB) and Intel Array Building Blocks (ArBB).

In parallel computing, work stealing is a scheduling strategy for multithreaded computer programs. It solves the problem of executing a dynamically multithreaded computation, one that can "spawn" new threads of execution, on a statically multithreaded computer, with a fixed number of processors. It does so efficiently in terms of execution time, memory usage, and inter-processor communication.

Fork–join model way of designing and executing parallel computations

In parallel computing, the fork–join model is a way of setting up and executing parallel programs, such that execution branches off in parallel at designated points in the program, to "join" (merge) at a subsequent point and resume sequential execution. Parallel sections may fork recursively until a certain task granularity is reached. Fork–join can be considered a parallel design pattern. It was formulated as early as 1963.