Threading Building Blocks

Threading Building Blocks
Developer(s)	Intel
Stable release	2021.8 / February 17, 2023;17 months ago
Repository	github.com/oneapi-src/oneTBB ;
Written in	C++
Operating system	FreeBSD, Linux, Solaris, macOS, Windows, Android
Type	library or framework
License	dual: commercial / open source (Apache 2.0), plus Freeware
Website	github.com/oneapi-src/oneTBB ; intel.com/oneTBB ;

Last updated July 28, 2024

oneAPI Threading Building Blocks (oneTBB; formerly Threading Building Blocks or TBB) is a C++ template library developed by Intel for parallel programming on multi-core processors. Using TBB, a computation is broken down into tasks that can run in parallel. The library manages and schedules threads to execute these tasks.

Overview

A oneTBB program creates, synchronizes, and destroys graphs of dependent tasks according to algorithms, i.e. high-level parallel programming paradigms (a.k.a. Algorithmic Skeletons). Tasks are then executed respecting graph dependencies. This approach groups TBB in a family of techniques for parallel programming aiming to decouple the programming from the particulars of the underlying machine.

oneTBB implements work stealing to balance a parallel workload across available processing cores in order to increase core utilization and therefore scaling. Initially, the workload is evenly divided among the available processor cores. If one core completes its work while other cores still have a significant amount of work in their queue, oneTBB reassigns some of the work from one of the busy cores to the idle core. This dynamic capability decouples the programmer from the machine, allowing applications written using the library to scale to utilize the available processing cores with no changes to the source code or the executable program file. In a 2008 assessment of the work stealing implementation in TBB, researchers from Princeton University found that it was suboptimal for large numbers of processors cores, causing up to 47% of computing time spent in scheduling overhead when running certain benchmarks on a 32-core system.^[3]

oneTBB, like the STL (and the part of the C++ standard library based on it), uses templates extensively. This has the advantage of low-overhead polymorphism, since templates are a compile-time construct which modern C++ compilers can largely optimize away.

oneTBB is available commercially as a binary distribution with support,^[4] and as open-source software in both source and binary forms.

oneTBB does not provide guarantees of determinism or freedom from data races.^[5]

Library contents

oneTBB is a collection of components for parallel programming:

Basic algorithms: parallel_for, parallel_reduce, parallel_scan
Advanced algorithms: parallel_pipeline, parallel_sort
Containers: concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map, concurrent_unordered_map, concurrent_unordered_set, concurrent_map, concurrent_set
Memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator
Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex
Timing: portable fine grained global time stamp
Task scheduler: direct access to control the creation and activation of tasks

Notes

↑ "oneAPI Threading Building Blocks Github Releases". GitHub .
↑ "No Cost Options for Intel Support yourself, Royalty-Free".
↑ Contreras, Gilberto; Martonosi, Margaret (2008). Characterizing and improving the performance of Intel Threading Building Blocks (PDF). IEEE Int'l Symp. on Workload Characterization.
↑ https://software.intel.com/en-us/intel-tbb Intel Threading Building Blocks Commercial Version Homepage
↑ Bocchino Jr., Robert L.; Adve, Vikram S.; Adve, Sarita V.; Snir, Marc (2009). Parallel Programming Must Be Deterministic by Default. USENIX Workshop on Hot Topics in Parallelism.

Related Research Articles

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In computer science, a lock or mutex is a synchronization primitive that prevents state from being modified or accessed by multiple threads of execution at once. Locks enforce mutual exclusion concurrency control policies, and with a variety of possible methods there exist multiple unique implementations for different applications.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003.

In concurrent programming, a monitor is a synchronization construct that prevents threads from concurrently accessing a shared object's state and allows them to wait for the state to change. They provide a mechanism for threads to temporarily give up exclusive access in order to wait for some condition to be met, before regaining exclusive access and resuming their task. A monitor consists of a mutex (lock) and at least one condition variable. A condition variable is explicitly 'signalled' when the object's state is modified, temporarily passing the mutex to another thread 'waiting' on the conditional variable.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

Task parallelism is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors. In contrast to data parallelism which involves running the same task on different components of data, task parallelism is distinguished by running many different tasks at the same time on the same data. A common type of task parallelism is pipelining, which consists of moving a single set of data through a series of separate tasks where each task can execute independently of the others.

Parallel Extensions was the development name for a managed concurrency library developed by a collaboration between Microsoft Research and the CLR team at Microsoft. The library was released in version 4.0 of the .NET Framework. It is composed of two parts: Parallel LINQ (PLINQ) and Task Parallel Library (TPL). It also consists of a set of coordination data structures (CDS) – sets of data structures used to synchronize and co-ordinate the execution of concurrent tasks.

<span class="mw-page-title-main">Grand Central Dispatch</span> Technology developed by Apple Inc

Grand Central Dispatch is a technology developed by Apple Inc. to optimize application support for systems with multi-core processors and other symmetric multiprocessing systems. It is an implementation of task parallelism based on the thread pool pattern. The fundamental idea is to move the management of the thread pool out of the hands of the developer, and closer to the operating system. The developer injects "work packages" into the pool oblivious of the pool's architecture. This model improves simplicity, portability and performance.

Intel Parallel Studio XE was a software development product developed by Intel that facilitated native code development on Windows, macOS and Linux in C++ and Fortran for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors.

Concurrent Collections is a programming model for software frameworks to expose parallelism in applications. The Concurrent Collections conception originated from tagged stream processing development with HP TStreams.

STAPL (Standard Template Adaptive Parallel Library) is a library for C++, similar and compatible to STL. It provides parallelism support for writing applications for systems with shared or distributed memory.

Intel Array Building Blocks was a C++ library developed by Intel Corporation for exploiting data parallel portions of programs to take advantage of multi-core processors, graphics processing units and Intel Many Integrated Core Architecture processors. ArBB provides a generalized vector parallel programming solution designed to avoid direct dependencies on particular low-level parallelism mechanisms or hardware architectures. ArBB is oriented to applications that require data-intensive mathematical computations. By default, ArBB programs cannot create data races or deadlocks.

Tachyon is a parallel/multiprocessor ray tracing software. It is a parallel ray tracing library for use on distributed memory parallel computers, shared memory computers, and clusters of workstations. Tachyon implements rendering features such as ambient occlusion lighting, depth-of-field focal blur, shadows, reflections, and others. It was originally developed for the Intel iPSC/860 by John Stone for his M.S. thesis at University of Missouri-Rolla. Tachyon subsequently became a more functional and complete ray tracing engine, and it is now incorporated into a number of other open source software packages such as VMD, and SageMath. Tachyon is released under a permissive license.

In parallel computing, the fork–join model is a way of setting up and executing parallel programs, such that execution branches off in parallel at designated points in the program, to "join" (merge) at a subsequent point and resume sequential execution. Parallel sections may fork recursively until a certain task granularity is reached. Fork–join can be considered a parallel design pattern. It was formulated as early as 1963.

Dask is an open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.

References

Voss, Michael; Asenjo, Rafael; Reinders, James (2019), Pro TBB, Apress, doi: 10.1007/978-1-4842-4398-5 , ISBN 978-1-4842-4397-8, S2CID 195847637
Reinders, James (July 2007), Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism (Paperback ed.), Sebastopol: O'Reilly Media, ISBN 978-0-596-51480-8
Voss, M. (October 2006), Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms, archived from the original on 2012-02-05, retrieved 2007-06-06
Voss, M. (December 2006), Enable Safe, Scalable Parallelism with Intel Threading Building Blocks' Concurrent Containers, archived from the original on 2012-02-05, retrieved 2007-06-06
Hudson, Richard L.; Saha, Bratin; Adl-Tabatabai, Ali-Reza; Hertzberg, Benjamin C. (2006), "McRT-Malloc", Proceedings of the 2006 international symposium on Memory management - ISMM '06, pp. 74–83, doi:10.1145/1133956.1133967, ISBN 978-1595932211, S2CID 9120368

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "oneAPI Threading Building Blocks Github Releases". GitHub .

[freelib-2] "No Cost Options for Intel Support yourself, Royalty-Free".

[contreras-3] Contreras, Gilberto; Martonosi, Margaret (2008). Characterizing and improving the performance of Intel Threading Building Blocks (PDF). IEEE Int'l Symp. on Workload Characterization.

[TBB.com-4] ttps://software.intel.com/en-us/intel-tbb Intel Threading Building Blocks Commercial Version Homepage

[5] Bocchino Jr., Robert L.; Adve, Vikram S.; Adve, Sarita V.; Snir, Marc (2009). Parallel Programming Must Be Deterministic by Default. USENIX Workshop on Hot Topics in Parallelism.

[1]

[2]

[3]

[4]

[5]

v t e Intel software
Items in italics are no longer maintained or have planned end-of-life dates.
Development	Parallel Studio C++ Compiler Fortran Compiler Advisor Inspector INTERP/80 VTune
Components	Data Analytics Library (DAL) Integrated Performance Primitives (IPP) Math Kernel Library (MKL) Threading Building Blocks (TBB)
Open source	Data Analytics Library (DAL) Threading Building Blocks (TBB) Tizen OpenVINO
Software programs	Telekinesys Research ¹ Havok ¹ Vision ¹
Organizations	Developer Zone Research
¹Sold to Microsoft

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing