Thread block


A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a block varies with the available shared memory. The number of threads in a thread block is also limited by the architecture, to a total of 512 threads per block. [1] The threads in the same thread block run on the same streaming multiprocessor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.
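
The following CUDA kernel is a minimal sketch of such intra-block cooperation, assuming a launch with 512 threads per block; the kernel and array names are illustrative rather than taken from the cited sources. The threads of one block load data into shared memory, synchronize at a barrier, and cooperatively reduce it to a single sum.

// Illustrative sketch: the threads of a 512-thread block cooperate through
// shared memory and barrier synchronization (__syncthreads).
__global__ void blockSumKernel(const float* in, float* out)
{
    __shared__ float partial[512];               // visible to all threads of this block
    unsigned int tid = threadIdx.x;

    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                             // barrier: wait until every thread has loaded

    // Tree reduction within the block.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                         // barrier after each reduction step
    }

    if (tid == 0)
        out[blockIdx.x] = partial[0];            // thread 0 writes the block's result
}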

Shared memory

In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors.

In computer science, synchronization refers to one of two distinct but related concepts: synchronization of processes, and synchronization of data. Process synchronization refers to the idea that multiple processes are to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action. Data synchronization refers to the idea of keeping multiple copies of a dataset in coherence with one another, or to maintain data integrity. Process synchronization primitives are commonly used to implement data synchronization.


Multiple blocks are combined to form a grid. All the blocks in the same grid contain the same number of threads. Since the number of threads in a block is limited to 512, grids are needed for computations that require more threads than a single block can hold; the blocks of a grid then operate in parallel.

CUDA is a parallel computing platform and programming model that higher-level languages can use to exploit parallelism. In CUDA, the kernel is executed with the aid of threads. The thread is an abstract entity that represents the execution of the kernel. A kernel is a small program or a function. Multithreaded applications use many such threads running at the same time to organize parallel computation. Every thread has an index, which is used for calculating memory address locations and also for making control decisions.

CUDA

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU. The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

Parallel computing

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but it has been gaining broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

A programming model refers to the style of programming where execution is invoked by making what appear to be library calls. Examples include the POSIX Threads library and Hadoop's MapReduce. In both cases, the execution model is different from that of the base language in which the code is written. For example, the C programming language has no execution model for input/output or thread behavior. But such behavior can be invoked from C syntax, by making what appears to be a call to a normal C library.

Dimensions

CUDA operates on a heterogeneous programming model in which both a host and a device are used to run application programs. Its execution model is similar to that of OpenCL. In this model, an application starts executing on the host, which is usually a CPU. The device is a throughput-oriented processor, i.e., a GPU, which performs parallel computations. Kernel functions are used to perform these parallel executions. Once the kernel functions have executed, control is passed back to the host, which resumes serial execution.
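
A minimal sketch of this host/device flow is shown below; the kernel, the work it does and the problem size are illustrative.

// Sketch of the host/device execution pattern described above.
#include <cuda_runtime.h>

__global__ void clearKernel(float* data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;   // each thread clears one element
}

int main()
{
    // Serial execution on the host (CPU).
    float* d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    // Parallel execution on the device (GPU): 2 blocks of 512 threads.
    clearKernel<<<2, 512>>>(d_data);

    // The launch is asynchronous; the host waits here, then resumes serial execution.
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}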

Heterogeneous computing refers to systems that use more than one kind of processor or cores. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

OpenCL

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

Central processing unit

A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.

As many parallel applications involve multidimensional data, it is convenient to organize thread blocks into 1D, 2D or 3D arrays of threads. The blocks in a grid must be able to execute independently, as communication or cooperation between blocks in a grid is not possible. When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified; this, in turn, defines the total number of CUDA threads launched. [2] The maximum x, y and z dimensions of a block are 512, 512 and 64, and the block should be shaped such that x × y × z ≤ 512, which is the maximum number of threads per block. [2] Blocks can be organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension. [2] The limitation on the number of threads in a block is actually imposed because the number of registers that can be allocated across all threads is limited. [2]
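
A launch configuration within these limits can be expressed with the dim3 type; the sketch below uses illustrative kernel and variable names.

// Sketch: specifying block and grid dimensions at launch time, within the
// limits quoted above.
__global__ void touchKernel(float* data) { /* per-thread work would go here */ }

void launchExample(float* d_data)
{
    dim3 block(16, 16, 2);                  // 16 * 16 * 2 = 512 threads per block (x * y * z <= 512)
    dim3 grid(64, 64);                      // a 64 x 64 grid of blocks, far below 65,535 per dimension
    touchKernel<<<grid, block>>>(d_data);   // launches 64 * 64 * 512 CUDA threads in total
}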

Indexing

1D-Indexing

Every thread in CUDA is associated with a particular index so that it can calculate and access memory locations in an array.

Consider an example in which there is an array of 512 elements. One possible organization is a grid with a single block of 512 threads. Suppose that an array C of 512 elements is formed by the element-wise multiplication of two arrays A and B, each of 512 elements. Every thread has an index i; it multiplies the ith elements of A and B and then stores the result in the ith element of C. The index i is calculated using blockIdx (which is 0 in this case, as there is only one block), blockDim (512 in this case, as the block has 512 threads) and threadIdx, which varies from 0 to 511 within the block.

Figure: Thread hierarchy in CUDA programming.

The thread index i is calculated by the following formula:

i = blockIdx.x * blockDim.x + threadIdx.x

where

blockIdx.x is the block identifier along the x dimension,

blockDim.x is the number of threads per block along the x dimension, and

threadIdx.x is the thread identifier within the block along the x dimension.

Thus i will have values ranging from 0 to 511, covering the entire array.

If we want to perform computations on an array larger than 512 elements, we can use multiple blocks with 512 threads each. Consider an example with 1024 array elements. In this case we have 2 thread blocks with 512 threads each. The thread identifiers thus vary from 0 to 511, the block identifier varies from 0 to 1, and the block dimension is 512. The first block therefore gets index values from 0 to 511, and the second block gets index values from 512 to 1023.

Each thread thus first calculates the index of the memory it has to access and then proceeds with the calculation. Consider an example in which elements from arrays A and B are added in parallel using threads and the result is stored in an array C. The corresponding per-thread code is shown below: [4]

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (index < n) {
        C[index] = A[index] + B[index];                  // each thread adds one element
    }
}
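
On the host side, this kernel could be launched for the 1024-element case described above roughly as follows (a sketch with allocation, copies and error handling omitted; d_A, d_B and d_C are assumed to be device arrays of 1024 floats):

// Sketch: launching vecAddKernel with 2 blocks of 512 threads for n = 1024.
void launchVecAdd(float* d_A, float* d_B, float* d_C)
{
    int n = 1024;
    int threadsPerBlock = 512;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;    // = 2 blocks
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
    cudaDeviceSynchronize();                                            // wait for all 1024 threads to finish
}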

2D-Indexing

Similarly, in more complex grids, each thread needs to calculate its blockId and threadId according to the geometry of the grid. Consider a two-dimensional grid with two-dimensional blocks. The threadId and the blockId are calculated by the following formulae: [5]

blockId = blockIdx.x + blockIdx.y * gridDim.x

threadId = blockId * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x
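
In device code these formulae are typically used as in the following sketch; the kernel and array names are illustrative:

// Sketch: computing a global thread index inside a 2-D grid of 2-D blocks,
// following the formulae above.
__global__ void copyKernel2D(const float* in, float* out, int n)
{
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x;        // blockId formula from above
    int threadId = blockId * (blockDim.x * blockDim.y)
                 + threadIdx.y * blockDim.x + threadIdx.x;      // threadId formula from above
    if (threadId < n)
        out[threadId] = in[threadId];                           // each thread copies one element
}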

Hardware perspective

Although we have stated the hierarchy of threads, we should note that threads, thread blocks and grids are essentially a programmer's perspective. To get a complete picture of a thread block, it is important to know it from a hardware perspective. The hardware groups threads that execute the same instruction into warps. Several warps constitute a thread block. Several thread blocks are assigned to a streaming multiprocessor (SM). Several SMs constitute the whole GPU unit, which executes the whole kernel grid.

Figure: A pictorial correlation of a programmer's perspective versus a hardware perspective of a thread block in a GPU.

Streaming multiprocessors

Each GPU architecture (for example, Kepler or Fermi) consists of several SMs, or streaming multiprocessors. These are general-purpose processors with a low clock rate target and a small cache. An SM's primary task is to execute several thread blocks in parallel. As soon as one of its thread blocks has completed execution, it takes up the next thread block in sequence. In general, SMs support instruction-level parallelism but not branch prediction. [7]

Figure: An illustration of a streaming multiprocessor and its resources.

To achieve this purpose, an SM contains the following: [7]

  1. L1 cache (for reducing memory access latency).
  2. Shared memory (for data shared between threads).
  3. Constant cache (for broadcasting reads from read-only memory).
  4. Texture cache (for aggregating bandwidth from texture memory).

The hardware schedules thread blocks to an SM. In general, an SM can handle multiple thread blocks at the same time; an SM may contain up to 8 thread blocks in total. A thread ID is assigned to each thread by its respective SM.
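
These per-device limits can be queried from the CUDA runtime; the sketch below prints a few of the relevant cudaDeviceProp fields (the reported values depend on the particular GPU):

// Sketch: querying the limits discussed above from the CUDA runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                          // properties of device 0
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:                 %d\n", prop.warpSize);
    return 0;
}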

Whenever an SM executes a thread block, all the threads inside the thread block are executed at the same time. Hence, to free the memory of a thread block inside the SM, it is critical that the entire set of threads in the block has concluded execution. Each thread block is divided into scheduling units known as warps. These are discussed in detail in the following section.

Figure: An illustration of the double warp scheduler implemented in the Fermi microarchitecture of Nvidia.

The SM's warp scheduler decides which warp is prioritized when instructions are issued. [10] Some of these warp-prioritization policies are discussed in the sections below.

Warps

On the hardware side, a thread block is composed of ‘warps’. A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM.
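
For a one-dimensional block, this mapping of threads onto warps can be made explicit in device code, as in the sketch below (array names are illustrative; warpSize is a built-in device variable, equal to 32 on current Nvidia GPUs):

// Sketch: how a thread's position within its block maps onto a warp.
__global__ void warpInfoKernel(int* warpIds, int* laneIds)
{
    int tid = threadIdx.x;
    warpIds[tid] = tid / warpSize;    // which warp of the block the thread belongs to
    laneIds[tid] = tid % warpSize;    // the thread's lane (position) within that warp
}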

Consider a warp of 32 threads executing an instruction. If one or both of its operands are not ready, a process called 'context switching' takes place, transferring control to another warp. [11] When an instruction has no data dependencies, that is, both of its operands are ready, the warp is considered ready for execution. If more than one warp is eligible for execution, the parent SM uses a warp-scheduling policy to decide which warp gets the next fetched instruction.

Different policies for scheduling warps that are eligible for execution are discussed below: [12]

  1. Round Robin (RR) - Instructions are fetched in a round-robin manner. RR keeps the SMs busy so that no clock cycles are wasted on memory latencies (a conceptual sketch of this selection appears after this list).
  2. Least Recently Fetched (LRF) - In this policy, the warp for which no instruction has been fetched for the longest time gets priority when the next instruction is fetched.
  3. Fair (FAIR) [12] - In this policy, the scheduler makes sure all warps are given a 'fair' opportunity in the number of instructions fetched for them; it fetches the next instruction for the warp that has had the fewest instructions fetched so far.
  4. Thread block-based CAWS [13] (criticality-aware warp scheduling) - The emphasis of this scheduling policy is on improving the execution time of the thread blocks. It allocates more time resources to the warp that will take the longest time to execute. By giving priority to the most critical warp, this policy allows thread blocks to finish faster, so that the resources become available sooner.
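
As a purely conceptual illustration of the round-robin policy in item 1, the sketch below simulates the selection step on the host; it is not how the hardware scheduler is implemented.

#include <vector>

// Conceptual sketch of round-robin warp selection: starting after the warp
// scheduled last, return the index of the next warp that is ready to issue,
// or -1 if no warp is ready in this cycle.
int selectWarpRoundRobin(const std::vector<bool>& ready, int lastScheduled)
{
    int n = static_cast<int>(ready.size());
    for (int offset = 1; offset <= n; ++offset) {
        int candidate = (lastScheduled + offset) % n;
        if (ready[candidate])
            return candidate;
    }
    return -1;
}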


Related Research Articles

Thread (computing)

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. In addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their own do not offer due to the specialization in each chip.

In computing, hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose CPU. Any transformation of data or routine that can be computed, can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can be computed faster in application-specific hardware designed or programmed to compute the operation than specified in software and performed on a general-purpose computer processor. Each approach has advantages and disadvantages. The implementation of computing tasks in hardware to decrease latency and increase throughput is known as hardware acceleration.

Automatic parallelization, also auto parallelization, autoparallelization, or parallelization, the last one of which implies automation when used in context, refers to converting sequential code into multi-threaded or vectorized code in order to utilize multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. The goal of automatic parallelization is to relieve programmers from the hectic and error-prone manual parallelization process. Though the quality of automatic parallelization has improved in the past several decades, fully automatic parallelization of sequential programs by compilers remains a grand challenge due to its need for complex program analysis and the unknown factors during compilation.

Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the floating point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

Data parallelism

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

Stencil code

Stencil codes are a class of iterative kernels which update array elements according to some fixed pattern, called a stencil. They are most commonly found in the codes of computer simulations, e.g. for computational fluid dynamics in the context of scientific and engineering applications. Other notable examples include solving partial differential equations, the Jacobi kernel, the Gauss–Seidel method, image processing and cellular automata. The regular structure of the arrays sets stencil codes apart from other modeling methods such as the Finite element method. Most finite difference codes which operate on regular grids can be formulated as stencil codes.

Manycore processors are specialist multi-core processors designed for a high degree of parallel processing, containing a large number of simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing. As of November 2018, the world's third fastest supercomputer, the Chinese Sunway TaihuLight, obtains its performance from 40,960 SW26010 manycore processors, each containing 256 cores.

Parallel Thread Execution is a pseudo-assembly language used in Nvidia's CUDA programming environment. The nvcc compiler translates code written in CUDA, a C++-like language, into PTX, and the graphics driver contains a compiler which translates the PTX into a binary code which can be run on the processing cores.

Graphics Core Next

Graphics Core Next (GCN) is the codename for both a series of microarchitectures as well as for an instruction set. GCN was developed by AMD for their GPUs as the successor to TeraScale microarchitecture/instruction set. The first product featuring GCN was launched in 2011.

C++ Accelerated Massive Parallelism is a native programming model that contains elements that span the C++ programming language and its runtime library. It provides an easy way to write programs that compile and execute on data-parallel hardware, such as graphics cards (GPUs).

Fermi is the codename for a GPU microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and GeForce 500 series. It was followed by Kepler, and used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series, in the latter two only in mobile GPUs. In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, as well as in Nvidia Tesla computing modules. All desktop Fermi GPUs were manufactured in 40 nm, mobile Fermi GPUs in 40 nm and 28 nm. Fermi is the oldest microarchitecture from Nvidia that received support for Microsoft's rendering API Direct3D 12 feature level 11.

Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012, as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler also found use in the GK20A, the GPU component of the Tegra K1 SoC, as well as in the Quadro Kxxx series, the Quadro NVS 510, and Nvidia Tesla computing modules. Kepler was followed by the Maxwell microarchitecture and used alongside Maxwell in the GeForce 700 series and GeForce 800M series.

Maxwell is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Kepler microarchitecture. The Maxwell architecture was introduced in later models of the GeForce 700 series and is also used in the GeForce 800M series, GeForce 900 series, and Quadro Mxxx series, all manufactured in 28 nm.

Pascal (microarchitecture)

Pascal is the codename for a GPU microarchitecture developed by Nvidia, as the successor to the Maxwell architecture. The architecture was first introduced in April 2016 with the release of the Tesla P100 (GP100) on April 5, 2016, and is primarily used in the GeForce 10 series, starting with the GeForce GTX 1080 and GTX 1070, which were released on May 17, 2016 and June 10, 2016 respectively. Pascal is manufactured using the 16nm FinFET process.

Single instruction, multiple thread (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading.

In computing, a compute kernel is a routine compiled for high throughput accelerators, separate from but used by a main program. They are sometimes called compute shaders, sharing execution units with vertex shaders and pixel shaders on GPUs, but are not limited to execution on one class of device, or graphics APIs.

References

  1. "CUDA Overview". cuda.ce.rit.edu. Retrieved 2016-09-21.
  2. 1 2 3 4 "CUDA Thread Model". www.olcf.ornl.gov. Retrieved 2016-09-21.
  3. "Thread Hierarchy in CUDA Programming" . Retrieved 2016-09-21.
  4. Kirk, David; Hwu, Wen-mei W (January 28, 2010). Programming Massively Parallel Processors: A Hands-on Approach.
  5. "Thread Indexing Cheatsheet" (PDF). Retrieved 2016-09-21.
  6. "Thread Optimizations (University of Mayland)" (PDF).
  7. 1 2 Wilt, Nicholas (2013). The CUDA Handbook: A Comprehensive Guide to GPU Programming.
  8. "Thread Optimizations (University of Mayland)" (PDF).
  9. "Thread Optimizations (University of Mayland)" (PDF).
  10. "GPU Computing with CUDA Lecture 2 - CUDA Memories" (PDF).
  11. "Memory Issues in CUDA and Execution Scheduling in CUDA" (PDF).
  12. 1 2 "Effect of Instruction Fetch and Memory Scheduling on GPU Performance" (PDF).
  13. "CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads" (PDF).