Thread block (CUDA programming)


A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads of the same thread block run on the same streaming multiprocessor. [1] Threads in the same block can communicate with each other via shared memory, barrier synchronization, or other synchronization primitives such as atomic operations.
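As a minimal sketch of such intra-block cooperation (the kernel name blockReverse and the fixed block size of 256 are illustrative assumptions, not taken from the cited sources), threads of a block can stage data in shared memory, wait at a barrier, and then read values written by other threads of the same block:

// Sketch: each block reverses its own 256-element tile of the array.
// Assumes the array length is a multiple of the block size (256 here).
__global__ void blockReverse(const float *in, float *out)
{
    __shared__ float tile[256];                      // visible to all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                       // each thread stages one element
    __syncthreads();                                 // barrier: the whole block waits here
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read an element staged by another thread
}

Such a kernel would be launched with, for example, blockReverse<<<n / 256, 256>>>(d_in, d_out), assuming n is a multiple of 256 and the device arrays have already been allocated.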


Multiple blocks are combined to form a grid. All the blocks in the same grid contain the same number of threads. The number of threads in a block is limited, but grids can be used for computations that require a large number of thread blocks to operate in parallel and to use all available multiprocessors.

CUDA is a parallel computing platform and programming model that higher-level languages can use to exploit parallelism. In CUDA, the kernel is executed with the aid of threads. The thread is an abstract entity that represents the execution of the kernel. A kernel is a function compiled to run on the device (the GPU). Multithreaded applications use many such threads running at the same time to organize parallel computation. Every thread has an index, which is used for calculating memory address locations and also for making control decisions.

Dimensions

CUDA uses a heterogeneous programming model in which an application runs partly on the host and partly on the device. Its execution model is similar to that of OpenCL. In this model, execution begins on the host, which is usually a CPU. The device is a throughput-oriented processor, i.e., a GPU, which performs the parallel computations. Kernel functions are used to perform these parallel computations. Once a kernel function has finished executing, control passes back to the host, which resumes serial execution.

As many parallel applications involve multidimensional data, it is convenient to organize thread blocks as 1D, 2D or 3D arrays of threads. The blocks in a grid must be able to execute independently, as communication or cooperation between blocks in a grid is not possible. When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified; this, in turn, defines the total number of CUDA threads launched. [2] The maximum x, y and z dimensions of a block are 1024, 1024 and 64, and they must satisfy x × y × z ≤ 1024, the maximum number of threads per block. [3] Blocks can be organized into one-, two- or three-dimensional grids of up to 2³¹ − 1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively. [3] Unlike the maximum threads per block, there is no blocks-per-grid limit distinct from the maximum grid dimensions.
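A launch configuration within these limits can be expressed with dim3 values for the block and grid. The following host-side sketch is illustrative only (the kernel name myKernel, the 1024 × 1024 data set, and the 16 × 16 block shape are assumptions, not from the cited sources); the block contains 16 × 16 × 1 = 256 threads, well within the 1024-thread limit.

// Hypothetical kernel: sets every element of a 1024 x 1024 array to zero.
__global__ void myKernel(float *data)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < 1024 && row < 1024)
        data[row * 1024 + col] = 0.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * 1024 * sizeof(float));

    dim3 block(16, 16, 1);                        // 16 * 16 * 1 = 256 threads per block
    dim3 grid((1024 + block.x - 1) / block.x,     // enough blocks to cover 1024 columns
              (1024 + block.y - 1) / block.y,     // and 1024 rows
              1);
    myKernel<<<grid, block>>>(d_data);            // 64 x 64 blocks of 256 threads each
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}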

Indexing

1D-indexing

Every thread in CUDA is associated with a particular index so that it can calculate and access memory locations in an array.

Consider an example in which there is an array of 512 elements. One possible organization is a grid with a single block of 512 threads. Suppose an array C of 512 elements is formed by element-wise multiplication of two arrays A and B, each of 512 elements. Every thread has an index i; it multiplies the ith elements of A and B and stores the result in the ith element of C. The index i is calculated using blockIdx (0 in this case, as there is only one block), blockDim (512 in this case, as the block has 512 threads) and threadIdx, which varies from 0 to 511 within the block.

[Figure: Thread hierarchy in CUDA programming]

The thread index i is calculated by the following formula:

i = blockIdx.x * blockDim.x + threadIdx.x

where:

blockIdx.x is the x component of the block index,

blockDim.x is the x dimension of the block (the number of threads per block along x), and

threadIdx.x is the x component of the thread index within the block.

Thus i will have values ranging from 0 to 511, covering the entire array.

For an array larger than 1024 elements, we can use multiple blocks of 1024 threads each. Consider an example with 2048 array elements. In this case we have 2 thread blocks with 1024 threads each. The thread identifiers will vary from 0 to 1023, the block identifier will vary from 0 to 1, and the block dimension will be 1024. Thus the first block will get index values from 0 to 1023 and the second will get index values from 1024 to 2047.

Thus each thread first calculates the index of the memory it has to access and then proceeds with the calculation. Consider an example in which elements from arrays A and B are added in parallel using threads and the result is stored in an array C. The corresponding code in a thread is shown below: [5]

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        C[index] = A[index] + B[index];
    }
}
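A corresponding host-side launch might look like the following sketch (the wrapper name vecAdd is an assumption for illustration, not from the cited sources). The grid size is computed with a ceiling division so that every element is covered even when n is not a multiple of the block size; the guard if (index < n) in the kernel then discards the surplus threads of the last block.

// Host-side sketch: launch vecAddKernel over n elements with 1024-thread blocks.
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 1024;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

For n = 2048 and 1024 threads per block, this launches the two blocks described above.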

2D-indexing

Similarly, in more complex grids each thread needs to calculate its blockId and threadId from the geometry of the grid. Consider a two-dimensional grid with two-dimensional blocks. The blockId and the threadId are calculated by the following formulae:

blockId = blockIdx.x + blockIdx.y * gridDim.x

threadId = blockId * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x [6]
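In practice, 2D data is often indexed directly by row and column rather than by a single flattened threadId. The following kernel is an illustrative sketch (matAddKernel and its parameter names are assumptions, not from the cited sources) of element-wise matrix addition with a 2D grid of 2D blocks:

// Sketch: element-wise addition of two width x height matrices stored
// in row-major order, using a 2D grid of 2D blocks.
__global__ void matAddKernel(const float *A, const float *B, float *C,
                             int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y covers rows
    if (row < height && col < width) {
        int idx = row * width + col;                   // flatten the 2D index
        C[idx] = A[idx] + B[idx];
    }
}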

Hardware perspective

Although the hierarchy of threads has been described above, note that threads, thread blocks and grids are essentially a programmer's perspective. To get a complete picture of a thread block, it is important to know it from a hardware perspective. The hardware groups threads that execute the same instruction into warps. Several warps constitute a thread block. Several thread blocks are assigned to a streaming multiprocessor (SM). Several SMs constitute the whole GPU unit (which executes the whole kernel grid). [citation needed]

[Figure: A pictorial correlation of a programmer's perspective versus a hardware perspective of a thread block in a GPU]

Streaming multiprocessors

Each GPU architecture (for example Kepler or Fermi) consists of several streaming multiprocessors (SMs). These are general-purpose processors with a low clock rate target and a small cache. An SM is able to execute several thread blocks in parallel. As soon as one of its thread blocks has completed execution, it takes up the next thread block in sequence. In general, SMs support instruction-level parallelism but not branch prediction. [8]

[Figure: An illustration of a streaming multiprocessor and its resources]

To achieve this purpose, an SM contains the following: [8]

  1. L1 cache, for reducing memory access latency.
  2. Shared memory, for data shared between threads.
  3. Constant cache, for broadcasting reads from read-only memory.
  4. Texture cache, for aggregating bandwidth from texture memory.

The hardware schedules thread blocks to an SM. In general, an SM can handle multiple thread blocks at the same time; an SM may hold up to 8 thread blocks in total. A thread ID is assigned to a thread by its respective SM.

Whenever an SM executes a thread block, all the threads inside the thread block are executed at the same time. Hence, to free the memory of a thread block inside the SM, it is critical that the entire set of threads in the block has concluded execution. Each thread block is divided into scheduling units known as warps. These are discussed in detail in the following section.

[Figure: An illustration of the dual warp scheduler implemented in Nvidia's Fermi microarchitecture]

The warp scheduler of the SM decides which warp gets priority during instruction issue. [11] Some warp prioritization policies are discussed below.

Warps

On the hardware side, a thread block is composed of warps. (The term is borrowed from weaving. [12]) A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM. [13]
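Because the warp size is exposed in device code as the built-in variable warpSize (32 on current Nvidia GPUs), a thread can derive which warp of its block it belongs to and its lane within that warp from its thread index. The kernel below is a small sketch with assumed names (warpInfoKernel, warpOf, laneOf); for a 256-thread block, warpOf would hold values 0 through 7.

// Sketch: each thread records its warp number within the block and its lane
// within the warp, assuming a 1D block.
__global__ void warpInfoKernel(int *warpOf, int *laneOf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    warpOf[i] = threadIdx.x / warpSize;   // warp number within the block
    laneOf[i] = threadIdx.x % warpSize;   // lane (0..31) within the warp
}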

Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes. Thus a new block is not launched on an SM until there are sufficient free registers for all warps of the new block, and until there is enough free shared memory for the new block.

Consider a warp of 32 threads executing an instruction. If one or both of its operands are not ready (e.g. have not yet been fetched from global memory), a process called 'context switching' takes place which transfers control to another warp. [14] When switching away from a particular warp, all the data of that warp remains in the register file so that it can be quickly resumed when its operands become ready. When an instruction has no outstanding data dependencies, that is, when both of its operands are ready, the respective warp is considered ready for execution. If more than one warp is eligible for execution, the parent SM uses a warp scheduling policy to decide which warp gets the next fetched instruction.

Different policies for scheduling warps that are eligible for execution are discussed below: [15]

  1. Round Robin (RR) - Instructions are fetched in a round-robin manner. RR keeps the SMs busy so that no clock cycles are wasted on memory latencies.
  2. Least Recently Fetched (LRF) - In this policy, the warp for which no instruction has been fetched for the longest time gets priority in instruction fetching.
  3. Fair (FAIR) [15] - In this policy, the scheduler makes sure all warps are given a 'fair' opportunity in the number of instructions fetched for them. It fetches instructions for the warp that has had the fewest instructions fetched so far.
  4. Thread block-based CAWS [16] (criticality-aware warp scheduling) - The emphasis of this scheduling policy is on improving the execution time of the thread blocks. It allocates more scheduling resources to the warp that will take the longest time to execute. By giving priority to the most critical warp, this policy allows thread blocks to finish faster, so that the resources become available sooner.

Traditional CPU thread context "switching" requires saving and restoring allocated register values and the program counter to off-chip memory (or cache) and is therefore a much more heavyweight operation than with warp context switching. All of a warp's register values (including its program counter) remain in the register file, and the shared memory (and cache) remain in place too since these are shared between all the warps in the thread block.

In order to take advantage of the warp architecture, developers (and the compilers and languages they use) need to understand how to coalesce memory accesses and how to manage control flow divergence. If each thread in a warp takes a different execution path, or if each thread accesses significantly divergent memory, then the benefits of the warp architecture are lost and performance will significantly degrade.
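The contrast can be sketched as follows (both kernels are illustrative assumptions, not from the cited sources): in the first kernel, threads of the same warp take the same branch and touch consecutive addresses, so their accesses coalesce; in the second, odd and even lanes diverge and each thread strides through memory, so the warp serializes both its control flow and its memory traffic.

// Warp-friendly: threads in a warp follow the same path and touch
// consecutive elements, so their loads and stores coalesce.
__global__ void coalescedScale(float *data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // whole warps take the same branch (except at the tail)
        data[i] = s * data[i];      // lane k of a warp reads address base + k
}

// Warp-unfriendly sketch: odd and even lanes diverge, and each thread
// strides by 32 elements, so a warp touches 32 widely separated addresses.
__global__ void divergentScale(float *data, int n, float s)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (threadIdx.x % 2 == 0) {     // half of every warp idles while the other half runs
        if (i < n)
            data[i] = s * data[i];
    } else {
        if (i + 1 < n)
            data[i + 1] = s * data[i + 1] + 1.0f;
    }
}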


References

  1. "Chapter 4. Hardware Implementation: The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor".
  2. "CUDA Thread Model". www.olcf.ornl.gov. Archived from the original on 2016-09-23. Retrieved 2016-09-21.
  3. "CUDA Toolkit Documentation: Features and Technical Specifications". docs.nvidia.com. Retrieved 2022-05-24.
  4. "Thread Hierarchy in CUDA Programming". Retrieved 2016-09-21.
  5. Kirk, David; Hwu, Wen-mei W. (January 28, 2010). Programming Massively Parallel Processors: A Hands-on Approach.
  6. "Thread Indexing Cheatsheet" (PDF). Retrieved 2016-09-21.
  7. "Thread Optimizations (University of Maryland)" (PDF).
  8. Wilt, Nicholas (2013). The CUDA Handbook: A Comprehensive Guide to GPU Programming.
  9. "Thread Optimizations (University of Maryland)" (PDF).
  10. "Thread Optimizations (University of Maryland)" (PDF).
  11. "GPU Computing with CUDA Lecture 2 - CUDA Memories" (PDF).
  12. "Parallel Thread Execution ISA Version 6.0". Developer Zone: CUDA Toolkit Documentation. NVIDIA Corporation. 22 September 2017. Archived from the original on 28 October 2017. Retrieved 27 October 2017.
  13. "Using CUDA Warp-Level Primitives". Nvidia. 2018-01-15. Retrieved 2020-04-08. "NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion."
  14. "Memory Issues in CUDA and Execution Scheduling in CUDA" (PDF).
  15. "Effect of Instruction Fetch and Memory Scheduling on GPU Performance" (PDF).
  16. "CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads" (PDF).