Parallel Thread Execution

Last updated January 17, 2024

Parallel Thread Execution (PTX or NVPTX^[1]) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's CUDA programming environment. The NVCC compiler translates code written in CUDA, a C++-like language, into PTX instructions (an assembly language represented as ASCII text), and the graphics driver contains a compiler which translates the PTX instructions into the executable binary code^[2] which can be run on the processing cores of Nvidia GPUs. The GNU Compiler Collection also has basic ability for PTX generation in the context of OpenMP offloading.^[3] Inline PTX assembly can be used in CUDA.^[4]

Registers

PTX uses an arbitrarily large register set; the output from the compiler is almost pure single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form

.reg.u32%r<335>;// declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer

It is a three-argument assembly language, and almost all instructions explicitly list the data type (in terms of sign and width) on which they operate. Register names are preceded with a % character and constants are literal, e.g.:

shr.u64%rd14,%rd12,32;// shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14cvt.u64.u32%rd142,%r112;// convert an unsigned 32-bit integer to 64-bit

There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is

@%p14bra$label;// branch to $label

The setp.cc.type instruction sets a predicate register to the result of comparing two registers of appropriate type, there is also a set instruction, where set.le.u32.u64%r101,%rd12,%rd28 sets the 32-bit register %r101 to 0xffffffff if the 64-bit register %rd12 is less than or equal to the 64-bit register %rd28. Otherwise %r101 is set to 0x00000000.

There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid, and %nctaid contain, respectively, thread indices, block dimensions, block indices, and grid dimensions.^[5]

State spaces

Load (ld) and store (st) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param. There are eight state spaces:^[5]

.reg: registers
.sreg: special, read-only, platform-specific registers
.const: shared, read-only memory
.global: global memory, shared by all threads
.local: local memory, private to each thread
.param: parameters passed to the kernel
.shared: memory shared between threads in a block
.tex: global texture memory (deprecated)

Shared memory is declared in the PTX file via lines at the start of the form:

.shared.align8.b8pbatch_cache[15744];// define 15,744 bytes, aligned to an 8-byte boundary

Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, typically more cumbersome than using the CUDA Runtime API and Nvidia's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though the GPU Ocelot is no longer actively maintained.^[6]

Related Research Articles

A fat binary is a computer executable program or library which has been expanded with code native to multiple instruction sets which can consequently be run on multiple processor types. This results in a file larger than a normal one-architecture binary file, thus the name.

LLVM is a set of compiler and toolchain technologies that can be used to develop a frontend for any programming language and a backend for any instruction set architecture. LLVM is designed around a language-independent intermediate representation (IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes. The name LLVM originally stood for Low Level Virtual Machine, though the project has expanded and the name is no longer officially an initialism.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

CUDA is a proprietary and closed source parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

Intel oneAPI DPC++/C++ Compiler and Intel C++ Compiler Classic are Intel’s C, C++, SYCL, and Data Parallel C++ (DPC++) compilers for Intel processor-based systems, available for Windows, Linux, and macOS operating systems.

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

Nvidia OptiX is a ray tracing API that was first developed around 2009. The computations are offloaded to the GPUs through either the low-level or the high-level API introduced with CUDA. CUDA is only available for Nvidia's graphics products. Nvidia OptiX is part of Nvidia GameWorks. OptiX is a high-level, or "to-the-algorithm" API, meaning that it is designed to encapsulate the entire algorithm of which ray tracing is a part, not just the ray tracing itself. This is meant to allow the OptiX engine to execute the larger algorithm with great flexibility without application-side changes.

Graphics Core Next (GCN) is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was launched on January 9, 2012.

<span class="mw-page-title-main">GeForce 700 series</span> Series of GPUs by Nvidia

The GeForce 700 series is a series of graphics processing units developed by Nvidia. While mainly a refresh of the Kepler microarchitecture, some cards use Fermi (GF) and later cards use Maxwell (GM). GeForce 700 series cards were first released in 2013, starting with the release of the GeForce GTX Titan on February 19, 2013, followed by the GeForce GTX 780 on May 23, 2013. The first mobile GeForce 700 series chips were released in April 2013.

In computer architecture, 256-bit integers, memory addresses, or other data units are those that are 256 bits wide. Also, 256-bit central processing unit (CPU) and arithmetic logic unit (ALU) architectures are those that are based on registers, address buses, or data buses of that size. There are currently no mainstream general-purpose processors built to operate on 256-bit integers or addresses, though a number of processors do operate on 256-bit data.

Nvidia CUDA Compiler (NVCC) is a proprietary compiler by Nvidia intended for use with CUDA. CUDA code runs on both the CPU and GPU. NVCC separates these two parts and sends host code to a C compiler like GCC or Intel C++ Compiler (ICC) or Microsoft Visual C++ Compiler, and sends the device code to the GPU. The device code is further compiled by NVCC. NVCC is based on LLVM. According to Nvidia provided documentation, nvcc in version 7.0 supports many language constructs that are defined by the C++11 standard and a few C99 features as well. In version 9.0 several more constructs from the C++14 standard are supported.

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and GeForce 500 series. It was followed by Kepler, and used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series, in the latter two only in mobile GPUs. In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, as well as in Nvidia Tesla computing modules. All desktop Fermi GPUs were manufactured in 40nm, mobile Fermi GPUs in 40nm and 28nm. Fermi is the oldest microarchitecture from NVIDIA that received support for Microsoft's rendering API Direct3D 12 feature_level 11.

Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012, as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler also found use in the GK20A, the GPU component of the Tegra K1 SoC, as well as in the Quadro Kxxx series, the Quadro NVS 510, and Nvidia Tesla computing modules. Kepler was followed by the Maxwell microarchitecture and used alongside Maxwell in the GeForce 700 series and GeForce 800M series.

<span class="mw-page-title-main">Maxwell (microarchitecture)</span> GPU microarchitecture by Nvidia

Maxwell is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Kepler microarchitecture. The Maxwell architecture was introduced in later models of the GeForce 700 series and is also used in the GeForce 800M series, GeForce 900 series, and Quadro Mxxx series, as well as some Jetson products, all manufactured with TSMC's 28 nm process.

<span class="mw-page-title-main">Pascal (microarchitecture)</span> GPU microarchitecture by Nvidia

Pascal is the codename for a GPU microarchitecture developed by Nvidia, as the successor to the Maxwell architecture. The architecture was first introduced in April 2016 with the release of the Tesla P100 (GP100) on April 5, 2016, and is primarily used in the GeForce 10 series, starting with the GeForce GTX 1080 and GTX 1070, which were released on May 17, 2016, and June 10, 2016, respectively. Pascal was manufactured using TSMC's 16 nm FinFET process, and later Samsung's 14 nm FinFET process.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators. It is a single-source embedded domain-specific language (eDSL) based on pure C++17. It is a standard developed by Khronos Group, announced in March 2014.

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP/Message Passing Interface (MPI), OpenCL.

Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is parallel to Ada Lovelace.

References

↑ "User Guide for NVPTX Back-end — LLVM 7 documentation". llvm.org.
↑ "CUDA Binary Utilities". docs.nvidia.com. Retrieved 2019-10-19.
↑ "nvptx". GCC Wiki.
↑ "Inline PTX Assembly in CUDA". docs.nvidia.com. Retrieved 2019-11-03.
1 2 "PTX ISA Version 2.3" (PDF).
↑ "GPUOCelot: A dynamic compilation framework for PTX". github.com. 7 November 2022.

External links

PTX ISA page on NVIDIA Developer Zone

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "User Guide for NVPTX Back-end — LLVM 7 documentation". llvm.org.

[2] "CUDA Binary Utilities". docs.nvidia.com. Retrieved 2019-10-19.

[3] "nvptx". GCC Wiki.

[4] "Inline PTX Assembly in CUDA". docs.nvidia.com. Retrieved 2019-11-03.

[ptx-isa-5] 1 2 "PTX ISA Version 2.3" (PDF).

[6] "GPUOCelot: A dynamic compilation framework for PTX". github.com. 7 November 2022.

[1]

[2]

[3]

[4]

[5]

[6]