Parallel Thread Execution

Last updated

Parallel Thread Execution (PTX, or NVPTX [1] ) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's CUDA programming environment. The nvcc compiler translates code written in CUDA, a C++-like language, into PTX instructions, and the graphics driver contains a compiler which translates the PTX instructions into a binary code [2] which can be run on the processing cores of Nvidia GPUs. The GNU Compiler Collection also has basic ability for PTX generation in the context of OpenMP offloading. [3] Inline PTX assembly can be used in CUDA. [4]



PTX uses an arbitrarily large register set; the output from the compiler is almost pure single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form

.reg.u32%r<335>;            // declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer

It is a three-argument assembly language, and almost all instructions explicitly list the data type (in terms of sign and width) on which they operate. Register names are preceded with a % character and constants are literal, e.g.:

shr.u64%rd14,%rd12,32;     // shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14cvt.u64.u32%rd142,%r112;    // convert an unsigned 32-bit integer to 64-bit

There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is

@%p14bra$label;             // branch to $label

The instruction sets a predicate register to the result of comparing two registers of appropriate type, there is also a set instruction, where set.le.u32.u64%r101,%rd12,%rd28 sets the 32-bit register %r101 to 0xffffffff if the 64-bit register %rd12 is less than or equal to the 64-bit register %rd28. Otherwise %r101 is set to 0x00000000.

There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid, and %nctaid contain, respectively, thread indices, block dimensions, block indices, and grid dimensions. [5]

State spaces

Load (ld) and store (st) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param. There are eight state spaces: [5]

Shared memory is declared in the PTX file via lines at the start of the form:

.shared.align8.b8pbatch_cache[15744]; // define 15,744 bytes, aligned to an 8-byte boundary

Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, typically more cumbersome than using the CUDA Runtime API and NVIDIA's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though the GPU Ocelot is no longer actively maintained. [6]

See also

Related Research Articles

MIPS is a reduced instruction set computer (RISC) instruction set architecture (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States.

In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit CPU and ALU architectures are those that are based on registers, address buses, or data buses of that size. 64-bit microcomputers are computers in which 64-bit microprocessors are the norm. From the software perspective, 64-bit computing means the use of code with 64-bit virtual memory addresses. However, not all 64-bit instruction sets support full 64-bit virtual memory addresses; x86-64 and ARMv8, for example, support only 48 bits of virtual address, with the remaining 16 bits of the virtual address required to be all 0's or all 1's, and several 64-bit instruction sets support fewer than 64 bits of physical memory address.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. In addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their own do not offer due to the specialization in each chip.

CUDA Parallel computing platform and programming model

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing – an approach termed GPGPU. The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

The Portland Group company that produced a set of commercially available Fortran, C and C++ compilers

PGI was a company that produced a set of commercially available Fortran, C and C++ compilers for high-performance computing systems. On July 29, 2013, NVIDIA Corporation acquired The Portland Group, Inc. As of August 5, 2020, the "PGI Compilers and Tools" technology is a part of the NVIDIA HPC SDK product available as a free download from NVIDIA.

OpenCL Open standard for programming heterogenous computing systems, such as CPUs or GPUs

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

OptiX ray tracing API

Nvidia OptiX is a ray tracing API. The computations are offloaded to the GPUs through either the low-level or the high-level API introduced with CUDA. CUDA is only available for Nvidia's graphics products. Nvidia OptiX is part of Nvidia GameWorks. OptiX is a high-level, or "to-the-algorithm" API, meaning that it is designed to encapsulate the entire algorithm of which ray tracing is a part, not just the ray tracing itself. This is meant to allow the OptiX engine to execute the larger algorithm with great flexibility without application-side changes.

In computer software and hardware, find first set (ffs) or find first one is a bit operation that, given an unsigned machine word, designates the index or position of the least significant bit set to one in the word counting from the least significant bit position. A nearly equivalent operation is count trailing zeros (ctz) or number of trailing zeros (ntz), which counts the number of zero bits following the least significant one bit. The complementary operation that finds the index or position of the most significant set bit is log base 2, so called because it computes the binary logarithm ⌊log2(x)⌋. This is closely related to count leading zeros (clz) or number of leading zeros (nlz), which counts the number of zero bits preceding the most significant one bit. There are two common variants of find first set, the POSIX definition which starts indexing of bits at 1, herein labelled ffs, and the variant which starts indexing of bits at zero, which is equivalent to ctz and so will be called by that name.

Graphics Core Next codename for a series of microarchitectures and an instruction set

Graphics Core Next (GCN) is the codename for both a series of microarchitectures as well as for an instruction set architecture. GCN was developed by AMD for their GPUs as the successor to TeraScale microarchitecture/instruction set. The first product featuring GCN was launched on January 9, 2012.

OpenACC is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems.

The GeForce 700 series is a series of graphics processing units developed by Nvidia. While mainly a refresh of the Kepler microarchitecture, some cards use Fermi (GF) and later cards use Maxwell (GM). GeForce 700 series cards were first released in 2013, starting with the release of the GeForce GTX Titan on February 19, 2013, followed by the GeForce GTX 780 on May 23, 2013. The first mobile GeForce 700 series chips were released in April 2013.

Nvidia CUDA Compiler (NVCC) is a proprietary compiler by Nvidia intended for use with CUDA. CUDA code runs on both the CPU and GPU. NVCC separates these two parts and sends host code to a C compiler like GCC or Intel C++ Compiler (ICC) or Microsoft Visual C Compiler, and sends the device code to the GPU. The device code is further compiled by NVCC. NVCC is based on LLVM. According to Nvidia provided documentation, nvcc in version 7.0 supports many language constructs that are defined by the C++11 standard and a few C99 features as well. In version 9.0 several more constructs from the C++14 standard are supported.

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and GeForce 500 series. It was followed by Kepler, and used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series, in the latter two only in mobile GPUs. In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, as well as in Nvidia Tesla computing modules. All desktop Fermi GPUs were manufactured in 40 nm, mobile Fermi GPUs in 40 nm and 28 nm. Fermi is the oldest microarchitecture from NVIDIA that received support for the Microsoft's rendering API Direct3D 12 feature_level 11.

The GeForce 800M Series is a family of graphics processing units by Nvidia for laptop PCs. It consists of rebrands of mobile versions of the GeForce 700 series and some newer chips that are lower end compared to the rebrands.

GeForce 900 series series of GPUs

The GeForce 900 series is a family of graphics processing units developed by Nvidia, succeeding the GeForce 700 series and serving as the high-end introduction to the Maxwell microarchitecture, named after James Clerk Maxwell. They are produced with TSMC's 28 nm process.

Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012, as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler also found use in the GK20A, the GPU component of the Tegra K1 SoC, as well as in the Quadro Kxxx series, the Quadro NVS 510, and Nvidia Tesla computing modules. Kepler was followed by the Maxwell microarchitecture and used alongside Maxwell in the GeForce 700 series and GeForce 800M series.

Maxwell is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Kepler microarchitecture. The Maxwell architecture was introduced in later models of the GeForce 700 series and is also used in the GeForce 800M series, GeForce 900 series, and Quadro Mxxx series, all manufactured with TSMC's 28 nm process.

Pascal (microarchitecture) GPU microarchitecture designed by Nvidia

Pascal is the codename for a GPU microarchitecture developed by Nvidia, as the successor to the Maxwell architecture. The architecture was first introduced in April 2016 with the release of the Tesla P100 (GP100) on April 5, 2016, and is primarily used in the GeForce 10 series, starting with the GeForce GTX 1080 and GTX 1070, which were released on May 17, 2016 and June 10, 2016 respectively. Pascal was manufactured using TSMC's 16 nm FinFET process, and later Samsung's 14 nm FinFET process.

RISC-V is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. Unlike most other ISA designs, the RISC-V ISA is provided under open source licenses that do not require fees to use. A number of companies are offering or have announced RISC-V hardware, open source operating systems with RISC-V support are available and the instruction set is supported in several popular software toolchains.

A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads varies with available shared memory. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of July 2019, with CUDA toolkit 10 and recent devices including Volta, blocks may contain up to 1024 threads. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.


  1. "User Guide for NVPTX Back-end — LLVM 7 documentation".
  2. "CUDA Binary Utilities". Retrieved 2019-10-19.
  3. "nvptx". GCC Wiki.
  4. "Inline PTX Assembly in CUDA". Retrieved 2019-11-03.
  5. 1 2 "PTX ISA Version 2.3" (PDF).
  6. "GPUOCelot: A dynamic compilation framework for PTX".