Parallel Thread Execution (PTX or NVPTX [1] ) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's Compute Unified Device Architecture (CUDA) programming environment. The Nvidia CUDA Compiler (NVCC) translates code written in CUDA, a C++-like language, into PTX instructions (an assembly language represented as American Standard Code for Information Interchange (ASCII) text), and the graphics driver contains a compiler which translates PTX instructions into executable binary code, [2] which can run on the processing cores of Nvidia graphics processing units (GPUs). The GNU Compiler Collection also has basic ability to generate PTX in the context of OpenMP offloading. [3] Inline PTX assembly can be used in CUDA. [4]
PTX uses an arbitrarily large processor register set; the output from the compiler is almost pure static single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form
.reg.u32%r<335>;// declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer
It is a three-argument assembly language, and almost all instructions explicitly list the data type (in sign and width) on which they operate. Register names are preceded with a % character and constants are literal, e.g.:
shr.u64%rd14,%rd12,32;// shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14cvt.u64.u32%rd142,%r112;// convert an unsigned 32-bit integer to 64-bit
There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is
@%p14bra$label;// branch to $label
The setp.cc.type
instruction sets a predicate register to the result of comparing two registers of appropriate type, there is also a set
instruction, where set.le.u32.u64%r101,%rd12,%rd28
sets the 32-bit register %r101
to 0xffffffff
if the 64-bit register %rd12
is less than or equal to the 64-bit register %rd28
. Otherwise %r101
is set to 0x00000000
.
There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid
, and %nctaid
contain, respectively, thread indices, block dimensions, block indices, and grid dimensions. [5]
Load (ld
) and store (st
) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param
. There are eight state spaces: [5]
.reg
.sreg
.const
.global
.local
.param
.shared
.tex
Shared memory is declared in the PTX file via lines at the start of the form:
.shared.align8.b8pbatch_cache[15744];// define 15,744 bytes, aligned to an 8-byte boundary
Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, typically more cumbersome than using the CUDA Runtime API and Nvidia's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though the GPU Ocelot is no longer actively maintained. [6]
A fat binary is a computer executable program or library which has been expanded with code native to multiple instruction sets which can consequently be run on multiple processor types. This results in a file larger than a normal one-architecture binary file, thus the name.
LLVM is a set of compiler and toolchain technologies that can be used to develop a frontend for any programming language and a backend for any instruction set architecture. LLVM is designed around a language-independent intermediate representation (IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes. The name LLVM originally stood for Low Level Virtual Machine, though the project has expanded and the name is no longer officially an initialism.
General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA API and its runtime: The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism in C and also to specify GPU device specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.
Intel oneAPI DPC++/C++ Compiler and Intel C++ Compiler Classic are Intel’s C, C++, SYCL, and Data Parallel C++ (DPC++) compilers for Intel processor-based systems, available for Windows, Linux, and macOS operating systems.
OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.
Nvidia OptiX is a ray tracing API that was first developed around 2009. The computations are offloaded to the GPUs through either the low-level or the high-level API introduced with CUDA. CUDA is only available for Nvidia's graphics products. Nvidia OptiX is part of Nvidia GameWorks. OptiX is a high-level, or "to-the-algorithm" API, meaning that it is designed to encapsulate the entire algorithm of which ray tracing is a part, not just the ray tracing itself. This is meant to allow the OptiX engine to execute the larger algorithm with great flexibility without application-side changes.
Graphics Core Next (GCN) is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was launched on January 9, 2012.
The GeForce 700 series is a series of graphics processing units developed by Nvidia. While mainly a refresh of the Kepler microarchitecture, some cards use Fermi (GF) and later cards use Maxwell (GM). GeForce 700 series cards were first released in 2013, starting with the release of the GeForce GTX Titan on February 19, 2013, followed by the GeForce GTX 780 on May 23, 2013. The first mobile GeForce 700 series chips were released in April 2013.
In computer architecture, 256-bit integers, memory addresses, or other data units are those that are 256 bits wide. Also, 256-bit central processing unit (CPU) and arithmetic logic unit (ALU) architectures are those that are based on registers, address buses, or data buses of that size. There are currently no mainstream general-purpose processors built to operate on 256-bit integers or addresses, though a number of processors do operate on 256-bit data.
Nvidia CUDA Compiler (NVCC) is a compiler by Nvidia intended for use with CUDA. It is proprietary software.
Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and 500 series. All desktop Fermi GPUs were manufactured in 40nm, mobile Fermi GPUs in 40nm and 28nm. Fermi is the oldest microarchitecture from Nvidia that receives support for Microsoft's rendering API Direct3D 12 feature_level 11.
Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012, as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler found use in the GK20A, the GPU component of the Tegra K1 SoC, and in the Quadro Kxxx series, the Quadro NVS 510, and Tesla computing modules.
Maxwell is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Kepler microarchitecture. The Maxwell architecture was introduced in later models of the GeForce 700 series and is also used in the GeForce 800M series, GeForce 900 series, and Quadro Mxxx series, as well as some Jetson products.
Pascal is the codename for a GPU microarchitecture developed by Nvidia, as the successor to the Maxwell architecture. The architecture was first introduced in April 2016 with the release of the Tesla P100 (GP100) on April 5, 2016, and is primarily used in the GeForce 10 series, starting with the GeForce GTX 1080 and GTX 1070, which were released on May 27, 2016, and June 10, 2016, respectively. Pascal was manufactured using TSMC's 16 nm FinFET process, and later Samsung's 14 nm FinFET process.
Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.
A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.
ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP, and OpenCL.
Zig is an imperative, general-purpose, statically typed, compiled system programming language designed by Andrew Kelley. It is intended as a successor to the language C, with the intent of being even smaller and simpler to program in, while offering more function. It is free and open-source software, released under an MIT License.
Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is parallel to Ada Lovelace. It is the latest generation of the line of products formerly branded as Nvidia Tesla and since rebranded as Nvidia Data Center GPUs.