Mixed-precision arithmetic


Mixed-precision arithmetic is a form of floating-point arithmetic that uses numbers with varying widths in a single operation.


Overview

A common use of mixed-precision arithmetic is to operate on low-precision numbers with a small width and widen them to a larger, more accurate representation. For example, two half-precision or bfloat16 (16-bit) floating-point numbers may be multiplied together to produce a more accurate single-precision (32-bit) float. [1] In this way, mixed-precision arithmetic approximates arbitrary-precision arithmetic, albeit with only a small number of available precisions.
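As a small illustration of this widening, the following NumPy sketch (the specific values are arbitrary) multiplies two numbers stored in half precision, once after converting them to single precision and once entirely in half precision:

```python
import numpy as np

# Two values stored in half precision (16 bits each).
a = np.float16(0.1)   # stored as 0.0999755859375
b = np.float16(3.0)   # stored exactly

# Mixed precision: widen the operands and keep a 32-bit product.
product_fp32 = np.float32(a) * np.float32(b)   # 0.29992676, the exact product
                                               # of the two stored fp16 values

# Pure half precision: the product is rounded back to 16 bits.
product_fp16 = a * b                           # 0.2998, a further rounding step

print(product_fp32, product_fp16)
```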

Iterative algorithms (like gradient descent) are good candidates for mixed-precision arithmetic. When computing a square root iteratively, for example, a coarse initial guess can be made and refined over many iterations until rounding error means that the smallest representable adjustment to the guess is still too coarse to reach an acceptable answer. When this happens, the computation can switch to a more precise format, which allows smaller adjustments to be applied to the approximation.
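A minimal sketch of this idea, using Newton's method for the square root with NumPy's floating-point types; the function name, the ladder of formats, and the switching criterion (move to the next format once the current one can no longer express a smaller step) are illustrative assumptions:

```python
import numpy as np

def mixed_precision_sqrt(value, dtypes=(np.float16, np.float32, np.float64)):
    """Refine sqrt(value) in progressively wider formats (assumes value > 0)."""
    guess = dtypes[0](value)              # coarse initial guess in the narrowest format
    for dtype in dtypes:
        x = dtype(value)
        guess = dtype(guess)              # carry the coarse result into the wider format
        for _ in range(64):               # cap iterations in case rounding oscillates
            new_guess = dtype((guess + x / guess) / dtype(2))
            if new_guess == guess:        # this format cannot express a smaller step
                break
            guess = new_guess
    return guess

print(mixed_precision_sqrt(2.0))          # converges to double-precision sqrt(2) ≈ 1.4142135623730951
print(np.sqrt(np.float64(2.0)))           # reference value in double precision
```

Each format is only refined until its own rounding error dominates, so most iterations run in the cheaper, narrower formats.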

Supercomputers such as Summit use mixed-precision arithmetic to improve efficiency in memory use, processing time, and power consumption. [1] [2] [3]

Floating point format

A floating-point number is typically packed into a single bit-string as the sign bit, the exponent field, and the significand or mantissa, from left to right. For example, an IEEE 754 standard 32-bit float ("FP32", "float32", or "binary32") is packed as follows:

[Figure: bit layout of an IEEE 754 single-precision (binary32) float: 1 sign bit, 8 exponent bits, 23 significand bits]
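The packing can be inspected directly; the helper below is a hypothetical illustration that reinterprets the 32 bits of a binary32 value and splits them into the three fields:

```python
import struct

def unpack_float32(value):
    """Split a binary32 value into (sign, biased exponent, fraction bits)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]  # raw 32-bit pattern
    sign = bits >> 31                     # 1 sign bit
    exponent = (bits >> 23) & 0xFF        # 8 exponent bits, biased by 127
    significand = bits & 0x7FFFFF         # 23 fraction bits (implicit leading 1)
    return sign, exponent, significand

print(unpack_float32(0.15625))  # (0, 124, 2097152): 0.15625 = +1.25 * 2**(124 - 127)
```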

The IEEE 754 binary floating-point formats are:

Type                   | Sign bits | Exponent bits | Significand bits | Total bits | Exponent bias | Bits of precision | Decimal digits
Half (IEEE 754-2008)   | 1         | 5             | 10               | 16         | 15            | 11                | ~3.3
Single                 | 1         | 8             | 23               | 32         | 127           | 24                | ~7.2
Double                 | 1         | 11            | 52               | 64         | 1023          | 53                | ~15.9
x86 extended precision | 1         | 15            | 64               | 80         | 16383         | 64                | ~19.2
Quad                   | 1         | 15            | 112              | 128        | 16383         | 113               | ~34.0
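The precision columns of the table can be checked against NumPy's format metadata; the sketch below covers the three formats NumPy exposes portably (np.longdouble maps to x86 extended precision on many platforms, but this varies by system):

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    bits_precision = info.nmant + 1           # explicit fraction bits + implicit leading 1
    decimal_digits = bits_precision * np.log10(2)
    print(f"{dtype.__name__}: {bits_precision} bits ≈ {decimal_digits:.2f} decimal digits")

# float16: 11 bits ≈ 3.31 decimal digits
# float32: 24 bits ≈ 7.22 decimal digits
# float64: 53 bits ≈ 15.95 decimal digits
```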

Machine learning

Mixed-precision arithmetic is used in the field of machine learning, since gradient descent algorithms can use coarse and efficient half-precision floats for certain tasks, but are more accurate when they use more precise but slower single-precision floats. Some platforms, including Nvidia, Intel, and AMD CPUs and GPUs, provide mixed-precision arithmetic for this purpose, using coarse floats when possible and expanding them to higher precision when necessary. [1] [2] [4] [5]
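A small NumPy sketch of why the wider format matters in such workloads: summing many half-precision values in a half-precision accumulator stalls, while keeping the data in half precision and accumulating in single precision stays accurate (the values are arbitrary):

```python
import numpy as np

x = np.full(10_000, 0.1, dtype=np.float16)   # data stored compactly in fp16

# Pure half precision: the running sum stops growing once each addend is
# smaller than half a unit in the last place of the accumulator.
sum_fp16 = np.float16(0.0)
for v in x:
    sum_fp16 = sum_fp16 + v                  # rounded to fp16 after every addition

# Mixed precision: fp16 storage, fp32 accumulation.
sum_fp32 = np.float32(0.0)
for v in x:
    sum_fp32 = sum_fp32 + np.float32(v)

print(sum_fp16)   # 256.0, far from the true sum
print(sum_fp32)   # ≈ 999.76, limited mainly by rounding 0.1 to fp16
```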

Automatic mixed precision

PyTorch implements automatic mixed precision (AMP), which combines automatic casting between half and single precision ("autocasting") with gradient (loss) scaling. [6] [7]

PyTorch AMP adjusts the scale factor for loss scaling dynamically, with exponential growth and backoff: it periodically increases the scale factor, and whenever the gradients contain a NaN or infinity (indicating overflow), the weight update is skipped and the scale factor is decreased.
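A minimal sketch of an AMP training loop on synthetic data; the model, loss, and hyperparameters are illustrative, while torch.autocast and torch.cuda.amp.GradScaler are PyTorch's AMP interfaces:

```python
import torch

device = "cuda"  # AMP with float16 autocasting requires a CUDA device
model = torch.nn.Linear(128, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()   # maintains the dynamic loss-scale factor

for step in range(100):
    inputs = torch.randn(64, 128, device=device)
    targets = torch.randn(64, 1, device=device)
    optimizer.zero_grad()
    # Autocast runs eligible ops (e.g. matrix multiplies) in float16 while
    # keeping precision-sensitive ops in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    # Scale the loss so small gradients do not underflow in float16.
    scaler.scale(loss).backward()
    # step() skips the parameter update if the gradients contain infs/NaNs;
    # update() then shrinks the scale factor, or grows it after a run of
    # successful steps.
    scaler.step(optimizer)
    scaler.update()
```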

Related Research Articles

<span class="mw-page-title-main">Floating-point arithmetic</span> Computer approximation for real numbers

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision: 12.345 = 12345 × 10^-3.

Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues, SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

Floating point operations per second is a measure of computer performance in computing, useful in fields of scientific computations that require floating-point calculations.

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

Archimedes' principle states that the upward buoyant force that is exerted on a body immersed in a fluid, whether fully or partially, is equal to the weight of the fluid that the body displaces. Archimedes' principle is a law of physics fundamental to fluid mechanics. It was formulated by Archimedes of Syracuse.

Machine epsilon or machine precision is an upper bound on the relative approximation error due to rounding in floating point number systems. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science. The quantity is also called macheps and is denoted by the Greek letter epsilon (ε).

In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations; they are used for special purposes such as computer graphics and machine learning.

<span class="mw-page-title-main">CUDA</span> Parallel computing platform and programming model

In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA API and its runtime: The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism in C and also to specify GPU device specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

<span class="mw-page-title-main">Tesla (microarchitecture)</span> GPU microarchitecture by Nvidia

Tesla is the codename for a GPU microarchitecture developed by Nvidia, and released in 2006, as the successor to Curie microarchitecture. It was named after the pioneering electrical engineer Nikola Tesla. As Nvidia's first microarchitecture to implement unified shaders, it was used with GeForce 8 series, GeForce 9 series, GeForce 100 series, GeForce 200 series, and GeForce 300 series of GPUs, collectively manufactured in 90 nm, 80 nm, 65 nm, 55 nm, and 40 nm. It was also in the GeForce 405 and in the Quadro FX, Quadro x000, Quadro NVS series, and Nvidia Tesla computing modules.

In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective, but the augmented Lagrangian method adds yet another term designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with, the method of Lagrange multipliers.

<span class="mw-page-title-main">Fermi (microarchitecture)</span> GPU microarchitecture by Nvidia

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and 500 series. All desktop Fermi GPUs were manufactured in 40 nm, mobile Fermi GPUs in 40 nm and 28 nm. Fermi is the oldest microarchitecture from Nvidia that receives support for Microsoft's rendering API Direct3D 12 feature_level 11.

<span class="mw-page-title-main">Pascal (microarchitecture)</span> GPU microarchitecture by Nvidia

Pascal is the codename for a GPU microarchitecture developed by Nvidia, as the successor to the Maxwell architecture. The architecture was first introduced in April 2016 with the release of the Tesla P100 (GP100) on April 5, 2016, and is primarily used in the GeForce 10 series, starting with the GeForce GTX 1080 and GTX 1070, which were released on May 27, 2016, and June 10, 2016, respectively. Pascal was manufactured using TSMC's 16 nm FinFET process, and later Samsung's 14 nm FinFET process.

<span class="mw-page-title-main">Volta (microarchitecture)</span> GPU microarchitecture by Nvidia

Volta is the codename, but not the trademark, for a GPU microarchitecture developed by Nvidia, succeeding Pascal. It was first announced on a roadmap in March 2013, although the first product was not announced until May 2017. The architecture is named after 18th–19th century Italian chemist and physicist Alessandro Volta. It was Nvidia's first chip to feature Tensor Cores, specially designed cores that have superior deep learning performance over regular CUDA cores. The architecture is produced with TSMC's 12 nm FinFET process. The Ampere microarchitecture is the successor to Volta.

Block floating point (BFP) is a method used to provide an arithmetic approaching floating point while using a fixed-point processor. BFP assigns a group of significands to a single exponent, rather than single significand being assigned its own exponent. BFP can be advantageous to limit space use in hardware to perform the same functions as floating-point algorithms, by reusing the exponent; some operations over multiple values between blocks can also be done with a reduced amount of computation.

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.

Ampere is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia as the successor to both the Volta and Turing architectures. It was officially announced on May 14, 2020 and is named after French mathematician and physicist André-Marie Ampère.

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using theoretical tools from kernel methods.

Tesla Dojo is a supercomputer designed and built by Tesla for computer vision video processing and recognition. It is used for training Tesla's machine learning models to improve its Full Self-Driving (FSD) advanced driver-assistance system. According to Tesla, it went into production in July 2023.

References

  1. 1 2 3 "Difference Between Single-, Double-, Multi-, Mixed-Precision". NVIDIA Blog. Retrieved 30 December 2020.
  2. 1 2 Abdelfattah, Ahmad; Anzt, Hartwig; Boman, Erik G.; Carson, Erin; Cojean, Terry; Dongarra, Jack; Gates, Mark; Grützmacher, Thomas; Higham, Nicholas J.; Li, Sherry; Lindquist, Neil; Liu, Yang; Loe, Jennifer; Luszczek, Piotr; Nayak, Pratik; Pranesh, Sri; Rajamanickam, Siva; Ribizel, Tobias; Smith, Barry; Swirydowicz, Kasia; Thomas, Stephen; Tomov, Stanimire; Tsai, Yaohung M.; Yamazaki, Ichitaro; Urike Meier Yang (2020). "A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic". arXiv: 2007.06674 [cs.MS].
  3. Holt, Kris. "The US again has the world's most powerful supercomputer". Engadget. Retrieved 20 July 2018.
  4. Micikevicius, Paulius; Narang, Sharan; Alben, Jonah; Diamos, Gregory; Elsen, Erich; Garcia, David; Ginsburg, Boris; Houston, Michael; Kuchaiev, Oleksii (2018-02-15), Mixed Precision Training, doi:10.48550/arXiv.1710.03740 , retrieved 2024-09-10
  5. "Mixed-Precision Training of Deep Neural Networks". NVIDIA Technical Blog. 2017-10-11. Retrieved 2024-09-10.
  6. "Mixed Precision — PyTorch Training Performance Guide". residentmario.github.io. Retrieved 2024-09-10.
  7. "What Every User Should Know About Mixed Precision Training in PyTorch". PyTorch. Retrieved 2024-09-10.