Subnormal number

Last updated

This page describes subnormal numbers in general. For the details of IEEE subnormal and denormal numbers, see the IEEE section below.

Contents

An unaugmented floating-point system would contain only normalized numbers (indicated in red). Allowing denormalized numbers (blue) extends the system's range. Denormalized numbers on a line.svg
An unaugmented floating-point system would contain only normalized numbers (indicated in red). Allowing denormalized numbers (blue) extends the system's range.

In computer science, subnormal numbers are the subset of denormalized numbers (sometimes called denormals) that fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest positive normal number is subnormal, while denormal can also refer to numbers outside that range.

Terminology

In some older documents (especially standards documents such as the initial releases of IEEE 754 and the C language), "denormal" is used to refer exclusively to subnormal numbers. This usage persists in various standards documents, especially when discussing hardware that is incapable of representing any other denormalized numbers, but the discussion here uses the term "subnormal" in line with the 2008 revision of IEEE 754. In casual discussions the terms subnormal and denormal are often used interchangeably, in part because there are no denormalized IEEE binary numbers outside the subnormal range.

The term "number" is used rather loosely, to describe a particular sequence of digits, rather than a mathematical abstraction; see Floating Point for details of how real numbers relate to floating point representations. "Representation" rather than "number" may be used when clarity is required.

Definition

Mathematical real numbers may be approximated by multiple floating point representations. One representation is defined as normal, and others are defined as subnormal, denormal, or unnormal by their relationship to normal.

In a normal floating-point value, there are no leading zeros in the significand (also commonly called mantissa); rather, leading zeros are removed by adjusting the exponent (for example, the number 0.0123 would be written as 1.23 × 10−2). Conversely, a denormalized floating point value has a significand with a leading digit of zero. Of these, the subnormal numbers represent values which if normalized would have exponents below the smallest representable exponent (the exponent having a limited range).

The significand (or mantissa) of an IEEE floating-point number is the part of a floating-point number that represents the significant digits. For a positive normalised number it can be represented as m0.m1m2m3...mp−2mp−1 (where m represents a significant digit, and p is the precision) with non-zero m0. Notice that for a binary radix, the leading binary digit is always 1. In a subnormal number, since the exponent is the least that it can be, zero is the leading significant digit (0.m1m2m3...mp−2mp−1), allowing the representation of numbers closer to zero than the smallest normal number. A floating-point number may be recognized as subnormal whenever its exponent is the least value possible.

By filling the underflow gap like this, significant digits are lost, but not as abruptly as when using the flush to zero on underflow approach (discarding all significant digits when underflow is reached). Hence the production of a subnormal number is sometimes called gradual underflow because it allows a calculation to lose precision slowly when the result is small.

In IEEE 754-2008, denormal numbers are renamed subnormal numbers and are supported in both binary and decimal formats. In binary interchange formats, subnormal numbers are encoded with a biased exponent of 0, but are interpreted with the value of the smallest allowed exponent, which is one greater (i.e., as if it were encoded as a 1). In decimal interchange formats they require no special encoding because the format supports unnormalized numbers directly.

Mathematically speaking, the normalized floating-point numbers of a given sign are roughly logarithmically spaced, and as such any finite-sized normal float cannot include zero. The subnormal floats are a linearly spaced set of values, which span the gap between the negative and positive normal floats.

Background

Subnormal numbers provide the guarantee that addition and subtraction of floating-point numbers never underflows; two nearby floating-point numbers always have a representable non-zero difference. Without gradual underflow, the subtraction a  b can underflow and produce zero even though the values are not equal. This can, in turn, lead to division by zero errors that cannot occur when gradual underflow is used. [1]

Subnormal numbers were implemented in the Intel 8087 while the IEEE 754 standard was being written. They were by far the most controversial feature in the K-C-S format proposal that was eventually adopted, [2] but this implementation demonstrated that subnormal numbers could be supported in a practical implementation. Some implementations of floating-point units do not directly support subnormal numbers in hardware, but rather trap to some kind of software support. While this may be transparent to the user, it can result in calculations that produce or consume subnormal numbers being much slower than similar calculations on normal numbers.

IEEE

In IEEE binary floating point formats, subnormals are represented by having a zero exponent field with a non-zero significand field. [3]

No other denormalized numbers exist in the IEEE binary floating point formats, but they do exist in some other formats, including the IEEE decimal floating point formats.

Performance issues

Some systems handle subnormal values in hardware, in the same way as normal values. Others leave the handling of subnormal values to system software ("assist"), only handling normal values and zero in hardware. Handling subnormal values in software always leads to a significant decrease in performance. When subnormal values are entirely computed in hardware, implementation techniques exist to allow their processing at speeds comparable to normal numbers. [4] However, the speed of computation remains significantly reduced on many modern x86 processors; in extreme cases, instructions involving subnormal operands may take as many as 100 additional clock cycles, causing the fastest instructions to run as much as six times slower. [5] [6]

This speed difference can be a security risk. Researchers showed that it provides a timing side channel that allows a malicious web site to extract page content from another site inside a web browser. [7]

Some applications need to contain code to avoid subnormal numbers, either to maintain accuracy, or in order to avoid the performance penalty in some processors. For instance, in audio processing applications, subnormal values usually represent a signal so quiet that it is out of the human hearing range. Because of this, a common measure to avoid subnormals on processors where there would be a performance penalty is to cut the signal to zero once it reaches subnormal levels or mix in an extremely quiet noise signal. [8] Other methods of preventing subnormal numbers include adding a DC offset, quantizing numbers, adding a Nyquist signal, etc. [9] Since the SSE2 processor extension, Intel has provided such a functionality in CPU hardware, which rounds subnormal numbers to zero. [10]

Disabling subnormal floats at the code level

Intel SSE

Intel's C and Fortran compilers enable the DAZ (denormals-are-zero) and FTZ (flush-to-zero) flags for SSE by default for optimization levels higher than -O0. [11] The effect of DAZ is to treat subnormal input arguments to floating-point operations as zero, and the effect of FTZ is to return zero instead of a subnormal float for operations that would result in a subnormal float, even if the input arguments are not themselves subnormal. clang and gcc have varying default states depending on platform and optimization level.

A non-C99-compliant method of enabling the DAZ and FTZ flags on targets supporting SSE is given below, but is not widely supported. It is known to work on Mac OS X since at least 2006. [12]

#include<fenv.h>#pragma STDC FENV_ACCESS ON// Sets DAZ and FTZ, clobbering other CSR settings.// See https://opensource.apple.com/source/Libm/Libm-287.1/Source/Intel/, fenv.c and fenv.h.fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);// fesetenv(FE_DFL_ENV) // Disable both, clobbering other CSR settings.

For other x86-SSE platforms where the C library has not yet implemented this flag, the following may work: [13]

#include<xmmintrin.h>_mm_setcsr(_mm_getcsr()|0x0040);// DAZ_mm_setcsr(_mm_getcsr()|0x8000);// FTZ_mm_setcsr(_mm_getcsr()|0x8040);// Both_mm_setcsr(_mm_getcsr()&~0x8040);// Disable both

The _MM_SET_DENORMALS_ZERO_MODE and _MM_SET_FLUSH_ZERO_MODE macros wrap a more readable interface for the code above. [14]

// To enable DAZ#include<pmmintrin.h>_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);// To enable FTZ#include<xmmintrin.h>_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Most compilers will already provide the previous macro by default, otherwise the following code snippet can be used (the definition for FTZ is analogous):

#define _MM_DENORMALS_ZERO_MASK   0x0040#define _MM_DENORMALS_ZERO_ON     0x0040#define _MM_DENORMALS_ZERO_OFF    0x0000#define _MM_SET_DENORMALS_ZERO_MODE(mode) _mm_setcsr((_mm_getcsr() & ~_MM_DENORMALS_ZERO_MASK) | (mode))#define _MM_GET_DENORMALS_ZERO_MODE()                (_mm_getcsr() &  _MM_DENORMALS_ZERO_MASK)

The default denormalization behavior is mandated by the ABI, and therefore well-behaved software should save and restore the denormalization mode before returning to the caller or calling code in other libraries.

ARM

AArch32 NEON (SIMD) FPU always uses a flush-to-zero mode, which is the same as FTZ + DAZ. For the scalar FPU and in the AArch64 SIMD, the flush-to-zero behavior is optional and controlled by the FZ bit of the control register FPSCR in Arm32 and FPCR in AArch64. [15]

One way to do this can be:

#if defined(__arm64__) || defined(__aarch64__)uint64_tfpcr;asm("mrs %0,   fpcr":"=r"(fpcr));//Load the FPCR registerasm("msr fpcr, %0"::"r"(fpcr|(1<<24)));//Set the 24th bit (FTZ) to 1#endif

Some ARM processors have hardware handling of subnormals.

See also

Notes

    Related Research Articles

    <span class="mw-page-title-main">Floating-point arithmetic</span> Computer approximation for real numbers

    In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision:

    IEEE 754-1985 is a historic industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.

    A computer number format is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators. Numerical values are stored as groupings of bits, such as bytes and words. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the encoding used by the computer's instruction set generally requires conversion for external use, such as for printing and display. Different types of processors may have different internal representations of numerical values and different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.

    Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

    The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

    In computing, a normal number is a non-zero number in a floating-point representation which is within the balanced range supported by a given floating-point format: it is a floating point number that can be represented without leading zeros in its significand.

    The term arithmetic underflow is a condition in a computer program where the result of a calculation is a number of more precise absolute value than the computer can actually represent in memory on its central processing unit (CPU).

    In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

    Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

    Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.

    IEEE 754-2008 is a revision of the IEEE 754 standard for floating-point arithmetic. It was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 standard. The 2008 revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854 . In a few cases, where stricter definitions of binary floating-point arithmetic might be performance-incompatible with some existing implementation, they were made optional. In 2019, it was updated with a minor revision IEEE 754-2019.

    In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

    In computing, quadruple precision is a binary floating-point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.

    Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

    In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes (32 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. Like the binary16 format, it is intended for memory saving storage.

    In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

    decimal128 is a decimal floating-point computer number format that occupies 128 bits in computer memory. Formally introduced in IEEE 754-2008, it is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

    In computing, Microsoft Binary Format (MBF) is a format for floating-point numbers which was used in Microsoft's BASIC languages, including MBASIC, GW-BASIC and QuickBASIC prior to version 4.00.

    In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.

    The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.

    References

    1. William Kahan. "IEEE 754R meeting minutes, 2002". Archived from the original on 15 October 2016. Retrieved 29 December 2013.
    2. "An Interview with the Old Man of Floating-Point". University of California, Berkeley.
    3. "Denormalized numbers". Caldera International. Retrieved 11 October 2023. (Note that the XenuOS documentation uses denormal where IEEE 754 uses subnormal.)
    4. Schwarz, E.M.; Schmookler, M.; Son Dao Trong (July 2005). "FPU Implementations with Denormalized Numbers" (PDF). IEEE Transactions on Computers. 54 (7): 825–836. doi:10.1109/TC.2005.118. S2CID   26470540.
    5. Dooley, Isaac; Kale, Laxmikant (2006-09-12). "Quantifying the Interference Caused by Subnormal Floating-Point Values" (PDF). Retrieved 2010-11-30.
    6. Fog, Agner. "Instruction tables: Lists of instruction latencies, throughputs and microoperation breakdowns for Intel, AMD and VIA CPUs" (PDF). Retrieved 2011-01-25.
    7. Andrysco, Marc; Kohlbrenner, David; Mowery, Keaton; Jhala, Ranjit; Lerner, Sorin; Shacham, Hovav. "On Subnormal Floating Point and Abnormal Timing" (PDF). Retrieved 2015-10-05.
    8. Serris, John (2002-04-16). "Pentium 4 denormalization: CPU spikes in audio applications". Archived from the original on February 25, 2012. Retrieved 2015-04-29.
    9. de Soras, Laurent (2005-04-19). "Denormal numbers in floating point signal processing applications" (PDF).
    10. Casey, Shawn (2008-10-16). "x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ)" . Retrieved 2010-09-03.
    11. "Intel® MPI Library – Documentation". Intel.
    12. "Re: Macbook pro performance issue". Apple Inc. Archived from the original on 2016-08-26.
    13. "Re: Changing floating point state (Was: double vs float performance)". Apple Inc. Archived from the original on 2014-01-15. Retrieved 2013-01-24.
    14. "C++ Compiler for Linux* Systems User's Guide". Intel.
    15. "Aarch64 Registers". Arm.

    Further reading