Arithmetic underflow

Last updated

The term arithmetic underflow (also floating point underflow, or just underflow) is a condition in a computer program where the result of a calculation is a number of more precise absolute value than the computer can actually represent in memory on its central processing unit (CPU).

Contents

Arithmetic underflow can occur when the true result of a floating point operation is smaller in magnitude (that is, closer to zero) than the smallest value representable as a normal floating point number in the target datatype. [1] Underflow can in part be regarded as negative overflow of the exponent of the floating point value. For example, if the exponent part can represent values from 128 to 127, then a result with a value less than 128 may cause underflow.

For integers, the term "integer underflow" typically refers to a special kind of integer overflow or integer wraparound condition whereby the result of subtraction would result in a value less than the minimum allowed for a given integer type, i.e. the ideal result was closer to negative infinity than the output type's representable value closest to negative infinity. [2] [3] [4] [5] [6]

Underflow gap

The interval between zero and fminN, where fminN is the smallest positive normal floating point value, is called the underflow gap. This is because the size of this interval is many orders of magnitude larger than the distance between adjacent normal floating point values just outside the gap. For instance, if the floating point datatype can represent 20 bits, the underflow gap is 221 times larger than the absolute distance between adjacent floating point values just outside the gap. [7]

In older designs, the underflow gap had just one usable value, zero. When an underflow occurred, the true result was replaced by zero (either directly by the hardware, or by system software handling the primary underflow condition). This replacement is called "flush to zero".

The 1984 edition of IEEE 754 introduced subnormal numbers. The subnormal numbers (including zero) fill the underflow gap with values where the absolute distance between adjacent values is the same as for adjacent values just outside the underflow gap. This enables "gradual underflow", where a nearest subnormal value is used, just as a nearest normal value is used when possible. Even when using gradual underflow, the nearest value may be zero. [8]

The absolute distance between adjacent floating point values just outside the gap is called the machine epsilon, typically characterized by the largest value whose sum with the value 1 will result in the answer with value 1 in that floating point scheme. [9] This can be written as , where is a function which converts the real value into the floating point representation. While the machine epsilon is not to be confused with the underflow level (assuming subnormal numbers), it is closely related. The machine epsilon is dependent on the number of bits which make up the significand, whereas the underflow level depends on the number of digits which make up the exponent field. In most floating point systems, the underflow level is smaller than the machine epsilon.

Handling of underflow

The occurrence of an underflow may set a ("sticky") status bit, raise an exception, at the hardware level generate an interrupt, or may cause some combination of these effects.

As specified in IEEE 754, the underflow condition is only signaled if there is also a loss of precision. Typically this is determined as the final result being inexact. However, if the user is trapping on underflow, this may happen regardless of consideration for loss of precision. The default handling in IEEE 754 for underflow (as well as other exceptions) is to record as a floating point status that underflow has occurred. This is specified for the application-programming level, but often also interpreted as how to handle it at the hardware level.

See also

Related Research Articles

<span class="mw-page-title-main">Floating-point arithmetic</span> Computer approximation for real numbers

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision:

IEEE 754-1985 is a historic industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.

In computing, NaN, standing for Not a Number, is a particular value of a numeric data type which is undefined as a number, such as the result of 0/0. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.

Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computer science, subnormal numbers are the subset of denormalized numbers that fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest positive normal number is subnormal, while denormal can also refer to numbers outside that range.

The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

<span class="mw-page-title-main">Integer overflow</span> Computer arithmetic error

In computer programming, an integer overflow occurs when an arithmetic operation attempts to create a numeric value that is outside of the range that can be represented with a given number of digits – either higher than the maximum or lower than the minimum representable value.

ISO/IEC 10967, Language independent arithmetic (LIA), is a series of standards on computer arithmetic. It is compatible with ISO/IEC/IEEE 60559:2011, more known as IEEE 754-2008, and much of the specifications are for IEEE 754 special values (though such values are not required by LIA itself, unless the parameter iec559 is true). It was developed by the working group ISO/IEC JTC1/SC22/WG11, which was disbanded in 2011.

Signed zero is zero with an associated sign. In ordinary arithmetic, the number 0 does not have a sign, so that −0, +0 and 0 are equivalent. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 and +0, regarded as equal by the numerical comparison operations but with possible different behaviors in particular operations. This occurs in the sign-magnitude and ones' complement signed number representations for integers, and in most floating-point number representations. The number 0 is usually encoded as +0, but can still be represented by +0, −0, or 0.

In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.

IEEE 754-2008 is a revision of the IEEE 754 standard for floating-point arithmetic. It was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 standard. The 2008 revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854 . In a few cases, where stricter definitions of binary floating-point arithmetic might be performance-incompatible with some existing implementation, they were made optional. In 2019, it was updated with a minor revision IEEE 754-2019.

In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

In computing, quadruple precision is a binary floating-point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.

Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.

In computing, tapered floating point (TFP) is a format similar to floating point, but with variable-sized entries for the significand and exponent instead of the fixed-length entries found in normal floating-point formats. In addition to this, tapered floating-point formats provide a fixed-size pointer entry indicating the number of digits in the exponent entry. The number of digits of the significand entry results from the difference of the fixed total length minus the length of the exponent and pointer entries.

References

  1. Coonen, Jerome T (1980). "An implementation guide to a proposed standard for floating-point arithmetic". Computer. 13 (1): 68–79. doi:10.1109/mc.1980.1653344. S2CID   206445847.
  2. "CWE - CWE-191: Integer Underflow (Wrap or Wraparound) (3.1)". cwe.mitre.org.
  3. "Overflow And Underflow of Data Types in Java - DZone Java". dzone.com.
  4. Mir, Tabish (4 April 2017). "Integer Overflow/Underflow and Floating Point Imprecision". medium.com.
  5. "Integer underflow and buffer overflow processing MP4 metadata in libstagefright". Mozilla.
  6. "Avoiding Buffer Overflows and Underflows". developer.apple.com.
  7. Sun Microsystems (2005). Numerical Computation Guide. Oracle. Retrieved 21 April 2018.
  8. Demmel, James (1984). "Underflow and the Reliability of Numerical Software". SIAM Journal on Scientific and Statistical Computing. 5 (4): 887–919. doi:10.1137/0905062.
  9. Heath, Michael T. (2002). Scientific Computing (Second ed.). New York: McGraw-Hill. p. 20. ISBN   0-07-239910-4.