IEEE 754


The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.


The standard defines:

IEEE 754-2008, published in August 2008, includes nearly all of the original IEEE 754-1985 standard, plus the IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic. The current version, IEEE 754-2019, was published in July 2019. [1] It is a minor revision of the previous version, incorporating mainly clarifications, defect fixes and new recommended operations.

History

The first standard for floating-point arithmetic, IEEE 754-1985, was published in 1985. It covered only binary floating-point arithmetic.

A new version, IEEE 754-2008, was published in August 2008, following a seven-year revision process, chaired by Dan Zuras and edited by Mike Cowlishaw. It replaced both IEEE 754-1985 (binary floating-point arithmetic) and IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic. The binary formats in the original standard are included in this new standard along with three new basic formats, one binary and two decimal. To conform to the current standard, an implementation must implement at least one of the basic formats as both an arithmetic format and an interchange format.

The international standard ISO/IEC/IEEE 60559:2011 (with content identical to IEEE 754-2008) has been approved for adoption through ISO/IEC JTC 1/SC 25 under the ISO/IEEE PSDO Agreement [2] [3] and published. [4]

The current version, IEEE 754-2019 published in July 2019, is derived from and replaces IEEE 754-2008, following a revision process started in September 2015, chaired by David G. Hough and edited by Mike Cowlishaw. It incorporates mainly clarifications (e.g. totalOrder) and defect fixes (e.g. minNum), but also includes some new recommended operations (e.g. augmentedAddition). [5] [6]

The international standard ISO/IEC 60559:2020 (with content identical to IEEE 754-2019) has been approved for adoption through ISO/IEC JTC 1/SC 25 and published. [7]

The next projected revision of the standard is in 2028. [8]

Formats

An IEEE 754 format is a "set of representations of numerical values and symbols". A format may also include how the set is encoded. [9]

A floating-point format is specified by

A format comprises

For example, if b = 10, p = 7, and emax = 96, then emin = −95, the significand satisfies 0 ≤ c ≤ 9999999, and the exponent satisfies −101 ≤ q ≤ 90. Consequently, the smallest non-zero positive number that can be represented is 1×10^−101, and the largest is 9999999×10^90 (9.999999×10^96), so the full range of numbers is −9.999999×10^96 through 9.999999×10^96. The numbers −b^(1−emax) and b^(1−emax) (here, −1×10^−95 and 1×10^−95) are the smallest (in magnitude) normal numbers; non-zero numbers between these smallest numbers are called subnormal numbers.
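The arithmetic in this example can be reproduced with a short Python sketch. The helper name `format_limits` and its dictionary keys are illustrative and not part of the standard; the sketch assumes the relations emin = 1 − emax and q = e − (p − 1) that the example relies on.

```python
from fractions import Fraction

def format_limits(b, p, emax):
    """Derive the limits of a format from its radix b, precision p and emax."""
    emin = 1 - emax              # assumed relation between emin and emax
    q_min = emin - (p - 1)       # smallest exponent of the integer significand c
    q_max = emax - (p - 1)       # largest exponent of the integer significand c
    c_max = b**p - 1             # largest integer significand
    return {
        "emin": emin, "q_min": q_min, "q_max": q_max, "c_max": c_max,
        "max_finite": c_max * Fraction(b) ** q_max,
        "min_normal": Fraction(b) ** emin,
        "min_subnormal": Fraction(b) ** q_min,
    }

d32 = format_limits(10, 7, 96)
print(d32["emin"], d32["q_min"], d32["q_max"], d32["c_max"])   # -95 -101 90 9999999
```

Exact rational arithmetic (`fractions.Fraction`) is used so that very small and very large limits are not themselves rounded by the host's floating-point type.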

Representation and encoding in memory

Some numbers may have several possible floating-point representations. For instance, if b = 10, and p = 7, then −12.345 can be represented by −12345×10^−3, −123450×10^−4, and −1234500×10^−5. However, for most operations, such as arithmetic operations, the result (value) does not depend on the representation of the inputs.

For the decimal formats, any representation is valid, and the set of these representations is called a cohort. When a result can have several representations, the standard specifies which member of the cohort is chosen.
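As an illustration of cohorts, Python's `decimal` module keeps the coefficient and exponent of a decimal value separately. This is only a sketch: the module implements a closely related decimal arithmetic specification rather than an IEEE 754 interchange format, so it is used here as an assumption-laden stand-in.

```python
from decimal import Decimal

a = Decimal("-12.345")      # coefficient 12345,   exponent -3
b = Decimal("-12.34500")    # coefficient 1234500, exponent -5

print(a == b)                                        # True: same value
print(a.as_tuple().exponent, b.as_tuple().exponent)  # -3 -5: different cohort members

# The result of an operation is one specific member of its cohort.
total = Decimal("1.20") + Decimal("1.30")
print(total)                                         # 2.50
```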

For the binary formats, the representation is made unique by choosing the smallest representable exponent allowing the value to be represented exactly. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers. For numbers with an exponent in the normal range (the exponent field being neither all ones nor all zeros), the leading bit of the significand will always be 1. Consequently, a leading 1 can be implied rather than explicitly present in the memory encoding, and under the standard the explicitly represented part of the significand will lie between 0 and 1. This rule is called leading bit convention, implicit bit convention, or hidden bit convention. This rule allows the binary format to have an extra bit of precision. The leading bit convention cannot be used for the subnormal numbers as they have an exponent outside the normal exponent range and scale by the smallest represented exponent as used for the smallest normal numbers.

Due to the possibility of multiple encodings (at least in formats called interchange formats), a NaN may carry other information: a sign bit (which has no meaning, but may be used by some operations) and a payload, which is intended for diagnostic information indicating the source of the NaN (but the payload may have other uses, such as NaN-boxing [10] [11] [12] ).

Basic and interchange formats

The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits). The binary32 and binary64 formats are the single and double formats of IEEE 754-1985 respectively. A conforming implementation must fully implement at least one of the basic formats.

The standard also defines interchange formats, which generalize these basic formats. [13] For the binary formats, the leading bit convention is required. The following table summarizes some of the possible interchange formats (including the basic formats).

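| Name | Common name | Radix | Significand digits [lower-alpha 3] | Decimal digits [lower-alpha 4] | Exponent min | Exponent max | MAXVAL [lower-alpha 2] | log10 MAXVAL [lower-alpha 2] | MINVAL>0 (normal) [lower-alpha 2] | MINVAL>0 (subnorm) [lower-alpha 2] | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| binary16 | Half precision | 2 | 11 | 3.31 | −14 | +15 | 65504 | 4.816 | 6.10·10^−5 | 5.96·10^−8 | Interchange |
| binary32 | Single precision | 2 | 24 | 7.22 | −126 | +127 | 3.40·10^38 | 38.532 | 1.18·10^−38 | 1.40·10^−45 | Basic |
| binary64 | Double precision | 2 | 53 | 15.95 | −1022 | +1023 | 1.80·10^308 | 308.255 | 2.23·10^−308 | 4.94·10^−324 | Basic |
| binary128 | Quadruple precision | 2 | 113 | 34.02 | −16382 | +16383 | 1.19·10^4932 | 4932.075 | 3.36·10^−4932 | 6.48·10^−4966 | Basic |
| binary256 | Octuple precision | 2 | 237 | 71.34 | −262142 | +262143 | 1.61·10^78913 | 78913.207 | 2.48·10^−78913 | 2.25·10^−78984 | Interchange |
| decimal32 | | 10 | 7 | 7 | −95 | +96 | 1.0·10^97 | 97 − 4.34·10^−8 | 1·10^−95 | 1·10^−101 | Interchange |
| decimal64 | | 10 | 16 | 16 | −383 | +384 | 1.0·10^385 | 385 − 4.34·10^−17 | 1·10^−383 | 1·10^−398 | Basic |
| decimal128 | | 10 | 34 | 34 | −6143 | +6144 | 1.0·10^6145 | 6145 − 4.34·10^−35 | 1·10^−6143 | 1·10^−6176 | Basic |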

In the table above, integer values are exact, whereas values in decimal notation (e.g. 1.0) are rounded values. The minimum exponents listed are for normal numbers; the special subnormal number representation allows even smaller (in magnitude) numbers to be represented with some loss of precision. For example, the smallest positive number that can be represented in binary64 is 2^−1074; contributions to the −1074 figure include the emin value −1022 and all but one of the 53 significand bits (2^(−1022 − (53 − 1)) = 2^−1074).

Decimal digits is the precision of the format expressed in terms of an equivalent number of decimal digits. It is computed as digits × log10(base). E.g. binary128 has approximately the same precision as a 34-digit decimal number.

log10 MAXVAL is a measure of the range of the encoding. Its integer part is the largest exponent shown on the output of a value in scientific notation with one leading digit in the significand before the decimal point (e.g. 3.403·10^38 is near the largest value in binary32, 9.999999·10^96 is the largest value in decimal32).

The binary32 (single) and binary64 (double) formats are two of the most common formats used today. The figure below shows the absolute precision for both formats over a range of values. This figure can be used to select an appropriate format given the expected value of a number and the required precision.

[Figure: precision of binary32 and binary64 over a range of values]

An example of a layout for 32-bit floating point is

[Figure: bit layout of a 32-bit floating-point number]

and the 64-bit layout is similar.
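The 32-bit layout can also be explored programmatically. The following Python sketch (the function name `decode_binary32` is an illustrative choice) assumes the usual binary32 field widths of 1 sign bit, 8 exponent bits with a bias of 127, and 23 trailing significand bits, and it handles only normal numbers when reconstructing the value.

```python
import struct

def decode_binary32(x):
    """Split a value, rounded to binary32, into sign, biased exponent and fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction

sign, e, f = decode_binary32(0.15625)
# 0.15625 = 1.25 * 2**-3, so the biased exponent is -3 + 127 = 124.
# The leading 1 of the significand is implicit and is added back here.
value = (-1) ** sign * (1 + f / 2**23) * 2.0 ** (e - 127)
print(sign, e, hex(f), value)   # 0 124 0x200000 0.15625
```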

Extended and extendable precision formats

The standard specifies optional extended and extendable precision formats, which provide greater precision than the basic formats. [14] An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (b, p, and emax). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.

The standard recommends that language standards provide a method of specifying p and emax for each supported base b. [15] The standard recommends that language standards and implementations support an extended format which has a greater precision than the largest basic format supported for each radix b. [16] For an extended format with a precision between two basic formats the exponent range must be as great as that of the next wider basic format. So for instance a 64-bit extended precision binary number must have an 'emax' of at least 16383. The x87 80-bit extended format meets this requirement.

The original IEEE 754-1985 standard also had the concept of extended formats, but without any mandatory relation between emin and emax. For example, the Motorola 68881 80-bit format, [17] where emin = − emax, was a conforming extended format, but it became non-conforming in the 2008 revision.

Interchange formats

Interchange formats are intended for the exchange of floating-point data using a bit string of fixed length for a given format.

Binary

For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥ 128 [lower-alpha 5] are defined. The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).

The encoding scheme for these binary interchange formats is the same as that of IEEE 754-1985: a sign bit, followed by w exponent bits that describe the exponent offset by a bias, and p − 1 bits that describe the significand. The width of the exponent field for a k-bit format is computed as w = round(4 log2(k)) − 13. The existing 64- and 128-bit formats follow this rule, but the 16- and 32-bit formats have more exponent bits (5 and 8 respectively) than this formula would provide (3 and 7 respectively).
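The exponent-width formula is easy to check numerically. This Python sketch uses a hypothetical helper `exponent_width`; the derived precision p = k − w follows from the layout of one sign bit, w exponent bits and p − 1 significand bits.

```python
import math

def exponent_width(k):
    """Exponent field width given by w = round(4 * log2(k)) - 13."""
    return round(4 * math.log2(k)) - 13

for k in (16, 32, 64, 128, 256):
    w = exponent_width(k)
    # 1 sign bit + w exponent bits + (p - 1) significand bits = k, so p = k - w.
    print(k, w, k - w)
```

For k = 64, 128 and 256 this gives w = 11, 15 and 19 (p = 53, 113 and 237), matching the table of interchange formats; for k = 16 and 32 it gives 3 and 7, which is why those two formats are treated as exceptions.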

As with IEEE 754-1985, the biased-exponent field is filled with all 1 bits to indicate either infinity (trailing significand field = 0) or a NaN (trailing significand field ≠ 0). For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively, [lower-alpha 6] and the payload is carried in the remaining bits.
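A minimal sketch of this classification for binary32 bit patterns follows. The function name `classify_binary32` is hypothetical, and the quiet/signaling test assumes the recommended convention that a set most significant trailing bit marks a quiet NaN.

```python
def classify_binary32(bits):
    """Classify a 32-bit pattern by its exponent and trailing significand fields."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF
    trailing = bits & 0x7FFFFF
    if exponent != 0xFF:
        return "finite"
    if trailing == 0:
        return sign + "infinity"
    # Assumption: most significant trailing bit 1 = quiet, 0 = signaling.
    quiet = trailing >> 22
    return "quiet NaN" if quiet else "signaling NaN"

print(classify_binary32(0x7F800000))   # +infinity
print(classify_binary32(0x7FC00000))   # quiet NaN
print(classify_binary32(0x7F800001))   # signaling NaN
```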

Decimal

For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined. As with binary interchange, the encoding scheme for the decimal interchange formats encodes the sign, exponent, and significand. Two different bit-level encodings are defined, and interchange is complicated by the fact that some external indicator of the encoding in use may be required.

The two options allow the significand to be encoded as a compressed sequence of decimal digits using densely packed decimal or, alternatively, as a binary integer. The former is more convenient for direct hardware implementation of the standard, while the latter is more suited to software emulation on a binary computer. In either case, the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and special values (±zero with the minimum exponent, ±infinity, quiet NaNs, and signaling NaNs) have identical encodings.

Rounding rules

The standard defines five rounding rules. The first two rules round to a nearest value; the others are called directed roundings:

Roundings to nearest

At the extremes, a value with a magnitude strictly less than k = b^emax × (b − ½ b^(1−p)) will be rounded to the minimum or maximum finite number (depending on the value's sign). Any numbers with exactly this magnitude are considered ties; this choice of tie may be conceptualized as the midpoint between ±b^emax × (b − b^(1−p)) and ±b^(emax+1), which, were the exponent not limited, would be the next representable floating-point numbers larger in magnitude. Numbers with a magnitude strictly larger than k are rounded to the corresponding infinity. [18]

"Round to nearest, ties to even" is the default for binary floating point and the recommended default for decimal. "Round to nearest, ties to away" is only required for decimal implementations. [19]

Directed roundings

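Example of rounding to integers using the IEEE 754 rules (column headings are the example values)

| Mode | +11.5 | +12.5 | −11.5 | −12.5 |
|---|---|---|---|---|
| to nearest, ties to even | +12.0 | +12.0 | −12.0 | −12.0 |
| to nearest, ties away from zero | +12.0 | +13.0 | −12.0 | −13.0 |
| toward 0 | +11.0 | +12.0 | −11.0 | −12.0 |
| toward +∞ | +12.0 | +13.0 | −11.0 | −12.0 |
| toward −∞ | +11.0 | +12.0 | −12.0 | −13.0 |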
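The rounding examples can be reproduced with Python's `decimal` module, whose rounding constants correspond to the five rules when rounding to an integer. This is an illustrative sketch; the mapping of names in `MODES` and the helper `round_to_integer` are assumptions made for this example, not terminology from the standard.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

MODES = {
    "to nearest, ties to even": ROUND_HALF_EVEN,
    "to nearest, ties away from zero": ROUND_HALF_UP,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}

def round_to_integer(text, mode):
    """Round the decimal string `text` to an integer using the given mode."""
    return Decimal(text).quantize(Decimal("1"), rounding=mode)

for name, mode in MODES.items():
    row = [str(round_to_integer(v, mode)) for v in ("11.5", "12.5", "-11.5", "-12.5")]
    print(name, row)
```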

Unless specified otherwise, the floating-point result of an operation is determined by applying the rounding function on the infinitely precise (mathematical) result. Such an operation is said to be correctly rounded. This requirement is called correct rounding. [20]

Required operations

Required operations for a supported arithmetic format (including the basic formats) include:

Comparison predicates

The standard provides comparison predicates to compare one floating-point datum to another in the supported arithmetic format. [32] Any comparison with a NaN is treated as unordered. −0 and +0 compare as equal.
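These two rules are visible in most languages whose floating-point type is binary64. A brief Python illustration, assuming the host platform follows IEEE 754 comparison semantics (as CPython does on common hardware):

```python
nan = float("nan")

# Every ordered comparison involving a NaN is false; only "not equal" is true.
print(nan < 1.0, nan > 1.0, nan == nan, nan != nan)   # False False False True

# The two zeros compare as equal even though their signs differ.
print(-0.0 == 0.0)                                    # True
```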

Total-ordering predicate

The standard provides a predicate totalOrder, which defines a total ordering on canonical members of the supported arithmetic format. [33] The predicate agrees with the comparison predicates (see section § Comparison predicates) when one floating-point number is less than the other. The main differences are: [34]

The totalOrder predicate does not impose a total ordering on all encodings in a format. In particular, it does not distinguish among different encodings of the same floating-point representation, as when one or both encodings are non-canonical. [33] IEEE 754-2019 incorporates clarifications of totalOrder.

For the binary interchange formats whose encoding follows the IEEE 754-2008 recommendation on placement of the NaN signaling bit, the comparison is identical to one that type puns the floating-point numbers to a sign–magnitude integer (assuming a payload ordering consistent with this comparison), an old trick for FP comparison without an FPU. [35]

Exception handling

The standard defines five exceptions, each of which returns a default value and has a corresponding status flag that is raised when the exception occurs. [lower-alpha 7] No other exception handling is required, but additional non-default alternatives are recommended (see § Alternate exception handling).

The five possible exceptions are

These are the same five exceptions as were defined in IEEE 754-1985, but the division by zero exception has been extended to operations other than the division.
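The model of default results plus status flags can be observed in Python's `decimal` module, which exposes flags and optional traps per context. This is a sketch of the general mechanism under that module's own rules, not a demonstration of binary floating-point hardware flags.

```python
from decimal import Decimal, DivisionByZero, Inexact, localcontext

with localcontext() as ctx:
    ctx.traps[DivisionByZero] = False   # return the default result instead of raising
    ctx.clear_flags()

    quotient = Decimal(1) / Decimal(0)  # default result: Infinity
    third = Decimal(1) / Decimal(3)     # not exactly representable

    div_by_zero_flag = bool(ctx.flags[DivisionByZero])
    inexact_flag = bool(ctx.flags[Inexact])

print(quotient, div_by_zero_flag, inexact_flag)   # Infinity True True
```

Re-enabling the trap (`ctx.traps[DivisionByZero] = True`) turns the same event into a Python exception, which is one form of the alternate handling that implementations may offer.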

Some decimal floating-point implementations define additional exceptions, [36] [37] which are not part of IEEE 754:

Additionally, operations like quantize when either operand is infinite, or when the result does not fit the destination format, will also signal invalid operation exception. [38]

Special values

Signed zero

In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). In most run-time environments, positive zero is usually printed as "0" and the negative zero as "-0". The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity, while 1/(+0) returns positive infinity (so that the identity 1/(1/±∞) = ±∞ is maintained). Other common functions with a discontinuity at x=0 which might treat +0 and −0 differently include log(x), signum(x), and the principal square root of y + xi for any negative number y. As with any approximation scheme, operations involving "negative zero" can occasionally cause confusion. For example, in IEEE 754, x = y does not always imply 1/x = 1/y, as 0 = −0 but 1/0 ≠ 1/(−0). [39]
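The sign of zero can be observed in Python through functions that are sensitive to it. Note that plain Python float division by zero raises `ZeroDivisionError` instead of returning an infinity, so this sketch uses `math.copysign`, `math.atan2` and `cmath.sqrt` to show the distinction.

```python
import cmath
import math

print(-0.0 == 0.0)                # True: the zeros compare equal
print(math.copysign(1.0, -0.0))   # -1.0: the sign is still observable

# Functions with a discontinuity at zero can tell the two zeros apart.
print(math.atan2(0.0, -1.0), math.atan2(-0.0, -1.0))   # pi and -pi

# Principal square root of y + xi for negative y, with x = +0 and x = -0.
print(cmath.sqrt(complex(-1.0, 0.0)), cmath.sqrt(complex(-1.0, -0.0)))
```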

Subnormal numbers

Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap. This is an improvement over the older practice to just have zero in the underflow gap, and where underflowing results were replaced by zero (flush to zero). [40]
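A small Python sketch of gradual underflow in binary64 follows. It assumes the platform does not run in a flush-to-zero mode, which is the usual situation for CPython.

```python
import sys

min_normal = sys.float_info.min   # 2**-1022, the smallest positive normal binary64
min_subnormal = 5e-324            # parses to 2**-1074, the smallest positive subnormal

a, b = 1.5 * min_normal, min_normal
print(a != b, a - b != 0.0)       # True True: the difference is subnormal, not zero

print(min_subnormal / 2)          # 0.0: halfway to zero, the tie rounds to even
```

With flush to zero, `a - b` would have been replaced by zero even though `a` and `b` differ, which is the situation subnormal numbers were introduced to avoid.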

Modern floating-point hardware usually handles subnormal values (as well as normal values), and does not require software emulation for subnormals.

Infinities

The infinities of the extended real number line can be represented in IEEE floating-point datatypes, just like ordinary floating-point values like 1, 1.5, etc. They are not error values in any way, though they are often (depending on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).

IEEE 754 requires infinities to be handled in a reasonable way, such as
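For illustration, some behaviours commonly expected of infinities can be checked in Python, whose `float` is binary64 on common platforms. The specific cases below are chosen for this sketch.

```python
import math

inf = math.inf
print(inf + 7.0 == inf)        # True
print(inf * -2.0 == -inf)      # True
print(1.0 / inf)               # 0.0
print(-inf < -1e308 < inf)     # True: infinities order like extended reals
print(math.isnan(inf - inf))   # True: an invalid operation yields NaN
print(math.isnan(inf * 0.0))   # True
```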

NaNs

IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). In general, NaNs will be propagated, i.e. most operations involving a NaN will result in a NaN, although functions that would give some defined result for any given floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default quiet NaNs and, optionally, signaling NaNs. A signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid operation" exception to be signaled.
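A short Python sketch of quiet NaN propagation and of functions that return a defined result regardless of a NaN argument. Python does not expose signaling NaN semantics directly, so only quiet NaNs are shown.

```python
import math

nan = math.nan

# Most operations involving a NaN produce a NaN.
print(math.isnan(nan + 1.0), math.isnan(nan * 0.0))   # True True

# A function whose result is the same for every value of an argument
# returns that result for NaN as well.
print(math.pow(nan, 0.0))           # 1.0
print(math.hypot(math.inf, nan))    # inf
```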

The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to flag uninitialized variables, or extend the floating-point numbers with other special values without slowing down the computations with ordinary values, although such extensions are not common.

Design rationale

William Kahan. A primary architect of the Intel 80x87 floating-point coprocessor and IEEE 754 floating-point standard.

It is a common misconception that the more esoteric features of the IEEE 754 standard discussed here, such as extended formats, NaN, infinities, subnormals etc., are only of interest to numerical analysts, or for advanced numerical applications. In fact the opposite is true: these features are designed to give safe robust defaults for numerically unsophisticated programmers, in addition to supporting sophisticated numerical libraries by experts. The key designer of IEEE 754, William Kahan, notes that it is incorrect to "... [deem] features of IEEE Standard 754 for Binary Floating-Point Arithmetic that ...[are] not appreciated to be features usable by none but numerical experts. The facts are quite the opposite. In 1977 those features were designed into the Intel 8087 to serve the widest possible market... Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers". [41]

A property of the single- and double-precision formats is that their encoding allows one to easily sort them without using floating-point hardware, as if the bits represented sign-magnitude integers, although it is unclear whether this was a design consideration (it seems noteworthy that the earlier IBM hexadecimal floating-point representation also had this property for normalized numbers). With the prevalent two's-complement representation, interpreting the bits as signed integers sorts the positives correctly, but with the negatives reversed; as one possible correction for that, with an xor to flip the sign bit for positive values and all bits for negative values, all the values become sortable as unsigned integers (with −0 < +0). [35]
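The xor correction described above can be sketched in Python for binary64. The function name `sortable_key` is illustrative, and NaNs are left out of the example because their position depends on sign and payload.

```python
import struct

def sortable_key(x):
    """Map a binary64 value to an unsigned integer whose order matches the value's."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:                        # negative: flip every bit
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits ^ 0x8000000000000000      # positive: flip only the sign bit

values = [3.5, -0.0, 0.0, -1e-320, float("inf"), -2.0, 1e-320, float("-inf")]
print(sorted(values, key=sortable_key))
# [-inf, -2.0, -1e-320, -0.0, 0.0, 1e-320, 3.5, inf]
```

Subnormal values (here ±1e-320) sort correctly too, and −0 is placed before +0, matching the ordering noted in the text.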

Recommendations

Alternate exception handling

The standard recommends optional exception handling in various forms, including presubstitution of user-defined default values, and traps (exceptions that change the flow of control in some way) and other exception handling models that interrupt the flow, such as try/catch. The traps and other exception mechanisms remain optional, as they were in IEEE 754-1985.

Clause 9 in the standard recommends additional mathematical operations [45] that language standards should define. [46] None are required in order to conform to the standard.

The following are recommended arithmetic operations, which must round correctly: [47]

The tanPi, aSinPi, and aCosPi functions were not part of the IEEE 754-2008 standard because they were deemed less necessary. [49] aSinPi and aCosPi were mentioned, but this was regarded as an error. [5] All three were added in the 2019 revision.

The recommended operations also include setting and accessing dynamic mode rounding direction, [50] and implementation-defined vector reduction operations such as sum, scaled product, and dot product, whose accuracy is unspecified by the standard. [51]

As of 2019, augmented arithmetic operations [52] for the binary formats are also recommended. These operations, specified for addition, subtraction and multiplication, produce a pair of values consisting of a result correctly rounded to nearest in the format and the error term, which is representable exactly in the format. At the time of publication of the standard, no hardware implementations were known, but very similar operations were already implemented in software using well-known algorithms. The history and motivation for their standardization are explained in a background document. [53] [54]
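One of the well-known software algorithms of this kind is Knuth's TwoSum, sketched below in Python for binary64 with the default rounding. It is shown as a related technique, not as the standard's augmentedAddition itself, which may differ in details such as how ties are broken and how exceptional cases are handled.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    return s, (a - a_virtual) + (b - b_virtual)

s, e = two_sum(1.0, 1e-17)
print(s, e)   # 1.0 1e-17: the part lost by the rounded sum is returned as e

# Check with exact rational arithmetic that nothing was lost.
print(Fraction(s) + Fraction(e) == Fraction(1.0) + Fraction(1e-17))   # True
```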

As of 2019, the formerly required minNum, maxNum, minNumMag, and maxNumMag in IEEE 754-2008 are now deprecated due to their non-associativity. Instead, two sets of new minimum and maximum operations are recommended. [55] The first set contains minimum, minimumNumber, maximum and maximumNumber. The second set contains minimumMagnitude, minimumMagnitudeNumber, maximumMagnitude and maximumMagnitudeNumber. The history and motivation for this change are explained in a background document. [56]

Expression evaluation

The standard recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result. By contrast, the previous 1985 version of the standard left aspects of the language interface unspecified, which led to inconsistent behavior between compilers, or different optimization levels in an optimizing compiler.

Programming languages should allow a user to specify a minimum precision for intermediate calculations of expressions for each radix. This is referred to as preferredWidth in the standard, and it should be possible to set this on a per-block basis. Intermediate calculations within expressions should be calculated, and any temporaries saved, using the maximum of the width of the operands and the preferred width if set. Thus, for instance, a compiler targeting x87 floating-point hardware should have a means of specifying that intermediate calculations must use the double-extended format. The stored value of a variable must always be used when evaluating subsequent expressions, rather than any precursor from before rounding and assigning to the variable.

Reproducibility

The IEEE 754-1985 version of the standard allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has reduced these allowances, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language) and describes what needs to be done to achieve reproducible results.

Character representation

The standard requires operations to convert between basic formats and external character sequence formats. [57] Conversions to and from a decimal character format are required for all formats. Conversion to an external character sequence must be such that conversion back using round to nearest, ties to even will recover the original number. There is no requirement to preserve the payload of a quiet NaN or signaling NaN, and conversion from the external character sequence may turn a signaling NaN into a quiet NaN.

The original binary value will be preserved by converting to decimal and back again using: [58]

For other binary formats, the required number of decimal digits is 1 + ⌈p × log10(2)⌉ [lower-alpha 8]

where p is the number of significant bits in the binary format, e.g. 237 bits for binary256.
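A commonly cited bound for the number of decimal digits needed to round-trip a binary format with p significant bits is 1 + ⌈p × log10(2)⌉. The Python sketch below evaluates that bound and checks the binary64 case empirically; the helper name `round_trip_digits` is illustrative.

```python
import math

def round_trip_digits(p):
    """Decimal digits sufficient to round-trip a binary format with p significant bits."""
    return 1 + math.ceil(p * math.log10(2))

print([round_trip_digits(p) for p in (11, 24, 53, 113, 237)])   # [5, 9, 17, 36, 73]

x = 0.1 + 0.2
print(float(format(x, ".17g")) == x)   # True: 17 digits recover this binary64 value
print(float(format(x, ".15g")) == x)   # False: 15 digits are too few for this x
```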

When using a decimal floating-point format, the decimal representation will be preserved using:

Algorithms, with code, for correctly rounded conversion from binary to decimal and decimal to binary are discussed by Gay, [59] and for testing by Paxson and Kahan. [60]

Hexadecimal literals

The standard recommends providing conversions to and from external hexadecimal-significand character sequences, based on C99's hexadecimal floating point literals. Such a literal consists of an optional sign (+ or -), the indicator "0x", a hexadecimal number with or without a period, an exponent indicator "p", and a decimal exponent with optional sign. The syntax is not case-sensitive. [61] The decimal exponent scales by powers of 2, so for example 0x0.1p-4 is 1/256. [62]
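Python exposes this notation through `float.fromhex` and `float.hex`, which makes the example easy to verify. This is a sketch using Python's own parser, which accepts the C99-style syntax described above.

```python
x = float.fromhex("0x0.1p-4")
print(x == 1 / 256)                   # True: (1/16) * 2**-4

print((1 / 256).hex())                # 0x1.0000000000000p-8

# The syntax is not case-sensitive, and the exponent may carry a sign.
print(float.fromhex("-0X1.8P+1"))     # -3.0
```

Because each hexadecimal digit corresponds to exactly four bits, such literals represent binary floating-point values exactly, with no decimal rounding involved.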


Notes

  1. For example, if the base is 10, the sign is 1 (indicating negative), the significand is 12345, and the exponent is −3, then the value of the number is (−1)^1 × 12345 × 10^−3 = −1 × 12345 × 0.001 = −12.345.
  2. Approximate values. For exact values see each format's individual Wikipedia entry.
  3. Number of digits in the radix used, including any implicit digit, but not counting the sign bit.
  4. Corresponding number of decimal digits, see text for more details.
  5. Contrary to decimal, there is no binary interchange format of 96-bit length. Such a format is still allowed as a non-interchange format, though.
  6. The standard recommends 0 for signaling NaNs, 1 for quiet NaNs, so that a signaling NaN can be quieted by changing only this bit to 1, while the reverse could yield the encoding of an infinity.
  7. No flag is raised in certain cases of underflow.
  8. As an implementation limit, correct rounding is only guaranteed for the number of decimal digits required plus 3 for the largest supported binary format. For instance, if binary32 is the largest supported binary format, then a conversion from a decimal external sequence with 12 decimal digits is guaranteed to be correctly rounded when converted to binary32; but conversion of a sequence of 13 decimal digits is not; however, the standard recommends that implementations impose no such limit.

Related Research Articles

<span class="mw-page-title-main">Floating-point arithmetic</span> Computer approximation for real numbers

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision:

IEEE 754-1985 is a historic industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.

In computing, NaN, standing for Not a Number, is a particular value of a numeric data type which is undefined as a number, such as the result of 0/0. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.

Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

The significand is the first (left) part of a number in scientific notation or related concepts in floating-point representation, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fractional number.

Hexadecimal floating point is a format for encoding floating-point numbers first introduced on the IBM System/360 computers, and supported on subsequent machines based on that architecture, as well as machines which were intended to be application-compatible with System/360.

Signed zero is zero with an associated sign. In ordinary arithmetic, the number 0 does not have a sign, so that −0, +0 and 0 are equivalent. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 and +0, regarded as equal by the numerical comparison operations but with possible different behaviors in particular operations. This occurs in the sign-magnitude and ones' complement signed number representations for integers, and in most floating-point number representations. The number 0 is usually encoded as +0, but can still be represented by +0, −0, or 0.

In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.

The IEEE 754-2008 standard includes decimal floating-point number formats in which the significand and the exponent can be encoded in two ways, referred to as binary encoding and decimal encoding.

IEEE 754-2008 is a revision of the IEEE 754 standard for floating-point arithmetic. It was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 standard. The 2008 revision extended the previous standard where necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard that had been left undefined, and merged in IEEE 854. In a few cases, where stricter definitions of binary floating-point arithmetic might be performance-incompatible with some existing implementation, they were made optional. In 2019, it was superseded by a minor revision, IEEE 754-2019.

In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.
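The storage trade-off can be sketched in Python, whose `struct` module supports the 16-bit IEEE 754 binary16 format through the `e` format character. The helper name `to_binary16_and_back` is hypothetical; it rounds a value to half precision and converts it back so that the loss of precision becomes visible.

```python
import struct


def to_binary16_and_back(x: float) -> float:
    """Round x to IEEE 754 binary16 and return the stored value as a float."""
    packed = struct.pack("<e", x)      # 2 bytes of storage
    return struct.unpack("<e", packed)[0]


print(len(struct.pack("<e", 1.0)))     # size in bytes
print(to_binary16_and_back(1.0))       # exactly representable
print(to_binary16_and_back(0.1))       # rounded to an 11-bit significand
print(to_binary16_and_back(2049.0))    # integers above 2048 are not all representable
```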

In computing, quadruple precision is a binary floating-point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.

Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
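As a rough sketch of how the 32 bits are laid out, the following Python code splits a binary32 value into its sign bit, 8-bit biased exponent and 23-bit fraction field. The function name `binary32_fields` is an illustrative assumption.

```python
import struct


def binary32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, fraction) of x stored as IEEE 754 binary32."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 bits, bias 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored significand bits
    return sign, exponent, fraction


print(binary32_fields(1.0))
print(binary32_fields(-2.5))
```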

In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes (32 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. Like the binary16 format, it is intended for memory-saving storage.

In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

decimal128 is a decimal floating-point computer number format that occupies 128 bits in computer memory. Formally introduced in IEEE 754-2008, it is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), intended to accelerate machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only 8 bits of precision rather than the 24-bit significand of the binary32 format. Even more so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.
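The relationship to binary32 can be sketched by keeping only the upper 16 bits of a single-precision value. This is a simplified illustration that truncates (rounds toward zero); actual hardware and libraries typically round to nearest instead, and the function name `bfloat16_truncate` is hypothetical.

```python
import struct


def bfloat16_truncate(x: float) -> float:
    """Approximate x as bfloat16 by truncating a binary32 value to its top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep the sign bit, the 8 exponent bits and the top 7 fraction bits.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(bfloat16_truncate(1.0))     # exactly representable
print(bfloat16_truncate(257.0))   # needs 9 significand bits, so precision is lost
print(bfloat16_truncate(1e38))    # the wide exponent range is preserved
```

The second call shows why the format is a poor fit for integer calculations: with 8 bits of precision, not every integer above 256 can be represented.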

References

  1. IEEE 754 2019
  2. Haasz, Jodi. "FW: ISO/IEC/IEEE 60559 (IEEE Std 754-2008)". grouper.ieee.org. Archived from the original on 2017-10-27. Retrieved 2018-04-04.
  3. "ISO/IEEE Partner Standards Development Organization (PSDO) Cooperation Agreement" (PDF). ISO. 2007-12-19. Retrieved 2021-12-27.
  4. ISO/IEC JTC 1/SC 25 2011.
  5. Cowlishaw, Mike (2013-11-13). "IEEE 754-2008 errata". speleotrove.com. Retrieved 2020-01-24.
  6. "ANSI/IEEE Std 754-2019". ucbtest.org. Retrieved 2024-01-16.
  7. ISO/IEC JTC 1/SC 25 2020.
  8. Riedy, E. Jason (2018-06-26), "Plans for IEEE Standard 754 – 2028" (PDF), 25th IEEE Symposium on Computer Arithmetic, Amherst, MA: IEEE
  9. IEEE 754 2008, §2.1.27.
  10. "SpiderMonkey Internals". developer.mozilla.org. Retrieved 2018-03-11.
  11. Klemens, Ben (September 2014). 21st Century C: C Tips from the New School. O'Reilly Media, Incorporated. p. 160. ISBN 9781491904442. Retrieved 2018-03-11.
  12. "zuiderkwast/nanbox: NaN-boxing in C". GitHub. Retrieved 2018-03-11.
  13. IEEE 754 2008, §3.6.
  14. IEEE 754 2008, §3.7.
  15. IEEE 754 2008, §3.7 states: "Language standards should define mechanisms supporting extendable precision for each supported radix."
  16. IEEE 754 2008, §3.7 states: "Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix."
  17. Motorola MC68000 Family (PDF). Programmer's Reference Manual. NXP Semiconductors. 1992. pp. 1–16, 1–18, 1–23.
  18. IEEE 754 2008, §4.3.1. "In the following two rounding-direction attributes, an infinitely precise result with magnitude at least $b^{emax}(b - \tfrac{1}{2}b^{1-p})$ shall round to $\infty$ with no change in sign."
  19. IEEE 754 2008, §4.3.3.
  20. IEEE 754 2019, §2.1.
  21. IEEE 754 2008, §5.3.1.
  22. IEEE 754 2008, §5.4.1.
  23. IEEE 754 2008, §5.4.2.
  24. IEEE 754 2008, §5.4.3.
  25. IEEE 754 2008, §5.3.2.
  26. IEEE 754 2008, §5.3.3.
  27. IEEE 754 2008, §5.5.1.
  28. IEEE 754 2008, §5.10.
  29. IEEE 754 2008, §5.11.
  30. IEEE 754 2008, §5.7.2.
  31. IEEE 754 2008, §5.7.4.
  32. IEEE 754 2019, §5.11.
  33. IEEE 754 2019, §5.10.
  34. "Implement total_cmp for f32, f64 by golddranks · Pull Request #72568 · rust-lang/rust". GitHub. contains relevant quotations from IEEE 754-2008 and -2019. Contains a type-pun implementation and explanation.
  35. Herf, Michael (December 2001). "radix tricks". stereopsis: graphics.
  36. "9.4. decimal — Decimal fixed point and floating point arithmetic — Python 3.6.5 documentation". docs.python.org. Retrieved 2018-04-04.
  37. "Decimal Arithmetic - Exceptional conditions". speleotrove.com. Retrieved 2018-04-04.
  38. IEEE 754 2008, §7.2(h).
  39. Goldberg 1991.
  40. Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1 ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN 978-0-8176-4704-9. LCCN 2009939668.
  41. Kahan, William Morton; Darcy, Joseph (2001) [1998-03-01]. "How Java's floating-point hurts everyone everywhere" (PDF). Archived (PDF) from the original on 2000-08-16. Retrieved 2003-09-05.
  42. Kahan, William Morton (1981-02-12). "Why do we need a floating-point arithmetic standard?" (PDF). p. 26. Archived (PDF) from the original on 2004-12-04.
  43. Severance, Charles (1998-02-20). "An Interview with the Old Man of Floating-Point".
  44. Kahan, William Morton (1996-06-11). "The Baleful Effect of Computer Benchmarks upon Applied Mathematics, Physics and Chemistry" (PDF). Archived (PDF) from the original on 2013-10-13.
  45. IEEE 754 2019, §9.2.
  46. IEEE 754 2008, Clause 9.
  47. IEEE 754 2019, §9.2.
  48. "Too much power - pow vs powr, powd, pown, rootn, compound". grouper.ieee.org. Retrieved 2024-01-16. Since growth rates can't be less than -1, such rates signal invalid exceptions.
  49. "Re: Missing functions tanPi, asinPi and acosPi". grouper.ieee.org. Archived from the original on 2017-07-06. Retrieved 2018-04-04.
  50. IEEE 754 2008, §9.3.
  51. IEEE 754 2008, §9.4.
  52. IEEE 754 2019, §9.5.
  53. Riedy, Jason; Demmel, James. "Augmented Arithmetic Operations Proposed for IEEE-754 2018" (PDF). 25th IEEE Symposium on Computer Arithmetic (ARITH 2018). pp. 49–56. Archived (PDF) from the original on 2019-07-23. Retrieved 2019-07-23.
  54. "ANSI/IEEE Std 754-2019 – Background Documents". grouper.ieee.org. Retrieved 2024-01-16.
  55. IEEE 754 2019, §9.6.
  56. Chen, David. "The Removal/Demotion of MinNum and MaxNum Operations from IEEE 754-2018" (PDF). grouper.ieee.org. Retrieved 2024-01-16.
  57. IEEE 754 2008, §5.12.
  58. IEEE 754 2008, §5.12.2.
  59. Gay, David M. (1990-11-30), Correctly rounded binary-decimal and decimal-binary conversions, Numerical Analysis Manuscript, Murray Hill, NJ, US: AT&T Laboratories, 90-10
  60. Paxson, Vern; Kahan, William (1991-05-22), A Program for Testing IEEE Decimal–Binary Conversion, Manuscript, CiteSeerX 10.1.1.144.5889
  61. IEEE 754 2008, §5.12.3.
  62. "6.9.3. Hexadecimal floating point literals — Glasgow Haskell Compiler 9.3.20220129 User's Guide". ghc.gitlab.haskell.org. Retrieved 2022-01-29.

Standards

Secondary references

Further reading