IEEE 754-1985 [1] is a historic industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, which was in turn superseded in 2019 by the minor revision IEEE 754-2019. [2] During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.
IEEE 754-1985 represents numbers in binary, providing definitions for four levels of precision, of which the two most commonly used are:
Level | Width | Range at full precision | Precision |
---|---|---|---|
Single precision | 32 bits | ±1.18×10⁻³⁸ to ±3.4×10³⁸ | Approximately 7 decimal digits |
Double precision | 64 bits | ±2.23×10⁻³⁰⁸ to ±1.80×10³⁰⁸ | Approximately 16 decimal digits |
The standard also defines representations for positive and negative infinity, a "negative zero", five exceptions to handle invalid results like division by zero, special values called NaNs for representing those exceptions, denormal numbers to represent numbers smaller than shown above, and four rounding modes.
Floating-point numbers in IEEE 754 format consist of three fields: a sign bit, a biased exponent, and a fraction. The following example illustrates the meaning of each.
The decimal number 0.15625₁₀ represented in binary is 0.00101₂ (that is, 1/8 + 1/32). (Subscripts indicate the number base.) Analogous to scientific notation, where numbers are written to have a single non-zero digit to the left of the decimal point, we rewrite this number so it has a single 1 bit to the left of the "binary point". We simply multiply by the appropriate power of 2 to compensate for shifting the bits left by three positions:

0.00101₂ = 1.01₂ × 2⁻³

Now we can read off the fraction and the exponent: the fraction is .01₂ and the exponent is −3.
The three fields in the IEEE 754 representation of this number are:

- sign = 0, because the number is positive;
- biased exponent = −3 + the bias. In single precision the bias is 127, so the exponent field holds 124 (0111 1100₂);
- fraction = .01000…₂.
IEEE 754 adds a bias to the exponent so that numbers can in many cases be compared conveniently by the same hardware that compares signed 2's-complement integers. Using a biased exponent, the lesser of two positive floating-point numbers will come out "less than" the greater following the same ordering as for sign and magnitude integers. If two floating-point numbers have different signs, the sign-and-magnitude comparison also works with biased exponents. However, if both biased-exponent floating-point numbers are negative, then the ordering must be reversed. If the exponent were represented as, say, a 2's-complement number, comparison to see which of two numbers is greater would not be as convenient.
The leading 1 bit is omitted, since all numbers except zero start with a leading 1; the leading 1 is implicit and does not actually need to be stored, which gives an extra bit of precision for "free."
The number zero is represented specially: the sign bit is 0 for positive zero or 1 for negative zero, and both the biased exponent and the fraction are all 0 bits.
The number representations described above are called normalized, meaning that the implicit leading binary digit is a 1. To reduce the loss of precision when an underflow occurs, IEEE 754 includes the ability to represent fractions smaller than are possible in the normalized representation, by making the implicit leading digit a 0. Such numbers are called denormal. They don't include as many significant digits as a normalized number, but they enable a gradual loss of precision when the result of an operation is not exactly zero but is too close to zero to be represented by a normalized number.
A denormal number is represented with a biased exponent of all 0 bits, which represents an exponent of −126 in single precision (not −127), or −1022 in double precision (not −1023). [3] In contrast, the smallest biased exponent representing a normal number is 1 (see examples below).
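Gradual underflow can be observed by round-tripping small values through the 32-bit format. A minimal sketch using Python's standard struct module (the helper name to_float32 is ours):

```python
import struct

def to_float32(x: float) -> float:
    # Pack to 4 bytes as an IEEE single and unpack again, so the result
    # is x rounded to the nearest representable 32-bit value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(2.0 ** -126))  # 1.1754943508222875e-38, smallest normal
print(to_float32(2.0 ** -149))  # 1.401298464324817e-45, smallest denormal
print(to_float32(2.0 ** -150))  # 0.0: halfway between 0 and 2**-149, rounds to even (zero)
```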
The biased-exponent field is filled with all 1 bits to indicate either infinity or an invalid result of a computation.
Positive and negative infinity are represented thus: the sign bit is 0 for positive infinity and 1 for negative infinity, the biased exponent is all 1 bits, and the fraction is all 0 bits.
Some operations of floating-point arithmetic are invalid, such as taking the square root of a negative number. The act of reaching an invalid result is called a floating-point exception. An exceptional result is represented by a special code called a NaN, for "Not a Number". All NaNs in IEEE 754-1985 have this format: the sign bit is either 0 or 1, the biased exponent is all 1 bits, and the fraction is anything except all 0 bits (since all 0 bits would encode an infinity).
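These encodings can be built directly from their bit patterns; a small sketch with Python's struct module (the helper name from_bits is ours):

```python
import struct

def from_bits(bits: int) -> float:
    # Reinterpret a 32-bit pattern as an IEEE single-precision number.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(from_bits(0x7F800000))  # inf: exponent all 1s, fraction 0
print(from_bits(0xFF800000))  # -inf: same, with the sign bit set
nan = from_bits(0x7FC00000)   # exponent all 1s, non-zero fraction
print(nan, nan == nan)        # nan False: a NaN is not equal even to itself
```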
Precision is defined as the minimum difference between two successive mantissa representations; thus it is a function of the mantissa only, while the gap is defined as the difference between two successive numbers. [4]
Single-precision numbers occupy 32 bits. In single precision:

- The sign field is 1 bit.
- The biased-exponent field is 8 bits (bias 127).
- The fraction field is 23 bits.

A normalized single-precision number has the value (−1)^sign × 1.fraction₂ × 2^(exponent − 127).
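A minimal decoder for these three fields, applied to the 0.15625 example above (the helper name decode_float32 is ours; it assumes a normalized input):

```python
import struct

def decode_float32(x: float):
    # Split the 32-bit pattern into sign (1 bit), biased exponent (8 bits)
    # and fraction (23 bits), then rebuild the value of a normalized number.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    value = (-1.0) ** sign * (1 + fraction / 2**23) * 2.0 ** (biased_exp - 127)
    return sign, biased_exp, fraction, value

# 0.15625 = 1.01b * 2**-3: sign 0, biased exponent -3 + 127 = 124,
# fraction bits .01 followed by 21 zeros, i.e. 2**21 = 2097152.
print(decode_float32(0.15625))  # (0, 124, 2097152, 0.15625)
```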
Some example range and gap values for given exponents in single precision:
Actual Exponent (unbiased) | Exp (biased) | Minimum | Maximum | Gap |
---|---|---|---|---|
−1 | 126 | 0.5 | ≈ 0.999999940395 | ≈ 5.96046e-8 |
0 | 127 | 1 | ≈ 1.999999880791 | ≈ 1.19209e-7 |
1 | 128 | 2 | ≈ 3.999999761581 | ≈ 2.38419e-7 |
2 | 129 | 4 | ≈ 7.999999523163 | ≈ 4.76837e-7 |
10 | 137 | 1024 | ≈ 2047.999877930 | ≈ 1.22070e-4 |
11 | 138 | 2048 | ≈ 4095.999755859 | ≈ 2.44141e-4 |
23 | 150 | 8388608 | 16777215 | 1 |
24 | 151 | 16777216 | 33554430 | 2 |
127 | 254 | ≈ 1.70141e38 | ≈ 3.40282e38 | ≈ 2.02824e31 |
As an example, 16,777,217 (2²⁴ + 1) cannot be encoded as a 32-bit float, as it is rounded to 16,777,216; every integer of magnitude up to 2²⁴ is exactly representable, as is every power of 2 within the representable range.
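This rounding can be seen by round-tripping through the 32-bit format (as before, the to_float32 helper is ours):

```python
import struct

def to_float32(x: float) -> float:
    # Round x to the nearest representable 32-bit float.
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(16777216.0))  # 16777216.0: 2**24 is exactly representable
print(to_float32(16777217.0))  # 16777216.0: rounded; the gap at this magnitude is 2
print(to_float32(16777218.0))  # 16777218.0: exactly representable again
```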
Double-precision numbers occupy 64 bits. In double precision:

- The sign field is 1 bit.
- The biased-exponent field is 11 bits (bias 1023).
- The fraction field is 52 bits.

A normalized double-precision number has the value (−1)^sign × 1.fraction₂ × 2^(exponent − 1023).
Some example range and gap values for given exponents in double precision:
Actual Exponent (unbiased) | Exp (biased) | Minimum | Maximum | Gap |
---|---|---|---|---|
−1 | 1022 | 0.5 | ≈ 0.999999999999999888978 | ≈ 1.11022e-16 |
0 | 1023 | 1 | ≈ 1.999999999999999777955 | ≈ 2.22045e-16 |
1 | 1024 | 2 | ≈ 3.999999999999999555911 | ≈ 4.44089e-16 |
2 | 1025 | 4 | ≈ 7.999999999999999111822 | ≈ 8.88178e-16 |
10 | 1033 | 1024 | ≈ 2047.999999999999772626 | ≈ 2.27374e-13 |
11 | 1034 | 2048 | ≈ 4095.999999999999545253 | ≈ 4.54747e-13 |
52 | 1075 | 4503599627370496 | 9007199254740991 | 1 |
53 | 1076 | 9007199254740992 | 18014398509481982 | 2 |
1023 | 2046 | ≈ 8.98847e307 | ≈ 1.79769e308 | ≈ 1.99584e292 |
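The gap column above is the unit in the last place (ULP) at the start of each range; Python 3.9+ exposes it for doubles as math.ulp, which reproduces the table's values:

```python
import math  # math.ulp requires Python 3.9+

print(math.ulp(1.0))      # 2.220446049250313e-16 = 2**-52 (exponent 0 row)
print(math.ulp(2048.0))   # 4.547473508864641e-13 (exponent 11 row)
print(math.ulp(2.0**52))  # 1.0 (exponent 52 row)
```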
The standard also recommends extended format(s) to be used to perform internal computations at a higher precision than that required for the final result, to minimise round-off errors: the standard only specifies minimum precision and exponent requirements for such formats. The x87 80-bit extended format is the most commonly implemented extended format that meets these requirements.
Here are some examples of single-precision IEEE 754 representations:
Type | Sign | Actual Exponent | Exp (biased) | Exponent field | Fraction field | Value |
---|---|---|---|---|---|---|
Zero | 0 | −126 | 0 | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0.0 |
Negative zero | 1 | −126 | 0 | 0000 0000 | 000 0000 0000 0000 0000 0000 | −0.0 |
One | 0 | 0 | 127 | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1.0 |
Minus One | 1 | 0 | 127 | 0111 1111 | 000 0000 0000 0000 0000 0000 | −1.0 |
Smallest denormalized number | * | −126 | 0 | 0000 0000 | 000 0000 0000 0000 0000 0001 | ±2⁻²³ × 2⁻¹²⁶ = ±2⁻¹⁴⁹ ≈ ±1.4×10⁻⁴⁵ |
"Middle" denormalized number | * | −126 | 0 | 0000 0000 | 100 0000 0000 0000 0000 0000 | ±2⁻¹ × 2⁻¹²⁶ = ±2⁻¹²⁷ ≈ ±5.88×10⁻³⁹ |
Largest denormalized number | * | −126 | 0 | 0000 0000 | 111 1111 1111 1111 1111 1111 | ±(1−2⁻²³) × 2⁻¹²⁶ ≈ ±1.18×10⁻³⁸ |
Smallest normalized number | * | −126 | 1 | 0000 0001 | 000 0000 0000 0000 0000 0000 | ±2⁻¹²⁶ ≈ ±1.18×10⁻³⁸ |
Largest normalized number | * | 127 | 254 | 1111 1110 | 111 1111 1111 1111 1111 1111 | ±(2−2⁻²³) × 2¹²⁷ ≈ ±3.4×10³⁸ |
Positive infinity | 0 | 128 | 255 | 1111 1111 | 000 0000 0000 0000 0000 0000 | +∞ |
Negative infinity | 1 | 128 | 255 | 1111 1111 | 000 0000 0000 0000 0000 0000 | −∞ |
Not a number | * | 128 | 255 | 1111 1111 | non-zero | NaN |
* Sign bit can be either 0 or 1.
Every possible bit combination is either a NaN or a number with a unique value in the affinely extended real number system with its associated order, except for the two combinations of bits for negative zero and positive zero, which sometimes require special attention (see below). The binary representation has the special property that, excluding NaNs, any two numbers can be compared as sign and magnitude integers (endianness issues apply). When comparing as 2's-complement integers:

- If the sign bits differ, the negative number precedes the positive number, so 2's complement gives the correct result (except that negative zero and positive zero should be considered equal).
- If both values are positive, the 2's-complement comparison again gives the correct result.
- Otherwise (two negative numbers), the correct FP ordering is the opposite of the 2's-complement ordering.
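A minimal sketch of this property for single precision, assuming non-NaN inputs (the helper name float_key is ours): each bit pattern is mapped to an integer whose natural order matches the floating-point order.

```python
import struct

def float_key(x: float) -> int:
    # Reinterpret the float's bits as an unsigned 32-bit integer, then
    # flip negative patterns (reversing their order) and set the sign
    # bit on positive ones, so all negatives sort below all positives.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

values = [3.5, -1.0, 0.5, -2.25, 1.5e-38]
assert sorted(values) == sorted(values, key=float_key)
```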
Rounding errors inherent to floating-point calculations may limit the use of comparisons for checking the exact equality of results. Choosing an acceptable range is a complex topic. A common technique is to use a comparison epsilon value to perform approximate comparisons. [6] Depending on how lenient the comparisons are, common values include 1e-6 or 1e-5 for single precision, and 1e-14 for double precision. [7] [8] Another common technique is ULP, which checks what the difference is in the last place digits, effectively checking how many steps apart the two values are. [9]
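A minimal sketch of both techniques for doubles (the helper names are ours; the ULP distance assumes finite operands of the same sign):

```python
import struct

def almost_equal(a: float, b: float, eps: float = 1e-14) -> bool:
    # Absolute-epsilon comparison: simple, but the right eps depends on
    # the magnitudes involved.
    return abs(a - b) < eps

def ulp_distance(a: float, b: float) -> int:
    # How many representable doubles lie between a and b: reinterpret
    # the bit patterns as integers and subtract.
    ia = struct.unpack("<q", struct.pack("<d", a))[0]
    ib = struct.unpack("<q", struct.pack("<d", b))[0]
    return abs(ia - ib)

print(almost_equal(0.1 + 0.2, 0.3))  # True: they differ by about 5.6e-17
print(ulp_distance(0.1 + 0.2, 0.3))  # 1: the results are adjacent doubles
```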
Although negative zero and positive zero are generally considered equal for comparison purposes, some programming language relational operators and similar constructs treat them as distinct. According to the Java Language Specification, [10] comparison and equality operators treat them as equal, but Math.min() and Math.max() distinguish them (officially starting with Java version 1.1 but actually with 1.1.1), as do the comparison methods equals(), compareTo() and even compare() of classes Float and Double.
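Python behaves like Java's comparison operators here: the two zeros compare equal, although the sign bit remains observable.

```python
import math

neg, pos = -0.0, 0.0
print(neg == pos)                    # True: comparisons treat the zeros as equal
print(math.copysign(1.0, neg))       # -1.0: the sign bit is still there
print(min(pos, neg), min(neg, pos))  # 0.0 -0.0: unlike Java's Math.min, Python's
                                     # min simply returns the first of two
                                     # arguments that compare equal
```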
The IEEE standard has four different rounding modes; the first is the default, and the others are called directed roundings:

- Round to nearest: rounds to the nearest representable value; if the number falls midway, it is rounded to the nearest value with an even (zero) least significant bit.
- Round toward 0: directed rounding towards zero, i.e. truncation.
- Round toward +∞: directed rounding towards positive infinity.
- Round toward −∞: directed rounding towards negative infinity.
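The default mode, round to nearest with ties to even, can be seen when converting integers that fall exactly halfway between two representable doubles:

```python
# 2**53 + 1 is exactly halfway between the adjacent doubles 2**53 and
# 2**53 + 2; ties-to-even picks 2**53 (even last significand bit).
print(float(2**53 + 1) == float(2**53))      # True
# 2**53 + 3 is halfway between 2**53 + 2 and 2**53 + 4; this time
# ties-to-even rounds up.
print(float(2**53 + 3) == float(2**53 + 4))  # True
```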
The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, however, the projective mode was dropped. The Intel 8087 and Intel 80287 floating-point coprocessors both support this projective mode. [11] [12] [13]
The following functions must be provided:

- … NaN for any x (including NaN).
- copysign(x, y) returns x with the sign of y, so abs(x) equals copysign(x, 1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. The function copysign is new in the C99 standard.
- scalb(y, N) returns y × 2^N for integer N.
- logb(x) returns the exponent of x.
- finite(x), a predicate for "x is a finite value", equivalent to −Inf < x < Inf.
- isnan(x), a predicate for "x is a NaN", equivalent to "x ≠ x".
- x <> y (x .LG. y in Fortran), which turns out to have different behavior than NOT(x = y) (x .NE. y in Fortran, x != y in C) [14] due to NaN.
- unordered(x, y), true when "x is unordered with y", i.e., either x or y is a NaN.
- class(x), which returns the class of x.
- nextafter(x, y) returns the next representable value from x in the direction towards y.
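Most of these functions have direct counterparts in modern standard libraries. A rough correspondence using Python's math module (math.nextafter requires Python 3.9 or later; the mapping of names is ours):

```python
import math

x, y = -0.15625, 2.0

print(math.copysign(x, y))       # copysign(x, y): 0.15625, x with y's sign
print(math.ldexp(y, 10))         # scalb(y, N) = y * 2**N: 2048.0
m, e = math.frexp(x)             # frexp gives x = m * 2**e with 0.5 <= |m| < 1,
print(e - 1)                     # so logb(x) is e - 1: here -3
print(math.isfinite(x))          # finite(x): True
print(math.isnan(float("nan")))  # isnan(x): True
print(math.nextafter(1.0, 2.0))  # nextafter(x, y): 1.0000000000000002
nan = float("nan")
print(nan != nan)                # "x != x" is true only for NaN: True
```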
In 1976, Intel began development of a floating-point coprocessor. [15] [16] Intel hoped to be able to sell a chip containing good implementations of all the operations found in the widely varying maths software libraries. [15] [17]

John Palmer, who managed the project, believed the effort should be backed by a standard unifying floating-point operations across disparate processors. He contacted William Kahan of the University of California, Berkeley, who had helped improve the accuracy of Hewlett-Packard's calculators. Kahan suggested that Intel use the floating point of Digital Equipment Corporation's (DEC) VAX. The first VAX, the VAX-11/780, had just come out in late 1977, and its floating point was highly regarded. However, seeking to market their chip to the broadest possible market, Intel wanted the best floating point possible, and Kahan went on to draw up specifications. [15] Kahan initially recommended that the floating-point base be decimal, [18] [unreliable source?] but the hardware design of the coprocessor was too far along to make that change.
The work within Intel worried other vendors, who set up a standardization effort to ensure a "level playing field". Kahan attended the second IEEE 754 standards working group meeting, held in November 1977. He subsequently received permission from Intel to put forward a draft proposal based on his work for their coprocessor; he was allowed to explain details of the format and its rationale, but not anything related to Intel's implementation architecture. The draft was co-written with Jerome Coonen and Harold Stone, and was initially known as the "Kahan-Coonen-Stone proposal" or "K-C-S format". [15] [16] [17] [19]
As an 8-bit exponent was not wide enough for some operations desired for double-precision numbers, e.g. to store the product of two 32-bit numbers, [20] both Kahan's proposal and a counter-proposal by DEC used 11 bits, like the time-tested 60-bit floating-point format of the CDC 6600 from 1965. [16] [19] [21] Kahan's proposal also provided for infinities, which are useful when dealing with division-by-zero conditions; not-a-number values, which are useful when dealing with invalid operations; denormal numbers, which help mitigate problems caused by underflow; [19] [22] [23] and a better balanced exponent bias, which can help avoid overflow and underflow when taking the reciprocal of a number. [24] [25]
Even before it was approved, the draft standard had been implemented by a number of manufacturers. [26] [27] The Intel 8087, which was announced in 1980, was the first chip to implement the draft standard.
By 1980, the Intel 8087 chip had already been released, [28] but DEC remained opposed, to denormal numbers in particular, both because of performance concerns and because standardising on DEC's own format would have given DEC a competitive advantage.

The arguments over gradual underflow lasted until 1981, when an expert hired by DEC to assess it sided against the dissenters. DEC had the study done in order to demonstrate that gradual underflow was a bad idea, but the study concluded the opposite, and DEC gave in. In 1985, the standard was ratified, but it had already become the de facto standard a year earlier, implemented by many manufacturers. [16] [19] [5]