Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).
There is a long history of extended floating-point formats reaching back nearly to the middle of the last century. Various manufacturers have used different formats for extended precision for different machines. In many cases the format of the extended precision is not quite the same as a scale-up of the ordinary single- and double-precision formats it is meant to extend. In a few cases the implementation was merely a software-based change in the floating-point data format, but in most cases extended precision was implemented in hardware, either built into the central processor itself, or more often, built into the hardware of an optional, attached processor called a "floating-point unit" (FPU) or "floating-point processor" (FPP), accessible to the CPU as a fast input / output device.
The IBM 1130, sold in 1965, offered two floating-point formats: a 32-bit "standard precision" format and a 40-bit "extended precision" format. The standard precision format contains a 24-bit two's complement significand, while the extended precision format uses a 32-bit two's complement significand. The latter format makes full use of the CPU's 32-bit integer operations. The characteristic in both formats is an 8-bit field containing the power of two biased by 128. Floating-point arithmetic operations are performed by software, and double precision is not supported at all. The extended format occupies three 16-bit words, with the extra space simply ignored.
The IBM System/360 supports a 32-bit "short" floating-point format and a 64-bit "long" floating-point format. The 360/85 and follow-on System/370 add support for a 128-bit "extended" format. These formats are still supported in the current design, where they are now called the "hexadecimal floating-point" (HFP) formats.
The Microsoft BASIC port for the 6502 CPU, found in adaptations such as Commodore BASIC, AppleSoft BASIC, KIM-1 BASIC and MicroTAN BASIC, has supported an extended 40-bit variant of the Microsoft Binary Format (MBF) floating-point format since 1977.
The IEEE 754 floating-point standard recommends that implementations provide extended precision formats. The standard specifies the minimum requirements for an extended format but does not specify an encoding; the encoding is the implementor's choice.
The IA-32, x86-64, and Itanium processors support an 80-bit "double extended" extended precision format with a 64-bit significand. The Intel 8087 math coprocessor was the first x86 device which supported floating-point arithmetic in hardware. It was designed to support a 32-bit "single precision" format and a 64-bit "double-precision" format for encoding and interchanging floating-point numbers. The temporary real (extended) format was designed not to store data at higher precision as such, but rather primarily to allow for the computation of double results more reliably and accurately by minimizing overflow and roundoff errors in intermediate calculations. For example, many floating-point algorithms (e.g. exponentiation) suffer from significant precision loss when computed using the most direct implementations. To mitigate such issues the internal registers in the 8087 were designed to hold intermediate results in an 80-bit "extended precision" format. The 8087 automatically converts numbers to this format when loading floating-point registers from memory and also converts results back to the more conventional formats when storing the registers back into memory. To enable intermediate subexpression results to be saved in extended precision scratch variables and carried across programming language statements, and to let interrupted calculations resume where they left off, the 8087 provides instructions that transfer values between these internal registers and memory without performing any conversion. This gives programs access to the extended format for calculations, which also revives the issue of the accuracy of functions of such numbers, but at a higher precision.
The floating-point units (FPU) on all subsequent x86 processors have supported this format. As a result, software can be developed which takes advantage of the higher precision provided by this format. William Kahan, a primary designer of the x87 arithmetic and of the initial IEEE 754 standard proposal, notes on the development of the x87 floating point: "An extended format as wide as we dared (80 bits) was included to serve the same support role as the 13 decimal internal format serves in Hewlett-Packard's 10 decimal calculators." Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087, and that the x87 extended precision was designed to be extensible to higher precision in future processors: "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16 byte format. ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."
The Motorola 6888x math coprocessors and the Motorola 68040 and 68060 processors support this same 64-bit significand extended precision type (similar to the Intel format, although padded to a 96-bit format with 16 unused bits inserted between the exponent and significand fields). The follow-on Coldfire processors do not support this 96-bit extended precision format.
The FPA10 math coprocessor for early ARM processors also supports this extended precision type (similar to the Intel format although padded to a 96-bit format with 16 zero bits inserted between the sign and the exponent fields), but without correct rounding.
The x87 and Motorola 68881 80-bit formats meet the requirements of the IEEE 754 double extended format, as does the IEEE 754 128-bit format.
The x86 extended precision format is an 80-bit format first implemented in the Intel 8087 math coprocessor and is supported by all processors that are based on the x86 design and incorporate a floating-point unit (FPU). This 80-bit format uses one bit for the sign of the significand, 15 bits for the exponent field (i.e. the same range as the 128-bit quadruple precision IEEE 754 format) and 64 bits for the significand. The exponent field is biased by 16383, meaning that 16383 has to be subtracted from the value in the exponent field to compute the actual power of 2. An exponent field value of 32767 (all fifteen bits 1) is reserved so as to enable the representation of special states such as infinity and Not a Number. If the exponent field is zero, the value is a denormal number and the exponent of 2 is −16382.
In the following table, "s" is the value of the sign bit (0 means positive, 1 means negative), "e" is the value of the exponent field interpreted as a positive integer, and "m" is the significand interpreted as a positive binary number where the binary point is located between bits 63 and 62. The "m" field is the combination of the integer part (bit 63) and the fraction part (bits 62-0).
|Exponent field||Bit 63||Bits 62-0||Meaning|
|All zeros||Zero||Zero||Zero. The sign bit gives the sign of the zero.|
|All zeros||Zero||Non-zero||Denormal. The value is $(-1)^s \times m \times 2^{-16382}$|
|All zeros||One||Anything||Pseudo Denormal. The 80387 and later properly interpret this value but will not generate it. The value is $(-1)^s \times m \times 2^{-16382}$|
|Exponent field||Bits 63,62||Bits 61-0||Meaning|
|All ones||00||Zero||Pseudo-Infinity. The sign bit gives the sign of the infinity. The 8087 and 80287 treat this as Infinity. The 80387 and later treat this as an invalid operand.|
|All ones||00||Non-zero||Pseudo Not a Number. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number. The 80387 and later treat this as an invalid operand.|
|All ones||01||Anything||Pseudo Not a Number. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number. The 80387 and later treat this as an invalid operand.|
|All ones||10||Zero||Infinity. The sign bit gives the sign of the infinity. The 8087 and 80287 treat this as a Signaling Not a Number. The 8087 and 80287 coprocessors used the pseudo-infinity representation for infinities.|
|All ones||10||Non-zero||Signaling Not a Number. The sign bit is meaningless.|
|All ones||11||Zero||Floating-point Indefinite, the result of invalid calculations such as square root of a negative number, logarithm of a negative number, 0/0, infinity / infinity, infinity times 0, and others when the processor has been configured to not generate exceptions for invalid operands. The sign bit is meaningless. This is a special case of a Quiet Not a Number.|
|All ones||11||Non-zero||Quiet Not a Number. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number.|
|Exponent field||Bit 63||Bits 62-0||Meaning|
|All other values||Zero||Anything||Unnormal. Only generated on the 8087 and 80287. The 80387 and later treat this as an invalid operand. The value is $(-1)^s \times m \times 2^{e-16383}$|
|All other values||One||Anything||Normalized value. The value is $(-1)^s \times m \times 2^{e-16383}$|
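The classification in the table above can be sketched in a few lines of Python. This is only an illustrative decoder, not a reference implementation: the function names `decode_x87` and `pack_x87` are hypothetical, the input is assumed to be the 10 bytes of the value in little-endian order (as stored in memory on x86), and the labels follow the 80387-and-later interpretation. Values are returned as exact fractions because the format's range far exceeds that of a Python `float`.

```python
from fractions import Fraction

BIAS = 16383  # exponent bias of the 80-bit format


def pack_x87(s: int, e: int, m: int) -> bytes:
    """Hypothetical helper: build the 10-byte little-endian encoding from its fields."""
    return ((s << 79) | (e << 64) | m).to_bytes(10, "little")


def decode_x87(raw: bytes):
    """Return (kind, value) for a 10-byte little-endian x87 extended value.

    value is an exact Fraction for finite encodings and None otherwise.
    The sign of a zero is not preserved in the returned Fraction.
    """
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    bits = int.from_bytes(raw, "little")
    m = bits & ((1 << 64) - 1)       # bits 63..0: explicit integer bit plus fraction
    e = (bits >> 64) & 0x7FFF        # 15-bit biased exponent field
    s = bits >> 79                   # sign bit
    sign = -1 if s else 1
    sig = Fraction(m, 1 << 63)       # binary point lies between bits 63 and 62

    if e == 0:                       # exponent field all zeros
        if m == 0:
            return "zero", Fraction(0)
        kind = "pseudo-denormal" if m >> 63 else "denormal"
        return kind, sign * sig * Fraction(2) ** (1 - BIAS)

    if e == 0x7FFF:                  # exponent field all ones
        top2 = m >> 62               # bits 63 and 62
        low62 = m & ((1 << 62) - 1)  # bits 61..0
        if top2 == 0b00:
            return ("pseudo-infinity" if low62 == 0 else "pseudo-NaN"), None
        if top2 == 0b01:
            return "pseudo-NaN", None
        if top2 == 0b10:
            return ("infinity" if low62 == 0 else "signaling NaN"), None
        return ("indefinite" if low62 == 0 else "quiet NaN"), None

    # All other exponent values: bit 63 distinguishes normal from unnormal.
    kind = "normal" if m >> 63 else "unnormal"
    return kind, sign * sig * Fraction(2) ** (e - BIAS)


print(decode_x87(pack_x87(0, 16383, 1 << 63)))  # the encoding of 1.0
```

Returning a `Fraction` keeps the sketch exact for every finite encoding, including the smallest denormal, at the cost of very large denominators.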
In contrast to the single and double-precision formats, this format does not utilize an implicit/hidden bit. Rather, bit 63 contains the integer part of the significand and bits 62-0 hold the fractional part. Bit 63 will be 1 on all normalized numbers. There were several advantages to this design when the 8087 was being developed:
The 80-bit floating-point format was widely available by 1984. Some language implementations offer an extended type, and several Fortran compilers have a REAL*10 type (analogous to REAL*8). Such compilers also typically include extended-precision mathematical subroutines, such as square root and trigonometric functions, in their standard libraries.
The 80-bit floating-point format has a range (including subnormals) from approximately $3.65 \times 10^{-4951}$ to $1.18 \times 10^{4932}$. Although $\log_{10}(2^{64}) \approx 19.266$, this format is usually described as giving approximately eighteen significant digits of precision (the floor of $\log_{10}(2^{63})$, the minimum guaranteed precision). The use of decimal when talking about binary is unfortunate because most decimal fractions are recurring sequences in binary just as 2/3 is in decimal. Thus, a value such as 10.15 is represented in binary as equivalent to 10.1499996185 etc. in decimal for REAL*4 but 10.15000000000000035527 etc. in REAL*8: interconversion will involve approximation except for those few decimal fractions that represent an exact binary value, such as 0.625. For REAL*10, the decimal string is 10.1499999999999999996530553 etc. The last 9 digit is the eighteenth fractional digit and thus the twentieth significant digit of the string. Bounds on conversion between decimal and binary for the 80-bit format can be given as follows: if a decimal string with at most 18 significant digits is correctly rounded to an 80-bit IEEE 754 binary floating-point value (as on input) then converted back to the same number of significant decimal digits (as for output), then the final string will exactly match the original; while, conversely, if an 80-bit IEEE 754 binary floating-point value is correctly converted and (nearest) rounded to a decimal string with at least 21 significant decimal digits then converted back to binary format it will exactly match the original. These approximations are particularly troublesome when specifying the best value for constants in formulae to high precision, as might be calculated via arbitrary-precision arithmetic.
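The 18-digit and 21-digit bounds quoted above agree with the commonly used round-trip formulas for a binary format with a p-bit significand: floor((p − 1)·log10 2) digits on input and ceil(1 + p·log10 2) digits on output. The following minimal sketch evaluates them; the function name `decimal_round_trip_digits` is hypothetical, and the formulas are assumed rather than taken from this article.

```python
import math


def decimal_round_trip_digits(p: int) -> tuple[int, int]:
    """Return (d_in, d_out) for a binary format with a p-bit significand.

    d_in:  decimal strings with at most d_in significant digits survive
           decimal -> binary -> decimal unchanged.
    d_out: binary values survive binary -> decimal -> binary when written
           with at least d_out significant digits.
    """
    log10_2 = math.log10(2)
    d_in = math.floor((p - 1) * log10_2)
    d_out = math.ceil(1 + p * log10_2)
    return d_in, d_out


# Significand widths: single (24), double (53), x87 extended (64).
for p in (24, 53, 64):
    print(p, decimal_round_trip_digits(p))
```

For p = 64 this gives the pair (18, 21) stated in the text.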
A notable example of the need for a minimum of 64 bits of precision in the significand of the extended precision format is the need to avoid precision loss when performing exponentiation on double-precision values. The x86 floating-point units do not provide an instruction that directly performs exponentiation. Instead they provide a set of instructions that a program can use in sequence to perform exponentiation using the equation:
$$x^y = 2^{y \cdot \log_2(x)}$$
In order to avoid precision loss, the intermediate results "log2(x)" and "y·log2(x)" must be computed with much higher precision, because effectively both the exponent and the significand fields of x must fit into the significand field of the intermediate result. Subsequently the significand field of the intermediate result is split between the exponent and significand fields of the final result when $2^{\text{intermediate result}}$ is calculated. The following discussion describes this requirement in more detail.
With a little unpacking, an IEEE 754 double-precision value can be represented as:
$$2^{(-1)^s E} \cdot M$$
where s is the sign of the exponent (either 0 or 1), E is the unbiased exponent, which is an integer that ranges from 0 to 1023, and M is the significand, which is a 53-bit value that falls in the range 1 ≤ M < 2. Negative numbers and zero can be ignored because the logarithm of these values is undefined. For purposes of this discussion M does not have 53 bits of precision because it is constrained to be greater than or equal to one, i.e. the hidden bit does not count towards the precision. (Note that in situations where M is less than 1, the value is actually a denormal and therefore may have already suffered precision loss. This situation is beyond the scope of this article.)
Taking the log of this representation of a double-precision number and simplifying results in the following:
$$\log_2\left(2^{(-1)^s E} \cdot M\right) = (-1)^s E \cdot \log_2(2) + \log_2(M) = \pm E + \log_2(M)$$
This result demonstrates that when taking base 2 logarithm of a number, the sign of the exponent of the original value becomes the sign of the logarithm, the exponent of the original value becomes the integer part of the significand of the logarithm, and the significand of the original value is transformed into the fractional part of the significand of the logarithm.
Because E is an integer in the range 0 to 1023, up to 10 bits to the left of the radix point are needed to represent the integer part of the logarithm. Because M falls in the range 1 ≤ M < 2, the value of $\log_2(M)$ will fall in the range 0 ≤ $\log_2(M)$ < 1, so at least 52 bits are needed to the right of the radix point to represent the fractional part of the logarithm. Combining 10 bits to the left of the radix point with 52 bits to the right of the radix point means that the significand part of the logarithm must be computed to at least 62 bits of precision. In practice values of M less than $\sqrt{2}$ require 53 bits to the right of the radix point and values of M less than $\sqrt[4]{2}$ require 54 bits to the right of the radix point to avoid precision loss. Balancing this requirement for added precision to the right of the radix point, exponents less than 512 only require 9 bits to the left of the radix point and exponents less than 256 require only 8 bits to the left of the radix point.
The final part of the exponentiation calculation is computing $2^{\text{intermediate result}}$. The "intermediate result" consists of an integer part "I" added to a fractional part "F". If the intermediate result is negative then a slight adjustment is needed to get a positive fractional part because both "I" and "F" are negative numbers.
For positive intermediate results:
$$2^{\text{intermediate result}} = 2^{I+F} = 2^{I} \cdot 2^{F}$$
For negative intermediate results:
$$2^{\text{intermediate result}} = 2^{I+F} = 2^{I+(1-1)+F} = 2^{(I-1)+(F+1)} = 2^{I-1} \cdot 2^{F+1}$$
Thus the integer part of the intermediate result ("I" or "I−1") plus a bias becomes the exponent of the final result, and the transformed positive fractional part of the intermediate result, $2^{F}$ or $2^{1+F}$, becomes the significand of the final result. In order to supply 52 bits of precision to the final result, the positive fractional part must be maintained to at least 52 bits.
In conclusion, the exact number of bits of precision needed in the significand of the intermediate result is somewhat data dependent but 64 bits is sufficient to avoid precision loss in the vast majority of exponentiation computations involving double-precision numbers.
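The exponentiation scheme discussed above can be illustrated with a small Python sketch that deliberately keeps the intermediate result in double precision, so that the resulting error can be measured against an exact value. It is not how any particular FPU or math library implements exponentiation; the names `pow_via_log2` and `relative_error` are hypothetical, x is assumed to be positive and finite, and overflow and special values are not handled.

```python
import math
from fractions import Fraction


def pow_via_log2(x: float, y: float) -> float:
    """Compute x**y as 2**(y*log2(x)) with a double-precision intermediate."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    t = y * math.log2(x)   # the intermediate result, rounded to a 53-bit significand
    i = math.floor(t)      # I for t >= 0; I - 1 for negative non-integer t
    f = t - i              # positive fractional part: F, or 1 + F when t is negative
    # 2**f lies in [1, 2) and supplies the significand; i supplies the exponent.
    return math.ldexp(2.0 ** f, i)


def relative_error(approx: float, exact: Fraction) -> float:
    """Relative error of a double against an exact rational reference."""
    return float(abs(Fraction(approx) - exact) / exact)


exact = Fraction(3) ** 200
print(relative_error(pow_via_log2(3.0, 200.0), exact))  # 53-bit intermediate
print(relative_error(3.0 ** 200, exact))                # platform pow, for comparison
```

Because the integer part of the intermediate result takes up several of the 53 available bits, fewer bits remain for the fractional part, which is the effect that a 64-bit significand is meant to absorb.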
The number of bits needed for the exponent of the extended precision format follows from the requirement that the product of two double-precision numbers should not overflow when computed using the extended format. The largest possible exponent of a double-precision value is 1023 so the exponent of the largest possible product of two double-precision numbers is 2047 (an 11-bit value). Adding in a bias to account for negative exponents means that the exponent field must be at least 12 bits wide.
Combining these requirements (1 bit for the sign, 12 bits for the biased exponent, and 64 bits for the significand) means that the extended precision format would need at least 77 bits. Engineering considerations resulted in the final definition of the 80-bit format (in particular, the IEEE 754 standard requires the exponent range of an extended precision format to match that of the next larger format, quadruple precision, whose exponent field is 15 bits wide).
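The bit count above is simple arithmetic, restated here as a short Python sketch. The function name `min_extended_width` and its parameters are hypothetical and only mirror the numbers used in the text.

```python
def min_extended_width(max_exponent: int = 1023, significand_bits: int = 64) -> tuple[int, int]:
    """Return (exponent_bits, total_bits) under the reasoning in the text."""
    # The product of two values below 2**(max_exponent + 1) has an exponent
    # of at most 2 * max_exponent + 1, which is 2047 for double precision.
    product_exponent = 2 * max_exponent + 1
    magnitude_bits = product_exponent.bit_length()  # 11 bits hold 2047
    exponent_bits = magnitude_bits + 1              # one more bit so the bias covers negative exponents
    total_bits = 1 + exponent_bits + significand_bits  # sign + exponent + significand
    return exponent_bits, total_bits


print(min_extended_width())
```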
Iterative refinement schemes are another example of calculations that benefit from extended precision arithmetic. They are used to indirectly clean out errors accumulated in the direct solution during the typically very large number of calculations made for numerical linear algebra.
Several language implementations expose the 80-bit format through a named type:
long double: implemented by compilers including GCC using 80-bit floating-point numbers on x86 systems. This is implementation-defined behavior that is allowed, but not required, by the standard, as specified for IEEE 754 hardware in the C99 standard "Annex F IEC 60559 floating-point arithmetic".
long-float: implemented using 80-bit floating-point numbers on x86 systems.
real: implemented using the largest floating-point size available in hardware, which is 80 bits for x86 CPUs, or double precision, whichever is larger.
EXTENDED: a 10-byte extended-precision floating-point data type.