In computing, decimal64 is a decimal floating-point computer number format that occupies 8 bytes (64 bits) in computer memory.
decimal64 is well suited to replace the binary64 format in applications where small decimal deviations are unacceptable and speed is not critical.
In contrast to the binaryxxx data formats, the decimalxxx formats provide exact representation of decimal fractions, exact calculations with them, and the familiar 'ties away from zero' rounding (within some range, to some precision, to some degree), in a trade-off for reduced performance. They are intended for applications that need to come near to schoolhouse math, such as financial and tax computations. (In short, they avoid problems like 0.2 + 0.1 -> 0.30000000000000004, which happen with binary64 datatypes.)
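To make this concrete, here is a minimal sketch using Python's built-in decimal module (which implements general decimal arithmetic, not the decimal64 storage format itself); the context precision is narrowed to 16 digits to mimic decimal64:

```python
from decimal import Decimal, getcontext

# binary64: 0.1 and 0.2 have no exact base-2 representation,
# so the sum picks up a visible rounding error
print(0.1 + 0.2)                        # 0.30000000000000004

# decimal arithmetic represents the decimal fractions exactly
getcontext().prec = 16                  # mimic decimal64's 16-digit precision
print(Decimal('0.1') + Decimal('0.2'))  # 0.3
```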
Decimal64 supports 'normal' values with 16-digit precision from ±1.000000000000000×10^−383 to ±9.999999999999999×10^384, plus 'denormal' values with ramp-down relative precision down to ±1×10^−398, signed zeros, signed infinities and NaN (Not a Number).
The binary format of the same size supports a range from the denormal minimum ±5×10^−324, over the normal minimum with full 53-bit precision ±2.2250738585072014×10^−308, to the maximum ±1.7976931348623157×10^308.
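These binary64 limits can be checked directly, for instance with Python's sys.float_info (a quick check of the binary format, not of the decimal formats themselves):

```python
import sys

print(sys.float_info.min)   # 2.2250738585072014e-308, smallest normal binary64
print(sys.float_info.max)   # 1.7976931348623157e+308, largest binary64
print(5e-324)               # smallest denormal binary64
```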
decimal64 values are represented in an unnormalized format close to scientific notation, in which some bits of the exponent are combined with the leading bits of the significand in a 'combination field'.
Sign | Combination | Trailing significand bits |
---|---|---|
1 bit | 13 bits | 50 bits |
s | mmmmmmmmmmmmm | tttttttttttttttttttttttttttttttttttttttttttttttttt |
IEEE 754 allows two alternative encodings for decimal64 values. The standard does not specify how to signify which representation is used, for instance in a situation where decimal64 values are communicated between systems:

* a binary encoding, where the significand is stored as a binary integer (BID, 'binary integer decimal')
* a decimal encoding, where the significand is stored digit by digit (DPD, 'densely packed decimal')
Both alternatives provide exactly the same set of representable numbers: 16 digits of significand and 3 × 2^8 = 768 possible decimal exponent values. (All the possible decimal exponent values storable in a binary64 number are representable in decimal64, and most bits of the significand of a binary64 are stored, keeping roughly the same number of decimal digits in the significand.)
In both cases, the most significant 4 bits of the significand (which actually only have 10 possible values) are combined with two bits of the exponent (3 possible values) to use 30 of the 32 possible values of a 5-bit field. The remaining combinations encode infinities and NaNs. BID and DPD use different bits of the combination field for that.
In the cases of Infinity and NaN, all other bits of the encoding are ignored. Thus, it is possible to initialize an array to Infinities or NaNs by filling it with a single byte value.
Because the significand for the IEEE 754 decimal formats is not normalized, most values with fewer than 16 significant digits have multiple possible representations; 1000000 × 10^−2 = 100000 × 10^−1 = 10000 × 10^0 = 1000 × 10^1 all have the value 10000. Such a set of representations of the same value is called a cohort; its different members can be used to denote how many digits of the value are known precisely.
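Python's decimal module keeps significands unnormalized in the same spirit, which makes cohort behavior easy to observe (an illustration of the concept, not of the decimal64 bit encoding):

```python
from decimal import Decimal

# Three members of the same cohort: equal in value,
# but carrying different exponents / numbers of stored digits
a = Decimal('1000000E-2')
b = Decimal('10000E0')
c = Decimal('1000E1')

print(a == b == c)    # True -- same value
print(a, b, c)        # 10000.00 10000 1.000E+4 -- different members
print(a.normalize())  # 1E+4 -- canonical (normalized) member
```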
This format uses a binary significand from 0 to 10^16 − 1 = 9999999999999999 = 2386F26FC0FFFF₁₆ = 100011100001101111001001101111110000001111111111111111₂. The encoding, completely stored on 64 bits, can represent binary significands up to 10 × 2^50 − 1 = 11258999068426239 = 27FFFFFFFFFFFF₁₆, but values larger than 10^16 − 1 are illegal (and the standard requires implementations to treat them as 0, if encountered on input).
As described above, the encoding varies depending on whether the most significant 4 bits of the significand are in the range 0 to 7 (0000₂ to 0111₂), or higher (1000₂ or 1001₂).
If the 2 bits after the sign bit are "00", "01", or "10", then the exponent field consists of the 10 bits following the sign bit, and the significand is the remaining 53 bits, with an implicit leading 0 bit. This includes subnormal numbers where the leading significand digit is 0.
If the 2 bits after the sign bit are "11", then the 10-bit exponent field is shifted 2 bits to the right (it follows the sign bit and the "11" bits), and the represented significand is in the remaining 51 bits. In this case there is an implicit (that is, not stored) leading 3-bit sequence "100" for the most significant bits of the true significand (in the remaining lower bits ttt...ttt of the significand, not all possible values are used).
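Putting the two cases together, a decoder for raw BID-encoded bit patterns can be sketched as follows (a minimal sketch, not a reference implementation: the function name is illustrative, and the result is returned as an ordinary Python number, so decimal exactness is not preserved):

```python
def decode_decimal64_bid(bits: int):
    """Decode a raw 64-bit BID-encoded decimal64 (sketch)."""
    sign = -1 if bits >> 63 else 1
    if (bits >> 59) & 0xF == 0xF:            # combination starts '1111'
        if (bits >> 58) & 1 == 0:            # '11110' -> ±Infinity
            return sign * float('inf')
        return float('nan')                  # '11111' -> NaN
    if (bits >> 61) & 0b11 == 0b11:          # 'big' significand, 54 bits
        exponent = (bits >> 51) & 0x3FF      # exponent field shifted right 2 bits
        significand = (0b100 << 51) | (bits & ((1 << 51) - 1))
    else:                                    # 'small' significand, 53 bits
        exponent = (bits >> 53) & 0x3FF
        significand = bits & ((1 << 53) - 1)
    if significand > 10**16 - 1:             # illegal encodings read as zero
        significand = 0
    return sign * significand * 10**(exponent - 398)

print(decode_decimal64_bid(0x31C0000000000001))  # 1 (exponent 398, significand 1)
```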
Be aware that the bit numbering used in the tables (e.g. m12 … m0) runs in the opposite direction to that used in the IEEE 754 standard document (G0 … G12).
Combination field (m12 … m0) | Exponent | Significand / Description |
---|---|---|
a b c d m m m m m m e f g (not starting with '11': bits ab = 00, 01 or 10) | abcdmmmmmm | (0)efgtttttttttttttttttttttttttttttttttttttttttttttttttt. Finite number with 'small' significand (< 9007199254740992, fits into 53 bits). |
1 1 c d m m m m m m e f g (starting with '11' but not '1111': bits cd = 00, 01 or 10) | cdmmmmmmef | 100gtttttttttttttttttttttttttttttttttttttttttttttttttt. Finite number with 'big' significand (> 9007199254740991, needs 54 bits). |
1 1 1 1 0 | | ±Infinity |
1 1 1 1 1 0 | | quiet NaN |
1 1 1 1 1 1 | | signaling NaN (with payload in significand) |
In contrast to the DPD format below, the leading bits of the significand field do not encode the most significant decimal digit; combined with the implicit prefix '100' for big significands, they are simply part of a larger pure-binary number.
The resulting 'raw' exponent is a 10-bit binary integer whose leading bits are not '11', thus taking values 0 … 1011111111₂ = 0 … 767₁₀, from which the bias of 398 is subtracted to give the decimal exponent. The resulting significand can be a positive binary integer of 54 bits, up to 1001 1111111111 1111111111 1111111111 1111111111 1111111111₂ = 11258999068426239₁₀, but values above 10^16 − 1 = 9999999999999999 = 2386F26FC0FFFF₁₆ are 'illegal' and have to be treated as zeros. To obtain the individual decimal digits, the significand is divided by 10 repeatedly.
In the above cases, the value represented is

(−1)^sign × significand × 10^(exponent − 398)
In this version, the significand is stored as a series of decimal digits. The leading digit is between 0 and 9 (3 or 4 binary bits), and the rest of the significand uses the densely packed decimal (DPD) encoding.
The leading 2 bits of the exponent and the leading digit (3 or 4 bits) of the significand are combined into the five bits that follow the sign bit.
The eight bits after that are the exponent continuation field, providing the less-significant bits of the exponent.
The last 50 bits are the significand continuation field, consisting of five 10-bit declets. [1] Each declet encodes three decimal digits [1] using the DPD encoding.
If the first two bits after the sign bit are "00", "01", or "10", then those are the leading bits of the exponent, and the three bits "cde" after that are interpreted as the leading decimal digit (0 to 7):
If the first two bits after the sign bit are "11", then the next two bits are the leading bits of the exponent, and the next bit "e" is prefixed with implicit bits "100" to form the leading decimal digit of the significand (8 or 9):
The remaining two combinations (11 110 and 11 111) of the 5-bit field after the sign bit are used to represent ±infinity and NaNs, respectively.
Combination field (m12 … m0) | Exponent | Significand / Description |
---|---|---|
a b c d e m m m m m m m m (not starting with '11': bits ab = 00, 01 or 10) | abmmmmmmmm | (0)cde tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt. Finite number with small first digit of significand (0 … 7). |
1 1 c d e m m m m m m m m (starting with '11' but not '1111': bits cd = 00, 01 or 10) | cdmmmmmmmm | 100e tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt. Finite number with big first digit of significand (8 or 9). |
1 1 1 1 0 | | ±Infinity |
1 1 1 1 1 0 | | quiet NaN |
1 1 1 1 1 1 | | signaling NaN (with payload in significand) |
The resulting 'raw' exponent is a 10-bit binary integer whose leading bits are not '11', thus taking values 0 … 1011111111₂ = 0 … 767₁₀, from which the bias of 398 is subtracted to give the decimal exponent. The leading decimal digit of the significand is formed from the (0)cde or 100e bits as a binary integer. The subsequent digits are encoded in the 10-bit declet fields 'tttttttttt' according to the DPD rules (see below). The full decimal significand is then obtained by concatenating the leading and trailing decimal digits.
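As a sketch, the extraction of the leading exponent bits and the leading digit from the 5 bits after the sign bit can be written as follows (the helper name split_combination is illustrative; the Inf/NaN combinations 11110 and 11111 are assumed to be screened out beforehand):

```python
def split_combination(comb5: int):
    """Split the 5 bits after the sign bit (DPD variant) into
    (leading 2 exponent bits, leading decimal digit)."""
    if comb5 >> 3 == 0b11:             # '11 cde' pattern: large leading digit
        exp_hi = (comb5 >> 1) & 0b11   # bits cd
        digit = 8 + (comb5 & 1)        # implicit '100' + bit e -> 8 or 9
    else:                              # 'ab cde' pattern: ab = 00, 01 or 10
        exp_hi = comb5 >> 3
        digit = comb5 & 0b111          # leading digit 0 .. 7
    return exp_hi, digit

print(split_combination(0b01000))  # (1, 0)
print(split_combination(0b11011))  # (1, 9)
```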
The 10-bit DPD to 3-digit BCD transcoding for the declets is given by the following table. b9 … b0 are the bits of the DPD declet, and d2 … d0 are the three BCD digits. Be aware that the bit numbering used here (b9 … b0) runs in the opposite direction to that used in the IEEE 754 standard document (b0 … b9); in addition, the decimal digits are numbered 0-based here, while they run in the opposite direction and are 1-based in the IEEE 754 document. The fixed indicator bits (the '0' and '1' entries in the table) do not contribute to the value directly; they signal how to interpret the other bits. The concept is to mark which digits are small (0 … 7) and encoded in three bits, and which are large (8 or 9), reconstructed from an implied prefix '100' plus one bit selecting 8 or 9.
Code space (1024 states) | b9 | b8 | b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 | d2 | d1 | d0 | Values encoded | Description | Occurrences (1000 states) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50.0% (512 states) | a | b | c | d | e | f | 0 | g | h | i | 0abc | 0def | 0ghi | (0–7) (0–7) (0–7) | 3 small digits | 51.2% (512 states) |
37.5% (384 states) | a | b | c | d | e | f | 1 | 0 | 0 | i | 0abc | 0def | 100i | (0–7) (0–7) (8–9) | 2 small digits, 1 large digit | 38.4% (384 states) |
 | a | b | c | g | h | f | 1 | 0 | 1 | i | 0abc | 100f | 0ghi | (0–7) (8–9) (0–7) | | |
 | g | h | c | d | e | f | 1 | 1 | 0 | i | 100c | 0def | 0ghi | (8–9) (0–7) (0–7) | | |
9.375% (96 states) | g | h | c | 0 | 0 | f | 1 | 1 | 1 | i | 100c | 100f | 0ghi | (8–9) (8–9) (0–7) | 1 small digit, 2 large digits | 9.6% (96 states) |
 | d | e | c | 0 | 1 | f | 1 | 1 | 1 | i | 100c | 0def | 100i | (8–9) (0–7) (8–9) | | |
 | a | b | c | 1 | 0 | f | 1 | 1 | 1 | i | 0abc | 100f | 100i | (0–7) (8–9) (8–9) | | |
3.125% (32 states, 8 used) | x | x | c | 1 | 1 | f | 1 | 1 | 1 | i | 100c | 100f | 100i | (8–9) (8–9) (8–9) | 3 large digits (b9, b8: don't care) | 0.8% (8 states) |
The 8 decimal values whose digits are all 8s or 9s have four codings each. The bits marked x in the table above are ignored on input, but will always be 0 in computed results. (The 8 × 3 = 24 non-standard encodings fill the unused range from 10^3 = 1000 to 2^10 − 1 = 1023.)
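The table translates directly into code. Below is a possible declet decoder, a straightforward transcription of the rows above (the function name is illustrative):

```python
def decode_declet(b: int) -> int:
    """Decode one 10-bit DPD declet into three decimal digits (0..999)."""
    b9, b8, b7 = (b >> 9) & 1, (b >> 8) & 1, (b >> 7) & 1
    b6, b5, b4 = (b >> 6) & 1, (b >> 5) & 1, (b >> 4) & 1
    b3, b2, b1, b0 = (b >> 3) & 1, (b >> 2) & 1, (b >> 1) & 1, b & 1

    if b3 == 0:                       # 3 small digits
        d2, d1, d0 = 4*b9 + 2*b8 + b7, 4*b6 + 2*b5 + b4, 4*b2 + 2*b1 + b0
    elif (b2, b1) == (0, 0):          # small, small, large
        d2, d1, d0 = 4*b9 + 2*b8 + b7, 4*b6 + 2*b5 + b4, 8 + b0
    elif (b2, b1) == (0, 1):          # small, large, small
        d2, d1, d0 = 4*b9 + 2*b8 + b7, 8 + b4, 4*b6 + 2*b5 + b0
    elif (b2, b1) == (1, 0):          # large, small, small
        d2, d1, d0 = 8 + b7, 4*b6 + 2*b5 + b4, 4*b9 + 2*b8 + b0
    elif (b6, b5) == (0, 0):          # large, large, small
        d2, d1, d0 = 8 + b7, 8 + b4, 4*b9 + 2*b8 + b0
    elif (b6, b5) == (0, 1):          # large, small, large
        d2, d1, d0 = 8 + b7, 4*b9 + 2*b8 + b4, 8 + b0
    elif (b6, b5) == (1, 0):          # small, large, large
        d2, d1, d0 = 4*b9 + 2*b8 + b7, 8 + b4, 8 + b0
    else:                             # large, large, large (b9, b8 ignored)
        d2, d1, d0 = 8 + b7, 8 + b4, 8 + b0
    return 100*d2 + 10*d1 + d0

print(decode_declet(0b0010100011))  # 123
print(decode_declet(0b1111111111))  # 999
```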
In the above cases, with the true significand decoded as a sequence of decimal digits, the value represented is

(−1)^sign × significand × 10^(exponent − 398)
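Combining the pieces, a decoder for finite DPD-encoded values might look like this, reusing the illustrative helpers split_combination and decode_declet from the sketches above (Inf/NaN combinations again assumed screened out; the result is an ordinary Python number, so decimal exactness is not preserved):

```python
def decode_decimal64_dpd(bits: int):
    """Decode a raw 64-bit DPD-encoded finite decimal64 (sketch)."""
    sign = -1 if bits >> 63 else 1
    exp_hi, lead = split_combination((bits >> 58) & 0x1F)
    exponent = (exp_hi << 8) | ((bits >> 50) & 0xFF)  # + 8 continuation bits
    significand = lead
    for i in range(4, -1, -1):                        # five 10-bit declets
        significand = significand * 1000 + decode_declet((bits >> (10 * i)) & 0x3FF)
    return sign * significand * 10**(exponent - 398)

print(decode_decimal64_dpd(0x2238000000000001))  # 1 (exponent 398, significand 1)
```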
decimal64 was formally introduced in the 2008 revision [3] of the IEEE 754 standard, which was subsequently adopted as the ISO/IEC/IEEE 60559:2011 [4] standard.
Zero has 768 possible representations (1536 when signed zeros are counted, in two different cohorts), and even more if the 'illegal' significands, which have to be treated as zero, are counted.
The gain in range and precision from the 'combination encoding' arises because the 2 bits taken from the exponent use only three of their four possible states, and the 4 MSBs of the significand stay within 0000 … 1001 (10 of 16 possible states). In total that is 3 × 10 = 30 possible values when combined in one encoding, which is representable in 5 instead of 6 bits (30 < 2^5 = 32).