In computing, minifloats are floating-point values represented with very few bits. This reduced precision makes them ill-suited for general-purpose numerical calculations, but they are useful for special purposes such as:
Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.
Minifloats with 16 bits are half-precision numbers (as opposed to single and double precision). There are also minifloats with 8 bits or even fewer. [2]
Minifloats can be designed following the principles of the IEEE 754 standard. In this case they must obey the (not explicitly written) rules for the boundary between subnormal and normal numbers and must have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The 2008 revision of the standard, IEEE 754-2008, defines a 16-bit binary format.
A minifloat is usually described using a tuple of four numbers, (S, E, M, B):
A minifloat format denoted by (S, E, M, B) is, therefore, S + E + M bits long. The (S, E, M, B) notation can be converted to a (B, P, L, U) format as (2, M + 1, B + 1, 2S − B) (with IEEE use of exponents).
sign | exponent | | | | significand | | |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
A minifloat in 1 byte (8 bits) with 1 sign bit, 4 exponent bits and 3 significand bits (in short, a 1.4.3 minifloat) is demonstrated here. The exponent bias is defined as 7 to center the values around 1 to match other IEEE 754 floats [3] [4] so (for most values) the actual multiplier for exponent x is 2^(x−7). All IEEE 754 principles should be valid. [5]
Numbers in a different base are marked as ..._base, for example, 101_2 = 5. The bit patterns have spaces to visualize their parts.
Zero is represented as a zero exponent with a zero mantissa. The zero exponent means zero is a subnormal number with a leading "0." prefix, and with the zero mantissa all bits after the binary point are zero, meaning this value is interpreted as 0.000_2 × 2^−6 = 0. Floating-point numbers use a signed zero, so −0 is also available and is equal to positive 0.
0 0000 000 = 0
1 0000 000 = −0
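The same signed-zero behaviour can be observed with Python's built-in `float`. This is only an illustration: Python floats are 64-bit, not minifloats, but they follow the same IEEE 754 rule that the two zeros compare equal while keeping distinct sign bits.

```python
import math

pos_zero = 0.0
neg_zero = -0.0

# The two zeros compare equal...
print(pos_zero == neg_zero)            # True

# ...but the sign bit is still observable.
print(math.copysign(1.0, pos_zero))    # 1.0
print(math.copysign(1.0, neg_zero))    # -1.0
```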
The significand is extended with "0." and the exponent value is treated as 1 higher like the least normalized number:
0 0000 001 = 0.001_2 × 2^(1−7) = 0.125 × 2^−6 = 0.001953125 (least subnormal number)
...
0 0000 111 = 0.111_2 × 2^(1−7) = 0.875 × 2^−6 = 0.013671875 (greatest subnormal number)
The significand is extended with "1.":
0 0001 000 = 1.000_2 × 2^(1−7) = 1 × 2^−6 = 0.015625 (least normalized number)
0 0001 001 = 1.001_2 × 2^(1−7) = 1.125 × 2^−6 = 0.017578125
...
0 0111 000 = 1.000_2 × 2^(7−7) = 1 × 2^0 = 1
0 0111 001 = 1.001_2 × 2^(7−7) = 1.125 × 2^0 = 1.125 (least value above 1)
...
0 1110 000 = 1.000_2 × 2^(14−7) = 1.000 × 2^7 = 128
0 1110 001 = 1.001_2 × 2^(14−7) = 1.125 × 2^7 = 144
...
0 1110 110 = 1.110_2 × 2^(14−7) = 1.750 × 2^7 = 224
0 1110 111 = 1.111_2 × 2^(14−7) = 1.875 × 2^7 = 240 (greatest normalized number)
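The four boundary values listed above follow directly from the exponent width E, mantissa width M and bias. The sketch below is one way to compute them, assuming IEEE-style semantics (subnormals use exponent 1 − bias, and the all-ones exponent is reserved); the function name `minifloat_limits` is made up for this example.

```python
def minifloat_limits(E, M, bias):
    """Return (least subnormal, greatest subnormal, least normal, greatest normal)."""
    least_sub = 2.0 ** (1 - bias - M)                  # 0.00...1 x 2^(1 - bias)
    greatest_sub = (1 - 2.0 ** -M) * 2.0 ** (1 - bias)  # 0.11...1 x 2^(1 - bias)
    least_norm = 2.0 ** (1 - bias)                     # 1.00...0 x 2^(1 - bias)
    # The largest usable exponent field is 2^E - 2, because 2^E - 1 means Inf/NaN.
    greatest_norm = (2 - 2.0 ** -M) * 2.0 ** ((1 << E) - 2 - bias)
    return least_sub, greatest_sub, least_norm, greatest_norm

print(minifloat_limits(4, 3, 7))
# (0.001953125, 0.013671875, 0.015625, 240.0)
```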
Infinity values have the highest exponent, with the mantissa set to zero. The sign bit can be either positive or negative.
0 1111 000 = +infinity
1 1111 000 = −infinity
NaN values have the highest exponent, with a non-zero value for the mantissa. A float with a 1-bit sign and a 3-bit mantissa has 2 × (2^3 − 1) = 14 NaN values.
s 1111 mmm = NaN (if mmm ≠ 000)
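The rules for zero, subnormal numbers, normalized numbers, infinity and NaN can be combined into one small decoder. This is a minimal sketch, not a reference implementation; the name `decode` and its default parameters (the 1.4.3 layout with bias 7) are choices made for this example.

```python
import math

def decode(bits, E=4, M=3, bias=7):
    """Decode a 1.E.M minifloat bit pattern (given as an int) into a Python float."""
    sign = -1.0 if (bits >> (E + M)) & 1 else 1.0
    exp = (bits >> M) & ((1 << E) - 1)
    man = bits & ((1 << M) - 1)
    if exp == (1 << E) - 1:                  # all-ones exponent: Inf or NaN
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:                             # subnormal: 0.mmm x 2^(1 - bias)
        return sign * (man / (1 << M)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << M)) * 2.0 ** (exp - bias)   # normal: 1.mmm

print(decode(0b0_0000_001))   # 0.001953125 (least subnormal)
print(decode(0b0_0111_001))   # 1.125 (least value above 1)
print(decode(0b0_1110_111))   # 240.0 (greatest normalized)
print(decode(0b1_1111_000))   # -inf
print(decode(0b0_1111_001))   # nan
```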
This is a chart of all possible values for this example 8-bit float.
Sign and exponent | … 000 | … 001 | … 010 | … 011 | … 100 | … 101 | … 110 | … 111 |
---|---|---|---|---|---|---|---|---|
0 0000 … | 0 | 0.001953125 | 0.00390625 | 0.005859375 | 0.0078125 | 0.009765625 | 0.01171875 | 0.013671875 |
0 0001 … | 0.015625 | 0.017578125 | 0.01953125 | 0.021484375 | 0.0234375 | 0.025390625 | 0.02734375 | 0.029296875 |
0 0010 … | 0.03125 | 0.03515625 | 0.0390625 | 0.04296875 | 0.046875 | 0.05078125 | 0.0546875 | 0.05859375 |
0 0011 … | 0.0625 | 0.0703125 | 0.078125 | 0.0859375 | 0.09375 | 0.1015625 | 0.109375 | 0.1171875 |
0 0100 … | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375 |
0 0101 … | 0.25 | 0.28125 | 0.3125 | 0.34375 | 0.375 | 0.40625 | 0.4375 | 0.46875 |
0 0110 … | 0.5 | 0.5625 | 0.625 | 0.6875 | 0.75 | 0.8125 | 0.875 | 0.9375 |
0 0111 … | 1 | 1.125 | 1.25 | 1.375 | 1.5 | 1.625 | 1.75 | 1.875 |
0 1000 … | 2 | 2.25 | 2.5 | 2.75 | 3 | 3.25 | 3.5 | 3.75 |
0 1001 … | 4 | 4.5 | 5 | 5.5 | 6 | 6.5 | 7 | 7.5 |
0 1010 … | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
0 1011 … | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
0 1100 … | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 |
0 1101 … | 64 | 72 | 80 | 88 | 96 | 104 | 112 | 120 |
0 1110 … | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 |
0 1111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 0000 … | −0 | −0.001953125 | −0.00390625 | −0.005859375 | −0.0078125 | −0.009765625 | −0.01171875 | −0.013671875 |
1 0001 … | −0.015625 | −0.017578125 | −0.01953125 | −0.021484375 | −0.0234375 | −0.025390625 | −0.02734375 | −0.029296875 |
1 0010 … | −0.03125 | −0.03515625 | −0.0390625 | −0.04296875 | −0.046875 | −0.05078125 | −0.0546875 | −0.05859375 |
1 0011 … | −0.0625 | −0.0703125 | −0.078125 | −0.0859375 | −0.09375 | −0.1015625 | −0.109375 | −0.1171875 |
1 0100 … | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375 |
1 0101 … | −0.25 | −0.28125 | −0.3125 | −0.34375 | −0.375 | −0.40625 | −0.4375 | −0.46875 |
1 0110 … | −0.5 | −0.5625 | −0.625 | −0.6875 | −0.75 | −0.8125 | −0.875 | −0.9375 |
1 0111 … | −1 | −1.125 | −1.25 | −1.375 | −1.5 | −1.625 | −1.75 | −1.875 |
1 1000 … | −2 | −2.25 | −2.5 | −2.75 | −3 | −3.25 | −3.5 | −3.75 |
1 1001 … | −4 | −4.5 | −5 | −5.5 | −6 | −6.5 | −7 | −7.5 |
1 1010 … | −8 | −9 | −10 | −11 | −12 | −13 | −14 | −15 |
1 1011 … | −16 | −18 | −20 | −22 | −24 | −26 | −28 | −30 |
1 1100 … | −32 | −36 | −40 | −44 | −48 | −52 | −56 | −60 |
1 1101 … | −64 | −72 | −80 | −88 | −96 | −104 | −112 | −120 |
1 1110 … | −128 | −144 | −160 | −176 | −192 | −208 | −224 | −240 |
1 1111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
There are only 242 different non-NaN values (if +0 and −0 are regarded as different), because 14 of the bit patterns represent NaNs.
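The count can be checked by brute force over all 256 bit patterns. The helper name `is_nan_pattern` is hypothetical; it only applies the NaN rule stated earlier (all-ones exponent, non-zero mantissa).

```python
def is_nan_pattern(bits, E=4, M=3):
    """True if the bit pattern has an all-ones exponent and a non-zero mantissa."""
    exp = (bits >> M) & ((1 << E) - 1)
    man = bits & ((1 << M) - 1)
    return exp == (1 << E) - 1 and man != 0

nan_count = sum(is_nan_pattern(b) for b in range(256))
print(nan_count, 256 - nan_count)   # 14 242
```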
At these small sizes other bias values may be interesting. For instance, a bias of −2 makes the numbers 0 to 16 have the same bit representation as the integers 0 to 16, at the cost that no non-integer values can be represented.
0 0000 000 = 0.000_2 × 2^(1−(−2)) = 0.0 × 2^3 = 0 (subnormal number)
0 0000 001 = 0.001_2 × 2^(1−(−2)) = 0.125 × 2^3 = 1 (subnormal number)
0 0000 111 = 0.111_2 × 2^(1−(−2)) = 0.875 × 2^3 = 7 (subnormal number)
0 0001 000 = 1.000_2 × 2^(1−(−2)) = 1.000 × 2^3 = 8 (normalized number)
0 0001 111 = 1.111_2 × 2^(1−(−2)) = 1.875 × 2^3 = 15 (normalized number)
0 0010 000 = 1.000_2 × 2^(2−(−2)) = 1.000 × 2^4 = 16 (normalized number)
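A short check of the bias −2 claim, assuming the same 1.4.3 layout and IEEE-style subnormals. The function `decode_finite` is a hypothetical helper that handles only non-negative, finite patterns.

```python
def decode_finite(bits, M=3, bias=-2):
    """Value of a non-negative, finite 1.4.3 bit pattern with the given bias."""
    exp, man = bits >> M, bits & ((1 << M) - 1)
    if exp == 0:                                   # subnormal
        return man / (1 << M) * 2.0 ** (1 - bias)
    return (1 + man / (1 << M)) * 2.0 ** (exp - bias)

# Patterns 0..16 decode to the integers 0..16.
print(all(decode_finite(n) == n for n in range(17)))   # True

# The correspondence stops after 16: the next pattern is 18, not 17.
print(decode_finite(0b0_0010_001))                      # 18.0
```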
The example above is an 8-bit float with 1 sign bit, 4 exponent bits, and 3 significand bits, which balances range against precision. However, any bit allocation is possible. A format can give more bits to the exponent when it needs more dynamic range at the cost of precision, or more bits to the significand when it needs more precision at the cost of dynamic range. At the extreme, it is possible to allocate all bits to the exponent, or all but one of the bits to the significand, leaving the exponent with only one bit. The exponent must be given at least one bit; otherwise the format no longer makes sense as a float and becomes just a signed number.
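The trade-off can be made concrete by enumerating every split of the 7 non-sign bits. The sketch below assumes an IEEE-style bias of 2^(E−1) − 1 for each split, which is a choice made here rather than something every format requires; the function names are likewise hypothetical.

```python
def positive_values(E, M, bias):
    """All positive finite values of a format with E exponent and M mantissa bits."""
    values = []
    for exp in range((1 << E) - 1):        # the all-ones exponent is Inf/NaN
        for man in range(1 << M):
            frac = man / (1 << M)
            if exp == 0:                   # subnormal
                v = frac * 2.0 ** (1 - bias)
            else:                          # normal
                v = (1 + frac) * 2.0 ** (exp - bias)
            if v > 0:
                values.append(v)
    return values

def range_for_split(E, payload_bits=7):
    """(smallest positive, largest finite) when E of the non-sign bits go to the exponent."""
    M = payload_bits - E
    bias = (1 << (E - 1)) - 1              # assumed IEEE-style bias
    values = positive_values(E, M, bias)
    return min(values), max(values)

for E in range(1, 8):
    print(E, 7 - E, *range_for_split(E))
```

With these assumptions, the 4/3 split gives the range 0.001953125 to 240 used in the example above, and the 3/4 split gives 0.015625 to 15.5.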
Here is a chart of all possible values for a different 8-bit float with 1 sign bit, 3 exponent bits and 4 significand bits. Having 1 more significand bit than exponent bits ensures that the precision remains at least 0.5 throughout the entire range. [6]
Sign and exponent | … 0000 | … 0001 | … 0010 | … 0011 | … 0100 | … 0101 | … 0110 | … 0111 | … 1000 | … 1001 | … 1010 | … 1011 | … 1100 | … 1101 | … 1110 | … 1111 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 000 … | 0 | 0.015625 | 0.03125 | 0.046875 | 0.0625 | 0.078125 | 0.09375 | 0.109375 | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375 |
0 001 … | 0.25 | 0.265625 | 0.28125 | 0.296875 | 0.3125 | 0.328125 | 0.34375 | 0.359375 | 0.375 | 0.390625 | 0.40625 | 0.421875 | 0.4375 | 0.453125 | 0.46875 | 0.484375 |
0 010 … | 0.5 | 0.53125 | 0.5625 | 0.59375 | 0.625 | 0.65625 | 0.6875 | 0.71875 | 0.75 | 0.78125 | 0.8125 | 0.84375 | 0.875 | 0.90625 | 0.9375 | 0.96875 |
0 011 … | 1 | 1.0625 | 1.125 | 1.1875 | 1.25 | 1.3125 | 1.375 | 1.4375 | 1.5 | 1.5625 | 1.625 | 1.6875 | 1.75 | 1.8125 | 1.875 | 1.9375 |
0 100 … | 2 | 2.125 | 2.25 | 2.375 | 2.5 | 2.625 | 2.75 | 2.875 | 3 | 3.125 | 3.25 | 3.375 | 3.5 | 3.625 | 3.75 | 3.875 |
0 101 … | 4 | 4.25 | 4.5 | 4.75 | 5 | 5.25 | 5.5 | 5.75 | 6 | 6.25 | 6.5 | 6.75 | 7 | 7.25 | 7.5 | 7.75 |
0 110 … | 8 | 8.5 | 9 | 9.5 | 10 | 10.5 | 11 | 11.5 | 12 | 12.5 | 13 | 13.5 | 14 | 14.5 | 15 | 15.5 |
0 111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 000 … | −0 | −0.015625 | −0.03125 | −0.046875 | −0.0625 | −0.078125 | −0.09375 | −0.109375 | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375 |
1 001 … | −0.25 | −0.265625 | −0.28125 | −0.296875 | −0.3125 | −0.328125 | −0.34375 | −0.359375 | −0.375 | −0.390625 | −0.40625 | −0.421875 | −0.4375 | −0.453125 | −0.46875 | −0.484375 |
1 010 … | −0.5 | −0.53125 | −0.5625 | −0.59375 | −0.625 | −0.65625 | −0.6875 | −0.71875 | −0.75 | −0.78125 | −0.8125 | −0.84375 | −0.875 | −0.90625 | −0.9375 | −0.96875 |
1 011 … | −1 | −1.0625 | −1.125 | −1.1875 | −1.25 | −1.3125 | −1.375 | −1.4375 | −1.5 | −1.5625 | −1.625 | −1.6875 | −1.75 | −1.8125 | −1.875 | −1.9375 |
1 100 … | −2 | −2.125 | −2.25 | −2.375 | −2.5 | −2.625 | −2.75 | −2.875 | −3 | −3.125 | −3.25 | −3.375 | −3.5 | −3.625 | −3.75 | −3.875 |
1 101 … | −4 | −4.25 | −4.5 | −4.75 | −5 | −5.25 | −5.5 | −5.75 | −6 | −6.25 | −6.5 | −6.75 | −7 | −7.25 | −7.5 | −7.75 |
1 110 … | −8 | −8.5 | −9 | −9.5 | −10 | −10.5 | −11 | −11.5 | −12 | −12.5 | −13 | −13.5 | −14 | −14.5 | −15 | −15.5 |
1 111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
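The spacing claim for the 1.3.4 format can be verified by listing the non-negative finite values and measuring the gaps between neighbours. The bias of 3 is inferred from the table (the pattern 0 011 0000 decodes to 1); the variable names are chosen for this sketch.

```python
M, bias = 4, 3

values = []
for exp in range(7):                  # exponent field 111 is reserved for Inf/NaN
    for man in range(1 << M):
        frac = man / (1 << M)
        if exp == 0:                  # subnormal
            values.append(frac * 2.0 ** (1 - bias))
        else:                         # normal
            values.append((1 + frac) * 2.0 ** (exp - bias))

# The list is already in increasing order, so neighbours are adjacent entries.
gaps = [b - a for a, b in zip(values, values[1:])]
print(max(gaps))   # 0.5
print(min(gaps))   # 0.015625
```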
Tables like the above can be generated for any combination of SEMB (sign, exponent, mantissa/significand, and bias) values using a script in Python or in GDScript.
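The following is a minimal sketch of such a generator in Python. It is not the script referred to above; the names `decode`, `fmt` and `table` are invented for this example, and it assumes one sign bit and IEEE-style special values.

```python
import math

def decode(bits, E, M, bias):
    """Decode a 1.E.M minifloat bit pattern into a Python float."""
    sign = -1.0 if (bits >> (E + M)) & 1 else 1.0
    exp = (bits >> M) & ((1 << E) - 1)
    man = bits & ((1 << M) - 1)
    if exp == (1 << E) - 1:
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:
        return sign * (man / (1 << M)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << M)) * 2.0 ** (exp - bias)

def fmt(v):
    """Format a value the way the tables do (Inf, NaN, no trailing '.0')."""
    if math.isnan(v):
        return "NaN"
    if math.isinf(v):
        return "Inf" if v > 0 else "-Inf"
    text = repr(v)
    return text[:-2] if text.endswith(".0") else text

def table(E, M, bias):
    """Build a pipe-separated table: one row per sign/exponent, one column per mantissa."""
    rows = [" | ".join(["bits"] + [f"… {m:0{M}b}" for m in range(1 << M)])]
    for high in range(1 << (E + 1)):              # sign bit followed by exponent bits
        sign_bit = high >> E
        exp_bits = high & ((1 << E) - 1)
        label = f"{sign_bit} {exp_bits:0{E}b} …"
        cells = [fmt(decode((high << M) | m, E, M, bias)) for m in range(1 << M)]
        rows.append(" | ".join([label] + cells))
    return "\n".join(rows)

print(table(4, 3, 7))
```

Calling `table(3, 4, 3)` produces the 1.3.4 chart in the same way.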
The graphic demonstrates the addition of even smaller (1.3.2.3)-minifloats with 6 bits. This floating-point system follows the rules of IEEE 754 exactly. A NaN operand always produces a NaN result. Inf − Inf and (−Inf) + Inf also result in NaN (green area). Adding a finite value to Inf or subtracting one from it leaves Inf unchanged. Sums with finite operands can give an infinite result (e.g. 14.0 + 3.0 = +Inf is in the cyan area, and −Inf results are in the magenta area). The range of the finite operands is filled with the curves x + y = c, where c is always one of the representable float values (blue and red for positive and negative results respectively).
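Addition in the (1.3.2.3) format can be sketched by computing the exact sum and then rounding it to the nearest representable value. The code below assumes round-to-nearest, ties-to-even, which is the IEEE 754 default but is an assumption about the system shown; all names are hypothetical.

```python
import math
from bisect import bisect_left

E, M, BIAS = 3, 2, 3

# Non-negative finite values in increasing order.
# The index parity equals the parity of the mantissa field.
FINITE = [
    (m / 4) * 2.0 ** (1 - BIAS) if e == 0 else (1 + m / 4) * 2.0 ** (e - BIAS)
    for e in range(7)            # exponent field 111 is reserved for Inf/NaN
    for m in range(4)
]
OVERFLOW = 2.0 ** (7 - BIAS)     # 16: the next value if the top exponent were not reserved
GRID = FINITE + [OVERFLOW]

def round_to_minifloat(x):
    """Round a Python float to the nearest (1.3.2.3) value, ties to even mantissa."""
    if math.isnan(x) or math.isinf(x):
        return x
    mag = abs(x)
    if mag >= OVERFLOW:
        return math.copysign(math.inf, x)
    i = bisect_left(GRID, mag)   # GRID[i-1] < mag <= GRID[i]
    if GRID[i] != mag:
        below, above = mag - GRID[i - 1], GRID[i] - mag
        if below < above or (below == above and (i - 1) % 2 == 0):
            i -= 1               # nearer neighbour, or the even one on a tie
    result = math.inf if i == len(GRID) - 1 else GRID[i]
    return math.copysign(result, x)

def add(x, y):
    """Add two minifloat values (held as Python floats) and round the result."""
    return round_to_minifloat(x + y)

print(add(14.0, 3.0))            # inf
print(add(14.0, 0.5))            # 14.0
print(add(math.inf, -math.inf))  # nan
```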
The other arithmetic operations can be illustrated similarly:
The Radeon R300 and R420 GPUs used an "fp24" floating-point format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa. [7] "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision" for vertex and pixel shader calculations performed by the graphics hardware.
Khronos defines 10-bit and 11-bit float formats for use with Vulkan. Both formats have no sign bit and a 5-bit exponent. The 10-bit format has a 5-bit mantissa, and the 11-bit format has a 6-bit mantissa. [8] [9]
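As an illustration, the sketch below decodes such unsigned patterns. The exponent bias of 15 and the use of the all-ones exponent for Inf/NaN are assumptions taken from the half-precision layout, not stated in the text above; `decode_unsigned` is a hypothetical name.

```python
import math

def decode_unsigned(bits, M, E=5, bias=15):
    """Decode an unsigned float with a 5-bit exponent and M mantissa bits."""
    exp = (bits >> M) & ((1 << E) - 1)
    man = bits & ((1 << M) - 1)
    if exp == (1 << E) - 1:                      # assumed: Inf/NaN as in IEEE formats
        return math.inf if man == 0 else math.nan
    if exp == 0:                                 # subnormal
        return (man / (1 << M)) * 2.0 ** (1 - bias)
    return (1 + man / (1 << M)) * 2.0 ** (exp - bias)

# Largest finite values under these assumptions.
print(decode_unsigned(0b11110_111111, M=6))   # 65024.0 (11-bit format)
print(decode_unsigned(0b11110_11111, M=5))    # 64512.0 (10-bit format)
print(decode_unsigned(0b01111_000000, M=6))   # 1.0
```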
IEEE SA Working Group P3109 is currently working on a standard for 8-bit minifloats optimized for machine learning. The current draft defines not one format, but a family of 7 different formats, named "binary8pP", where "P" is a number from 1 to 7. These floats are designed to be compact and efficient, but do not follow the same semantics as other IEEE floats, and are missing features such as negative zero and multiple NaN values. Infinity is defined as both the exponent and significand having all ones, unlike other IEEE floats, where the exponent is all ones and the significand is all zeroes. [10]
The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa. [11] In the table below, the columns have different values for the sign and mantissa bits, and the rows are different values for the exponent bits.
Exponent | 0 … 0 | 0 … 1 | 1 … 0 | 1 … 1 |
---|---|---|---|---|
… 00 … | 0 | 0.5 | −0 | −0.5 |
… 01 … | 1 | 1.5 | −1 | −1.5 |
… 10 … | 2 | 3 | −2 | −3 |
… 11 … | Inf | NaN | −Inf | NaN |
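The 16 values of this 4-bit format can be enumerated directly. The bias of 1 is inferred from the table (exponent bits 01 with a zero mantissa decode to 1); `decode4` is a hypothetical helper written for this layout only.

```python
import math

def decode4(bits):
    """Decode a 4-bit float: 1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1."""
    sign = -1.0 if bits & 0b1000 else 1.0
    exp = (bits >> 1) & 0b11
    man = bits & 1
    if exp == 0b11:                          # Inf or NaN
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:                             # subnormal: 0.m x 2^(1 - 1)
        return sign * man * 0.5
    return sign * (1 + man * 0.5) * 2.0 ** (exp - 1)   # normal: 1.m x 2^(exp - 1)

for b in range(16):
    print(f"{b:04b}", decode4(b))
```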
If normalized numbers are not required, the size can be reduced to 3 bits by reducing the exponent to 1 bit.
Exponent | 0 … 0 | 0 … 1 | 1 … 0 | 1 … 1 |
---|---|---|---|---|
… 0 … | 0 | 1 | −0 | −1 |
… 1 … | Inf | NaN | −Inf | NaN |
In situations where the sign bit can be excluded, each of the above examples can be reduced by 1 bit further, keeping only the left half of the above tables. A 2-bit float with 1-bit exponent and 1-bit mantissa would only have 0, 1, Inf, NaN values.
If the mantissa is allowed to be 0-bit, a 1-bit float format would have a 1-bit exponent, and the only two values would be 0 and Inf. The exponent must be at least 1 bit or else it no longer makes sense as a float (it would just be a signed number).
4-bit floating point numbers — without the four special IEEE values — have found use in accelerating large language models. [12]
Minifloats are also commonly used in embedded devices,[ citation needed ] especially on microcontrollers where floating-point will need to be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so the register boundary automatically addresses the parts without shifting.
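The idea can be illustrated with a hypothetical 16-bit layout of 1 sign bit, 7 exponent bits and 8 mantissa bits, stored as two bytes. This layout and the name `split_fields` are assumptions made for the example; Python is used only to model the two 8-bit registers.

```python
def split_fields(word):
    """Split a 16-bit pattern into (sign, exponent, mantissa) for a 1.7.8 layout."""
    # The two bytes stand in for two 8-bit registers on a microcontroller.
    high, low = word.to_bytes(2, "big")
    sign = high >> 7            # top bit of the high byte
    exponent = high & 0x7F      # remaining 7 bits of the high byte
    mantissa = low              # the low byte is the mantissa as-is
    return sign, exponent, mantissa

print(split_fields(0x3F80))     # (0, 63, 128)
```

Because the mantissa fills the low byte exactly, it can be read from its own register without masking or shifting across a byte boundary.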