The IEEE 754-2008 standard includes decimal floating-point number formats in which the significand and the exponent (and the payloads of NaNs) can be encoded in two ways, referred to as binary encoding and decimal encoding. [1]
Both formats break a number down into a sign bit $s$, an exponent $q$ (between $q_{\min}$ and $q_{\max}$), and a $p$-digit significand $c$ (between 0 and $10^p - 1$). The value encoded is $(-1)^s \times 10^q \times c$. In both formats the range of possible values is identical, but they differ in how the significand $c$ is represented. In the decimal encoding, it is encoded as a series of $p$ decimal digits (using the densely packed decimal (DPD) encoding). This makes conversion to decimal form efficient, but requires a specialized decimal ALU to process. In the binary integer decimal (BID) encoding, it is encoded as a binary number.
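As a minimal sketch of the value formula above, the following Python function evaluates the sign, exponent, and significand triple exactly using rational arithmetic. The name `decimal_value` is an assumption made for illustration; it is not taken from the standard.

```python
from fractions import Fraction


def decimal_value(s: int, q: int, c: int) -> Fraction:
    """Return (-1)**s * 10**q * c as an exact rational number."""
    return (-1) ** s * Fraction(10) ** q * c
```

For example, `decimal_value(0, -2, 12345)` gives the exact rational 12345/100, and setting `s` to 1 negates the result.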
Using the fact that $2^{10} = 1024$ is only slightly more than $10^3 = 1000$, $3n$-digit decimal numbers can be efficiently packed into $10n$ binary bits. However, the IEEE formats have significands of $3n+1$ digits, which would generally require $10n+4$ binary bits to represent.
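The bit counts quoted above can be checked with a few lines of Python. The helper name `bits_needed` is hypothetical and introduced only for this sketch; it reports how many bits the largest integer with a given number of decimal digits occupies.

```python
def bits_needed(digits: int) -> int:
    """Number of bits occupied by the largest integer with `digits` decimal digits."""
    return (10 ** digits - 1).bit_length()


# 3n digits fit in 10n bits, shown here for n = 2, 5 and 11.
three_n = {n: bits_needed(3 * n) for n in (2, 5, 11)}

# 3n + 1 digits, shown here for 7-, 16- and 34-digit significands.
three_n_plus_1 = {d: bits_needed(d) for d in (7, 16, 34)}
```

Here `three_n` maps 2, 5, and 11 to 20, 50, and 110 bits, while `three_n_plus_1` maps 7 and 16 digits to 24 and 54 bits (the $10n+4$ case). The 34-digit case needs only 113 bits, one fewer than the general bound.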
This would not be efficient, because only 10 of the 16 possible values of the additional 4 bits are needed. A more efficient encoding can be designed using the fact that the exponent range is of the form $3 \times 2^k$, so the exponent never starts with 11. Using the Decimal32 encoding (with a significand of $3 \times 2 + 1 = 7$ decimal digits) as an example (e stands for exponent, m for mantissa, i.e. significand):

If the significand starts with 0mmm, omitting the leading 0 bit lets the significand fit into 23 bits:

    s 00eeeeee (0)mmm mmmmmmmmmm mmmmmmmmmm
    s 01eeeeee (0)mmm mmmmmmmmmm mmmmmmmmmm
    s 10eeeeee (0)mmm mmmmmmmmmm mmmmmmmmmm

If the significand starts with 100m, omitting the leading 100 bits lets the significand fit into 21 bits. The exponent is shifted over 2 bits, and a 11 bit pair shows that this form is being used:

    s 1100eeeeee (100)m mmmmmmmmmm mmmmmmmmmm
    s 1101eeeeee (100)m mmmmmmmmmm mmmmmmmmmm
    s 1110eeeeee (100)m mmmmmmmmmm mmmmmmmmmm

Infinity, quiet NaN, and signaling NaN, respectively, use encodings that begin with s 1111:

    s 11110 xxxxxxxxxxxxxxxxxxxxxxxxxx
    s 111110 xxxxxxxxxxxxxxxxxxxxxxxxx
    s 111111 xxxxxxxxxxxxxxxxxxxxxxxxx
The bits shown in parentheses are implicit: they are not included in the 32 bits of the Decimal32 encoding, but are implied by the two bits after the sign bit.
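The Decimal32 bit layouts described above can be turned into a small decoder. The following is a sketch under stated assumptions, not a reference implementation: the function name `decode_bid32` is hypothetical, the bias of 101 is the Decimal32 value listed in this article's parameter table, and significands above 9999999 are assumed to decode as zero (the treatment of non-canonical significands).

```python
from decimal import Decimal

BIAS = 101               # decimal32 exponent bias
MAX_COEFF = 10**7 - 1    # largest canonical 7-digit significand


def decode_bid32(word: int) -> Decimal:
    """Decode a 32-bit BID-encoded decimal32 bit pattern."""
    sign = (word >> 31) & 1
    if ((word >> 27) & 0b1111) == 0b1111:
        # s 1111...: infinity or NaN, selected by the following bits.
        if ((word >> 26) & 1) == 0:
            return Decimal("-Infinity" if sign else "Infinity")
        return Decimal("sNaN" if (word >> 25) & 1 else "NaN")
    if ((word >> 29) & 0b11) == 0b11:
        # Second form: 8 exponent bits, then 21 stored bits after an implicit 100.
        exponent = (word >> 21) & 0xFF
        coeff = (0b100 << 21) | (word & 0x1FFFFF)
    else:
        # First form: 8 exponent bits, then 23 stored bits after an implicit 0.
        exponent = (word >> 23) & 0xFF
        coeff = word & 0x7FFFFF
    if coeff > MAX_COEFF:
        coeff = 0  # assumption: non-canonical significands decode as zero
    digits = tuple(int(d) for d in str(coeff))
    return Decimal((sign, digits, exponent - BIAS))
```

A pattern such as `(101 << 23) | 1` has a biased exponent of 101 and a significand of 1, so it decodes to the value 1. The significand 9999999 does not fit in 23 bits and therefore has to use the second form.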
The Decimal64 and Decimal128 encodings have larger exponent and significand fields, but operate in a similar fashion.
For the Decimal128 encoding, 113 bits of significand is actually enough to encode 34 decimal digits, and the second form is never actually required.
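Whether the second form is ever needed can be checked by comparing the largest significand with the capacity of the first form. This sketch assumes the digit counts and trailing significand widths listed in this article's parameter table; the names `FORMATS` and `second_form_needed` are introduced here for illustration only.

```python
# (significand digits p, trailing significand bits t) for each format
FORMATS = {"decimal32": (7, 20), "decimal64": (16, 50), "decimal128": (34, 110)}


def second_form_needed(p: int, t: int) -> bool:
    """True if some p-digit significand is too large for the first form.

    The first form holds t + 3 significand bits, so it tops out at 2**(t + 3) - 1.
    """
    return 10 ** p - 1 >= 2 ** (t + 3)


needed = {name: second_form_needed(p, t) for name, (p, t) in FORMATS.items()}
```

The result is `True` for decimal32 and decimal64 and `False` for decimal128, which matches the statement above that 113 bits are enough for 34 decimal digits.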
A decimal floating-point number can be encoded in several ways; the different ways represent different precisions. For example, 100.0 is encoded as $1000 \times 10^{-1}$, while 100.00 is encoded as $10000 \times 10^{-2}$. The set of possible encodings of the same numerical value is called a cohort in the standard. If the result of a calculation is inexact, the largest amount of significant data is preserved by selecting the cohort member with the largest integer that can be stored in the significand along with the required exponent.
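To make the idea of a cohort concrete, the sketch below lists the (significand, exponent) pairs that represent one nonzero value in a 7-digit format. The function name `cohort` and its default limits are assumptions for illustration: the exponent range of -101 to 90 is derived from the Decimal32 bias of 101 and a biased exponent range of 0 to 191. The sign is ignored, and Python's `decimal` module is used only to parse and normalize the input.

```python
from decimal import Decimal


def cohort(value: str, p: int = 7, qmin: int = -101, qmax: int = 90):
    """List (significand, exponent) pairs encoding the magnitude of a nonzero value."""
    _, digits, exp = Decimal(value).normalize().as_tuple()
    c = int("".join(map(str, digits)))   # significand with trailing zeros removed
    members = []
    while c < 10 ** p and exp >= qmin:
        if exp <= qmax:
            members.append((c, exp))
        c *= 10      # append one trailing zero to the significand ...
        exp -= 1     # ... and compensate in the exponent
    return members
```

For the value 100 this yields seven members, from `(1, 2)` to `(1000000, -4)`, including the `(1000, -1)` and `(10000, -2)` encodings mentioned above. The last member has the largest significand, which is the one selected for an inexact result.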
The proposed IEEE 754r standard limits the significand to a maximum of the form $10^n - 1$, where $n$ is the number of whole decimal digits that can be stored in the bits available, so that decimal rounding is effected correctly.
Parameter | 32 bit | 64 bit | 128 bit |
---|---|---|---|
Storage bits | 32 | 64 | 128 |
Trailing Significand bits | 20 | 50 | 110 |
Significand bits | 23/24 | 53/54 | 113 |
Significand digits | 7 | 16 | 34 |
Combination bits | 11 | 13 | 17 |
Exponent bits | 8 | 10 | 14 |
Bias | 101 | 398 | 6176 |
Standard emax | 96 | 384 | 6144 |
Standard emin | −95 | −383 | −6143 |
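
The rows of the table are related by simple arithmetic, which the following sketch checks. The relations used (bias = emax + p - 2, emin = 1 - emax, and an exponent range of $3 \times 2^k$ values) are stated here as assumptions about how the parameters fit together, and the names `FORMATS` and `consistent` are hypothetical.

```python
FORMATS = {
    "decimal32": dict(storage=32, trailing=20, digits=7, combination=11,
                      exp_bits=8, bias=101, emax=96, emin=-95),
    "decimal64": dict(storage=64, trailing=50, digits=16, combination=13,
                      exp_bits=10, bias=398, emax=384, emin=-383),
    "decimal128": dict(storage=128, trailing=110, digits=34, combination=17,
                       exp_bits=14, bias=6176, emax=6144, emin=-6143),
}


def consistent(f: dict) -> bool:
    """Check the arithmetic relations between one column's table entries."""
    return all([
        f["storage"] == 1 + f["combination"] + f["trailing"],  # sign + fields
        f["digits"] == 3 * f["trailing"] // 10 + 1,             # 3n + 1 digits
        f["exp_bits"] == f["combination"] - 3,
        f["emin"] == 1 - f["emax"],
        f["bias"] == f["emax"] + f["digits"] - 2,
        f["emax"] - f["emin"] + 1 == 3 * 2 ** (f["exp_bits"] - 2),
    ])


results = {name: consistent(f) for name, f in FORMATS.items()}
```

All three columns pass these checks, so `results` maps every format name to `True`.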
A binary encoding is inherently less efficient for conversions to or from decimal-encoded data, such as strings (ASCII, Unicode, etc.) and BCD. A binary encoding is therefore best chosen only when the data are binary rather than decimal. IBM has published some unverified performance data. [2]