Decimal floating point

Last updated January 14, 2025

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions.
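As a brief illustration of the conversion issue, the following sketch compares Python's built-in binary float type with its standard decimal module. The values 0.1, 0.2 and 0.3 are chosen only for illustration.

```python
from decimal import Decimal

# Binary floating point: 0.1, 0.2 and 0.3 have no exact base-2 representation,
# so the rounded sum does not match the rounded constant.
binary_sum_is_exact = (0.1 + 0.2 == 0.3)

# Decimal floating point: the same decimal fractions are stored exactly.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(binary_sum_is_exact)   # False
print(decimal_sum_is_exact)  # True
```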

The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values. For example, while a fixed-point representation that allocates 8 decimal digits and 2 decimal places can represent the numbers 123456.78, 8765.43, 123.00, and so on, a floating-point representation with 8 decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. This wider range can dramatically slow the accumulation of rounding errors during successive calculations; for example, the Kahan summation algorithm can be used in floating point to add many numbers with no asymptotic accumulation of rounding error.
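The contrast between the two representations can be sketched with Python's decimal module. The sketch assumes an 8-digit context as a stand-in for the 8-digit floating-point format, and models the fixed-point format only by its 2 decimal places (the 8-digit total width is not enforced).

```python
from decimal import Context, Decimal

# Floating point with 8 significant digits: each sample value fits exactly.
float8 = Context(prec=8)
samples = ["1.2345678", "1234567.8", "0.000012345678", "12345678000000000"]
floating = [float8.create_decimal(s) for s in samples]

# Fixed point with 2 decimal places: small values lose most of their digits.
cent = Decimal("0.01")
fixed = [Decimal(s).quantize(cent) for s in samples[:3]]

for s, f in zip(samples, floating):
    print(s, "->", f)        # floating point keeps all 8 digits
for s, f in zip(samples, fixed):
    print(s, "->", f)        # fixed point keeps only 2 decimal places
```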

Implementations

Early mechanical uses of decimal floating point are evident in the abacus, slide rule, the Smallwood calculator, and some other calculators that support entries in scientific notation. In the case of the mechanical calculators, the exponent is often treated as side information that is accounted for separately.

The IBM 650 computer supported an 8-digit decimal floating-point format in 1953.^[1] The otherwise binary Wang VS machine supported a 64-bit decimal floating-point format in 1977.^[2] The Motorola 68881 supported a format with 17 digits of mantissa and 3 of exponent in 1984, with the floating-point support library for the Motorola 68040 processor providing a compatible 96-bit decimal floating-point storage format in 1990.^[2]

Some computer languages have implementations of decimal floating-point arithmetic, including PL/I, .NET,^[3] Emacs with Calc, and Python's decimal module.^[4] In 1987, the IEEE released IEEE 854, a standard for computing with decimal floating point, which lacked a specification for how floating-point data should be encoded for interchange with other systems. This was subsequently addressed in IEEE 754-2008, which standardized the encoding of decimal floating-point data, albeit with two alternative methods.

IBM POWER6 and newer POWER processors include DFP in hardware, as does the IBM System z9^[5] (and later zSeries machines). SilMinds offers SilAx, a configurable vector DFP coprocessor.^[6] Fujitsu also has 64-bit SPARC processors with DFP in hardware.^[7]^[2] IEEE 754-2008 defines decimal floating point in more detail.

IEEE 754-2008 encoding

The IEEE 754-2008 standard defines 32-, 64- and 128-bit decimal floating-point representations. As in the binary floating-point formats, a number is divided into a sign, an exponent, and a significand. Unlike binary floating-point numbers, decimal numbers are not necessarily normalized; values with few significant digits have multiple possible representations: 1×10² = 0.1×10³ = 0.01×10⁴, etc. When the significand is zero, the exponent can be any value at all.
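The existence of multiple representations for one value can be observed with Python's decimal module, which also keeps unnormalized numbers. Note that this module stores an integer coefficient, so the article's 0.1×10³ appears there as 10×10¹; the sketch is an illustration of the idea, not of the IEEE bit encodings.

```python
from decimal import Decimal

a = Decimal("1E2")    # coefficient 1,   exponent 2
b = Decimal("10E1")   # coefficient 10,  exponent 1
c = Decimal("100")    # coefficient 100, exponent 0

same_value = (a == b == c)                      # numerically equal
encodings = [x.as_tuple() for x in (a, b, c)]   # but stored differently

# A zero significand may carry any exponent.
zeros_equal = (Decimal("0E5") == Decimal("0E-3"))

for e in encodings:
    print(e)
```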

IEEE 754-2008 decimal floating-point formats
Format	decimal32	decimal64	decimal128	decimal(32k)
Sign field (bits)	1	1	1	1
Combination field (bits)	5	5	5	5
Exponent continuation field (bits)	6	8	12	w = 2×k + 4
Coefficient continuation field (bits)	20	50	110	t = 30×k − 10
Total size (bits)	32	64	128	32×k
Coefficient size (decimal digits)	7	16	34	p = 3×t/10 + 1 = 9×k − 2
Exponent range (number of exponent values)	192	768	12288	3×2^w = 48×4^k
Largest value is 9.99...×10^Emax	96	384	6144	Emax = 3×2^(w−1)
Smallest normalized value is 1.00...×10^Emin	−95	−383	−6143	Emin = 1 − Emax
Smallest non-zero value is 1×10^Etiny	−101	−398	−6176	Etiny = 2 − p − Emax

The exponent ranges were chosen so that the range available to normalized values is approximately symmetrical. Since this cannot be done exactly with an even number of possible exponent values, the extra value was given to Emax.
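The general decimal(32k) formulas in the table can be checked against the three concrete formats with a short sketch. The function name and dictionary keys are hypothetical; the formulas are those of the table, with Emax read as 3×2^(w−1).

```python
def decimal_format_parameters(k):
    """Parameters of the decimal(32k) interchange format, per the table's formulas."""
    w = 2 * k + 4            # exponent continuation field (bits)
    t = 30 * k - 10          # coefficient continuation field (bits)
    p = 3 * t // 10 + 1      # coefficient size (decimal digits)
    emax = 3 * 2 ** (w - 1)
    return {
        "total_bits": 1 + 5 + w + t,   # sign + combination + continuations
        "w": w,
        "t": t,
        "p": p,
        "exponent_values": 3 * 2 ** w,
        "emax": emax,
        "emin": 1 - emax,
        "etiny": 2 - p - emax,
    }

for k in (1, 2, 4):          # decimal32, decimal64, decimal128
    print(decimal_format_parameters(k))
```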

Two different representations are defined:

One with a binary integer significand field encodes the significand as a large binary integer between 0 and 10^p−1. This is expected to be more convenient for software implementations using a binary ALU.
Another with a densely packed decimal significand field encodes decimal digits more directly. This makes conversion to and from binary floating-point form faster, but requires specialized hardware to manipulate efficiently. This is expected to be more convenient for hardware implementations.

Both alternatives provide exactly the same range of representable values.

The most significant two bits of the exponent are limited to the range of 0 to 2, and the most significant 4 bits of the significand are limited to the range of 0 to 9. The 30 possible combinations are encoded in a 5-bit field, along with special forms for infinity and NaN.

If the most significant 4 bits of the significand are between 0 and 7, the encoded value begins as follows:

s 00mmm xxx   Exponent begins with 00, significand with 0mmm
s 01mmm xxx   Exponent begins with 01, significand with 0mmm
s 10mmm xxx   Exponent begins with 10, significand with 0mmm

If the leading 4 bits of the significand are binary 1000 or 1001 (decimal 8 or 9), the number begins as follows:

s 1100m xxx   Exponent begins with 00, significand with 100m
s 1101m xxx   Exponent begins with 01, significand with 100m
s 1110m xxx   Exponent begins with 10, significand with 100m

The leading bit (s in the above) is a sign bit, and the following bits (xxx in the above) encode the additional exponent bits and the remainder of the most significant digit, but the details vary depending on the encoding alternative used.

The final combinations are used for infinities and NaNs, and are the same for both alternative encodings:

s 11110 x   ±Infinity (see Extended real number line)
s 11111 0   quiet NaN (sign bit ignored)
s 11111 1   signaling NaN (sign bit ignored)

In the latter cases, all other bits of the encoding are ignored. Thus, it is possible to initialize an array to NaNs by filling it with a single byte value.
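The rules for the 5-bit field can be summarized in a small sketch. The function name and the returned tuples are hypothetical, and the fill byte 0x7C is merely one example value whose repetition yields a quiet NaN pattern in a 32-bit word.

```python
def decode_combination(field):
    """Interpret the 5-bit field that follows the sign bit.

    Returns ("finite", exponent_msbs, significand_msbs), ("infinity",) or ("nan",).
    """
    if not 0 <= field < 32:
        raise ValueError("combination field must fit in 5 bits")
    if field == 0b11110:
        return ("infinity",)
    if field == 0b11111:
        return ("nan",)
    if field >> 3 == 0b11:
        # 11eem: exponent begins with ee, significand with 100m (8 or 9)
        return ("finite", (field >> 1) & 0b11, 0b1000 | (field & 1))
    # eemmm: exponent begins with ee, significand with 0mmm (0 to 7)
    return ("finite", field >> 3, field & 0b111)

# All finite combinations: 3 exponent prefixes times 10 significand prefixes.
finite = [decode_combination(f) for f in range(32)
          if decode_combination(f)[0] == "finite"]

# Filling a 32-bit word with the single byte 0x7C gives sign 0, field 11111.
word = int.from_bytes(bytes([0x7C]) * 4, "big")
nan_fill = decode_combination((word >> 26) & 0x1F)
```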

Binary integer significand field

This format uses a binary significand from 0 to 10^p−1. For example, the Decimal32 significand can be up to 10⁷−1 = 9999999 = 98967F₁₆ = 100110001001011001111111₂. While the encoding can represent larger significands, they are illegal and the standard requires implementations to treat them as 0, if encountered on input.

As described above, the encoding varies depending on whether the most significant 4 bits of the significand are in the range 0 to 7 (0000₂ to 0111₂), or higher (1000₂ or 1001₂).

If the 2 bits after the sign bit are "00", "01", or "10", then the exponent field consists of the 8 bits following the sign bit (the 2 bits mentioned plus 6 bits of "exponent continuation field"), and the significand is the remaining 23 bits, with an implicit leading 0 bit, shown here in parentheses:

 s 00eeeeee   (0)ttt tttttttttt tttttttttt
 s 01eeeeee   (0)ttt tttttttttt tttttttttt
 s 10eeeeee   (0)ttt tttttttttt tttttttttt

This includes subnormal numbers where the leading significand digit is 0.

If the 2 bits after the sign bit are "11", then the 8-bit exponent field is shifted 2 bits to the right (after both the sign bit and the "11" bits thereafter), and the represented significand is in the remaining 21 bits. In this case there is an implicit (that is, not stored) leading 3-bit sequence "100" in the true significand:

 s 1100eeeeee (100)t tttttttttt tttttttttt
 s 1101eeeeee (100)t tttttttttt tttttttttt
 s 1110eeeeee (100)t tttttttttt tttttttttt

The "11" 2-bit sequence after the sign bit indicates that there is an implicit "100" 3-bit prefix to the significand.

Note that the leading bits of the significand field do not encode the most significant decimal digit; they are simply part of a larger pure-binary number. For example, a significand of 8000000 is encoded as binary 011110100001001000000000, with the leading 4 bits encoding 7; the first significand which requires a 24th bit (and thus the second encoding form) is 2²³ = 8388608.

In the above cases, the value represented is:

(−1)^sign × 10^{exponent−101} × significand
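The two encoding forms and the value formula can be put together in a minimal decoding sketch for the 32-bit format. The function and constant names are hypothetical, infinities and NaNs are deliberately left out, and the result is returned as a Python Decimal for readability.

```python
from decimal import Decimal

BIAS = 101               # exponent bias from the value formula
MAX_COEFF = 10**7 - 1    # largest valid decimal32 significand

def decode_decimal32_bid(word):
    """Decode a finite 32-bit word with a binary integer significand field."""
    sign = (word >> 31) & 1
    if (word >> 26) & 0x1F >= 0b11110:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if (word >> 29) & 0b11 != 0b11:
        # First form: 8 exponent bits, then 23 significand bits (implicit leading 0).
        biased = (word >> 23) & 0xFF
        coeff = word & 0x7FFFFF
    else:
        # Second form: "11", 8 exponent bits, then 21 bits after an implicit "100".
        biased = (word >> 21) & 0xFF
        coeff = (0b100 << 21) | (word & 0x1FFFFF)
    if coeff > MAX_COEFF:
        coeff = 0        # out-of-range significands are treated as zero
    digits = tuple(int(ch) for ch in str(coeff))
    return Decimal((sign, digits, biased - BIAS))

print(decode_decimal32_bid(0x32800001))   # 1
print(decode_decimal32_bid(0x6CB8967F))   # 9999999
```

The second word uses the second form, because 9999999 needs 24 bits; 8000000 still fits the first form, as the test value 0x32FA1200 shows.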

Decimal64 and Decimal128 operate analogously, but with larger exponent continuation and significand fields. For Decimal128, the second encoding form is actually never used; the largest valid significand of 10³⁴−1 = 1ED09BEAD87C0378D8E63FFFFFFFF₁₆ can be represented in 113 bits.
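The claim about Decimal128 can be checked by comparing the bit length of the largest significand with the number of significand bits stored in the first encoding form, which is assumed here to be t + 3 (the coefficient continuation field plus the three bits taken from the combination field, 23 bits for Decimal32).

```python
# (format name, digits p, coefficient continuation bits t)
formats = (("decimal32", 7, 20), ("decimal64", 16, 50), ("decimal128", 34, 110))

needs_second_form = {}
for name, p, t in formats:
    needed_bits = (10**p - 1).bit_length()   # bits of the largest significand
    first_form_bits = t + 3                  # bits available in the first form
    needs_second_form[name] = needed_bits > first_form_bits
    print(name, needed_bits, first_form_bits)
```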

Densely packed decimal significand field

In this version, the significand is stored as a series of decimal digits. The leading digit is between 0 and 9 (3 or 4 binary bits), and the rest of the significand uses the densely packed decimal (DPD) encoding.

The leading 2 bits of the exponent and the leading digit (3 or 4 bits) of the significand are combined into the five bits that follow the sign bit. This is followed by a fixed-offset exponent continuation field.

Finally, the significand continuation field is made of 2, 5, or 11 10-bit declets, each encoding 3 decimal digits.^[8]

If the first two bits after the sign bit are "00", "01", or "10", then those are the leading bits of the exponent, and the three bits after that are interpreted as the leading decimal digit (0 to 7):^[9]

   Comb.  Exponent   Significand
 s 00 TTT (00)eeeeee (0TTT)[tttttttttt][tttttttttt]
 s 01 TTT (01)eeeeee (0TTT)[tttttttttt][tttttttttt]
 s 10 TTT (10)eeeeee (0TTT)[tttttttttt][tttttttttt]

If the first two bits after the sign bit are "11", then the second two bits are the leading bits of the exponent, and the last bit is prefixed with "100" to form the leading decimal digit (8 or 9):

   Comb.  Exponent   Significand
 s 1100 T (00)eeeeee (100T)[tttttttttt][tttttttttt]
 s 1101 T (01)eeeeee (100T)[tttttttttt][tttttttttt]
 s 1110 T (10)eeeeee (100T)[tttttttttt][tttttttttt]

The remaining two combinations (11110 and 11111) of the 5-bit field are used to represent ±infinity and NaNs, respectively.
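A decoding sketch for the 32-bit format with a densely packed decimal significand is given below. The declet rules are not spelled out in this article; the sketch assumes the standard DPD bit assignment and the same exponent bias of 101 as in the binary integer variant. All function names are hypothetical, and infinities and NaNs are not handled.

```python
from decimal import Decimal

def decode_declet(declet):
    """Decode one 10-bit DPD declet into an integer from 0 to 999."""
    b = [(declet >> i) & 1 for i in range(10)]   # b[0] is the least significant bit

    def small(x, y, z):    # digit 0 to 7 from three bits
        return 4 * x + 2 * y + z

    def large(z):          # digit 8 or 9 from one bit
        return 8 + z

    if b[3] == 0:                        # three small digits
        d = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), small(b[2], b[1], b[0])
    elif (b[2], b[1]) == (0, 0):         # only the last digit is large
        d = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), large(b[0])
    elif (b[2], b[1]) == (0, 1):         # only the middle digit is large
        d = small(b[9], b[8], b[7]), large(b[4]), small(b[6], b[5], b[0])
    elif (b[2], b[1]) == (1, 0):         # only the first digit is large
        d = large(b[7]), small(b[6], b[5], b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 0):         # first and middle digits are large
        d = large(b[7]), large(b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 1):         # first and last digits are large
        d = large(b[7]), small(b[9], b[8], b[4]), large(b[0])
    elif (b[6], b[5]) == (1, 0):         # middle and last digits are large
        d = small(b[9], b[8], b[7]), large(b[4]), large(b[0])
    else:                                # all three digits are large
        d = large(b[7]), large(b[4]), large(b[0])
    return 100 * d[0] + 10 * d[1] + d[2]

def decode_decimal32_dpd(word):
    """Decode a finite 32-bit word with a densely packed decimal significand."""
    sign = (word >> 31) & 1
    comb = (word >> 26) & 0x1F
    if comb >= 0b11110:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if comb >> 3 == 0b11:                # 11eeT: leading digit 8 or 9
        exp_msbs, lead = (comb >> 1) & 0b11, 8 + (comb & 1)
    else:                                # eeTTT: leading digit 0 to 7
        exp_msbs, lead = comb >> 3, comb & 0b111
    biased = (exp_msbs << 6) | ((word >> 20) & 0x3F)
    coeff = (lead * 10**6
             + decode_declet((word >> 10) & 0x3FF) * 1000
             + decode_declet(word & 0x3FF))
    digits = tuple(int(ch) for ch in str(coeff))
    return Decimal((sign, digits, biased - 101))

print(decode_decimal32_dpd(0x22500001))   # 1
```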

Floating-point arithmetic operations

The usual rule for performing floating-point arithmetic is that the exact mathematical value is calculated,^[10] and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, under normal rounding behavior and in the absence of exceptional conditions.

For ease of presentation and understanding, 7-digit precision will be used in the examples. The fundamental principles are the same in any precision.

Addition

A simple method to add floating-point numbers is to first represent them with the same exponent. In the decimal (base-10) example below, the second number is shifted right by 3 digits, after which the usual addition method applies:

  123456.7 = 1.234567 × 10⁵
  101.7654 = 1.017654 × 10² = 0.001017654 × 10⁵

Hence:

  123456.7 + 101.7654 = (1.234567 × 10⁵) + (1.017654 × 10²)
                      = (1.234567 × 10⁵) + (0.001017654 × 10⁵)
                      = 10⁵ × (1.234567 + 0.001017654)
                      = 10⁵ × 1.235584654

This is nothing other than converting to scientific notation. In detail:

  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)

This is the true result, the exact sum of the operands. It will be rounded to 7 digits and then normalized if necessary. The final result is:

  e=5;  s=1.235585    (final sum: 123558.5)
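The same addition can be reproduced with Python's decimal module. The sketch assumes a context with 7 digits of precision and round-half-even rounding as a stand-in for the 7-digit format used in these examples.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx7 = Context(prec=7, rounding=ROUND_HALF_EVEN)

a = Decimal("123456.7")
b = Decimal("101.7654")

true_sum = a + b              # default 28-digit context: exact for these operands
final_sum = ctx7.add(a, b)    # exact sum rounded to 7 significant digits

print(true_sum)    # 123558.4654
print(final_sum)   # 123558.5
```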

Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

  e=5;  s=1.234567
+ e=−3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding/normalization)
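This absorption of the smaller operand can be checked in the same way, again assuming a 7-digit context in Python's decimal module as a model of the example format.

```python
from decimal import Context, Decimal

ctx7 = Context(prec=7)

big = Decimal("123456.7")         # e=5;  s=1.234567
small = Decimal("0.009876543")    # e=-3; s=9.876543

true_sum = big + small            # exact in the default 28-digit context
rounded_sum = ctx7.add(big, small)

absorbed = (rounded_sum == big)   # the non-zero addend has no effect
print(true_sum, rounded_sum, absorbed)
```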

Another problem of loss of significance occurs when approximations to two nearly equal numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.

  e=5;  s=1.234571
− e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=−1; s=4.000000 (after rounding and normalization)

The floating-point difference is computed exactly because the numbers are close; the Sterbenz lemma guarantees this, even in case of underflow when gradual underflow is supported. Despite this, the difference of the original numbers is e = −1; s = 4.877000, which differs by more than 20% from the difference e = −1; s = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost.^[11]^[12] This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
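The cancellation example can be reproduced numerically. The sketch assumes a 7-digit context in Python's decimal module; the relative error is measured against the difference of the approximations, which is one possible convention.

```python
from decimal import Context, Decimal

ctx7 = Context(prec=7)

x_true = Decimal("123457.1467")
y_true = Decimal("123456.659")

# Rounding the inputs to 7 digits gives the approximations used in the text.
x = ctx7.create_decimal(x_true)    # 123457.1
y = ctx7.create_decimal(y_true)    # 123456.7

diff_approx = ctx7.subtract(x, y)  # exact difference of the approximations
diff_true = x_true - y_true        # difference of the original numbers

relative_gap = abs(diff_true - diff_approx) / diff_approx
print(diff_approx, diff_true, relative_gap)
```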

Multiplication

To multiply, the significands are multiplied, while the exponents are added, and the result is rounded and normalized.

  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)
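The multiplication example can likewise be checked with Python's decimal module, assuming a 7-digit context as a model of the example format. The module performs rounding and normalization of the result in a single step.

```python
from decimal import Context, Decimal

ctx7 = Context(prec=7)

a = Decimal("4734.612")     # e=3; s=4.734612
b = Decimal("541724.2")     # e=5; s=5.417242

true_product = a * b                   # exact in the default 28-digit context
final_product = ctx7.multiply(a, b)    # rounded to 7 significant digits

print(true_product)     # 2564853898.0104
print(final_product)    # 2.564854E+9
```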

Division is performed similarly, but is more complicated.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex.

References

[1] Beebe, Nelson H. F. (2017-08-22). "Chapter H. Historical floating-point architectures". The Mathematical-Function Computation Handbook - Programming Using the MathCW Portable Software Library (1 ed.). Salt Lake City, UT, USA: Springer International Publishing AG. p. 948. doi:10.1007/978-3-319-64110-2. ISBN 978-3-319-64109-6. LCCN 2017947446. S2CID 30244721.
[2] Savard, John J. G. (2018) [2007]. "The Decimal Floating-Point Standard". quadibloc. Archived from the original on 2018-07-03. Retrieved 2018-07-16.
[3] ".NET API Documentation for System.Decimal". learn.microsoft.com. Retrieved 2024-07-07.
[4] "Python Documentation for decimal". docs.python.org. Retrieved 2024-07-07.
[5] "IBM z9 EC and z9 BC — Delivering greater value for everyone" (PDF). 306.ibm.com. Retrieved 2018-07-07.
[6] "Arithmetic IPs for Financial Applications - SilMinds". Silminds.com.
[7] "Chapter 4. Data Formats". Sparc64 X/X+ Specification. Nakahara-ku, Kawasaki, Japan. January 2015. p. 13.
[8] Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1 ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN 978-0-8176-4704-9. LCCN 2009939668.
[9] Decimal Encoding Specification, version 1.00, from IBM.
[10] Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.
[11] Goldberg, David (March 1991). "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (PDF). ACM Computing Surveys. 23 (1): 5–48. doi:10.1145/103162.103163. S2CID 222008826. Retrieved 2016-01-20.
[12] US patent 3037701A, Huberto M Sierra, "Floating decimal point arithmetic control means for calculator", issued 1962-06-05.