Octuple-precision floating-point format

Last updated

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes (256 bits) in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely (if ever) used and very few environments support it.

Contents

IEEE 754 octuple-precision binary floating-point format: binary256

In its 2008 revision, the IEEE 754 standard specifies a binary256 format among the interchange formats (it is not a basic format), as having:

The format is written with an implicit lead bit with value 1 unless the exponent is all zeros. Thus only 236 bits of the significand appear in the memory format, but the total precision is 237 bits (approximately 71 decimal digits: log10(2237) ≈ 71.344). The bits are laid out as follows:

Octuple precision visual demonstration.svg

Exponent encoding

The octuple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 262143; also known as exponent bias in the IEEE 754 standard.

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 262143 has to be subtracted from the stored exponent.

The stored exponents 0000016 and 7FFFF16 are interpreted specially.

ExponentSignificand zeroSignificand non-zeroEquation
0000016 0, −0 subnormal numbers (-1)signbit × 2−262142 × 0.significandbits2
0000116, ..., 7FFFE16normalized value(-1)signbit × 2exponent bits2 × 1.significandbits2
7FFFF16± NaN (quiet, signalling)

The minimum strictly positive (subnormal) value is 2−262378 ≈ 10−78984 and has a precision of only one bit. The minimum positive normal value is 2−262142 ≈ 2.4824 × 10−78913. The maximum representable value is 2262144 − 2261907 ≈ 1.6113 × 1078913.

Octuple-precision examples

These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000016 = +0 8000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000016 = −0
7fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000016 = +infinity ffff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000016 = −infinity
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000116 = 2−262142 × 2−236 = 2−262378 ≈ 2.24800708647703657297018614776265182597360918266100276294348974547709294462 × 10−78984   (smallest positive subnormal number)
0000 0fff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff16 = 2−262142 × (1 − 2−236) ≈ 2.4824279514643497882993282229138717236776877060796468692709532979137875392 × 10−78913   (largest subnormal number)
0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000016 = 2−262142 ≈ 2.48242795146434978829932822291387172367768770607964686927095329791378756168 × 10−78913   (smallest positive normal number)
7fff efff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff16 = 2262143 × (2 − 2−236) ≈ 1.61132571748576047361957211845200501064402387454966951747637125049607182699 × 1078913   (largest normal number)
3fff efff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff16 = 1 − 2−237 ≈ 0.999999999999999999999999999999999999999999999999999999999999999999999995472   (largest number less than one)
3fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000016 = 1 (one)
3fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000116 = 1 + 2−236 ≈ 1.00000000000000000000000000000000000000000000000000000000000000000000000906   (smallest number larger than one)

By default, 1/3 rounds down like double precision, because of the odd number of bits in the significand. So the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.

Implementations

Octuple precision is rarely implemented since usage of it is extremely rare. Apple Inc. had an implementation of addition, subtraction and multiplication of octuple-precision numbers with a 224-bit two's complement significand and a 32-bit exponent. [1] One can use general arbitrary-precision arithmetic libraries to obtain octuple (or higher) precision, but specialized octuple-precision implementations may achieve higher performance.

Hardware support

There is no known hardware implementation of octuple precision.

See also

Related Research Articles

<span class="mw-page-title-main">Floating-point arithmetic</span> Computer approximation for real numbers

In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be represented as a base-ten floating-point number:

In computing, NaN, standing for Not a Number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.

Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computer science, subnormal numbers are the subset of denormalized numbers that fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest normal number is subnormal.

The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

The significand is part of a number in scientific notation or in floating-point representation, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fraction.

Hexadecimal floating point is a format for encoding floating-point numbers first introduced on the IBM System/360 computers, and supported on subsequent machines based on that architecture, as well as machines which were intended to be application-compatible with System/360.

In computing, a normal number is a non-zero number in a floating-point representation which is within the balanced range supported by a given floating-point format: it is a floating point number that can be represented without leading zeros in its significand.

In IEEE 754 floating-point numbers, the exponent is biased in the engineering sense of the word – the value stored is offset from the actual value by the exponent bias, also called a biased exponent. Biasing is done because exponents have to be signed values in order to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder.

In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.

The IEEE 754-2008 standard includes decimal floating-point number formats in which the significand and the exponent can be encoded in two ways, referred to as binary encoding and decimal encoding.

IEEE 754-2008 was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 floating-point standard, while in 2019 it was updated with a minor revision IEEE 754-2019. The 2008 revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854.

In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

In computing, quadruple precision is a binary floating point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.

Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

In computing, decimal128 is a decimal floating-point computer numbering format that occupies 16 bytes (128 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.

References

  1. Crandall, Richard E.; Papadopoulos, Jason S. (2002-05-08). "Octuple-precision floating point on Apple G4 (archived copy on web.archive.org)" (PDF). Archived from the original on 2006-07-28.{{cite web}}: CS1 maint: unfit URL (link) (8 pages)

Further reading