
In computing, **octuple precision** is a binary floating-point computer number format that occupies 32 bytes (256 bits) in computer memory. It is intended for applications that need more precision than quadruple precision provides. The format is rarely, if ever, used, and very few environments support it.

In its 2008 revision, the IEEE 754 standard specifies a **binary256** format among the *interchange formats* (it is not a basic format), as having:

- Sign bit: 1 bit
- Exponent width: 19 bits
- Significand precision: 237 bits (236 explicitly stored)

The format is written with an implicit leading bit with value 1 unless the exponent field is all zeros. Thus only 236 bits of the significand appear in the memory format, but the total precision is 237 bits (approximately 71 decimal digits: log_{10}(2^{237}) ≈ 71.344). From most to least significant, the 256 bits are the sign bit (bit 255), the 19 exponent bits (bits 254 to 236), and the 236 stored significand bits (bits 235 to 0).
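As a minimal sketch of the layout described above, the following Python helpers split a 256-bit pattern (held in an ordinary Python integer) into its three fields and pack it again. The names `split_fields` and `join_fields` are illustrative only and do not come from any library.

```python
# Field widths as listed above for binary256 (1 + 19 + 236 = 256 bits).
EXPONENT_BITS = 19
FRACTION_BITS = 236  # explicitly stored significand bits


def split_fields(bits):
    """Split a 256-bit pattern into (sign, biased exponent, stored fraction)."""
    if not 0 <= bits < (1 << 256):
        raise ValueError("pattern must fit in 256 bits")
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    exponent = (bits >> FRACTION_BITS) & ((1 << EXPONENT_BITS) - 1)
    sign = bits >> (FRACTION_BITS + EXPONENT_BITS)
    return sign, exponent, fraction


def join_fields(sign, exponent, fraction):
    """Inverse of split_fields: pack the three fields into one integer."""
    return (sign << 255) | (exponent << FRACTION_BITS) | fraction
```

Because the sign and exponent together take 20 bits, they occupy exactly the first five hexadecimal digits of a pattern, which makes hexadecimal dumps of the format easy to read by eye.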

The octuple-precision binary floating-point exponent is encoded using an offset binary representation with a zero offset of 262143, known as the exponent bias in the IEEE 754 standard.

- E_{min} = −262142
- E_{max} = 262143
- Exponent bias = 3FFFF_{16} = 262143

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 262143 has to be subtracted from the stored exponent.

The stored exponents 00000_{16} and 7FFFF_{16} are interpreted specially.

Exponent | Significand zero | Significand non-zero | Equation |
---|---|---|---|
00000_{16} | 0, −0 | subnormal numbers | (−1)^{signbit} × 2^{−262142} × 0.significandbits_{2} |
00001_{16}, ..., 7FFFE_{16} | normalized value | normalized value | (−1)^{signbit} × 2^{exponentbits_{2} − 262143} × 1.significandbits_{2} |
7FFFF_{16} | ±∞ | NaN (quiet, signalling) | |
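The interpretation rules in the table can be sketched as a small decoder. This is an illustrative sketch, not a reference implementation: the function name `decode` and its return shape `(kind, sign, magnitude)` are assumptions made for this example, and exact rational arithmetic from Python's standard `fractions` module stands in for real floating-point hardware.

```python
from fractions import Fraction

BIAS = 262143            # 0x3FFFF, the exponent bias
FRACTION_BITS = 236      # explicitly stored significand bits
EXP_ALL_ONES = 0x7FFFF   # reserved for infinities and NaNs


def scale_by_pow2(n, e):
    """Return n * 2**e exactly as a Fraction."""
    return Fraction(n << e) if e >= 0 else Fraction(n, 1 << -e)


def decode(bits):
    """Classify a 256-bit pattern and return (kind, sign, magnitude).

    magnitude is an exact Fraction for finite values and None otherwise.
    """
    sign = bits >> 255
    exponent = (bits >> FRACTION_BITS) & EXP_ALL_ONES
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    if exponent == EXP_ALL_ONES:
        return ("infinity" if fraction == 0 else "nan", sign, None)
    if exponent == 0:
        # No implicit leading 1; the exponent is fixed at E_min = 1 - BIAS.
        kind = "zero" if fraction == 0 else "subnormal"
        return (kind, sign, scale_by_pow2(fraction, 1 - BIAS - FRACTION_BITS))
    # Normal number: restore the implicit leading 1 and remove the bias.
    significand = (1 << FRACTION_BITS) | fraction
    return ("normal", sign,
            scale_by_pow2(significand, exponent - BIAS - FRACTION_BITS))
```

For instance, `decode(0x3FFFF << 236)` classifies the pattern as a normal number with magnitude exactly 1, since the stored exponent equals the bias and the stored fraction is zero. The sign is returned separately because a `Fraction` cannot distinguish +0 from −0.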

The minimum strictly positive (subnormal) value is 2^{−262378} ≈ 10^{−78984} and has a precision of only one bit. The minimum positive normal value is 2^{−262142} ≈ 2.4824 × 10^{−78913}. The maximum representable value is 2^{262144} − 2^{261907} ≈ 1.6113 × 10^{78913}.
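The decimal magnitudes quoted above can be cross-checked with a few lines of Python. The helper `pow2_scientific` is a hypothetical name used only for this sketch; it works on logarithms in ordinary double precision, so it is only good for a handful of leading digits.

```python
import math


def pow2_scientific(exponent, digits=4):
    """Approximate 2**exponent as (mantissa, decimal_exponent).

    Uses double-precision logarithms, so only a few leading digits
    of the mantissa are trustworthy.
    """
    log10_value = exponent * math.log10(2)
    decimal_exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - decimal_exponent)
    return round(mantissa, digits), decimal_exponent


# Exact integer form of the largest finite value: a 237-bit significand
# of all ones, scaled by 2**(E_max - 236).
largest_finite = ((1 << 237) - 1) << (262143 - 236)

smallest_subnormal = pow2_scientific(-262378)
smallest_normal = pow2_scientific(-262142)
largest_approx = pow2_scientific(262144)
```

The decimal exponents returned for the three cases should agree with the figures in the text (−78984, −78913 and 78913), and `largest_finite` equals 2^{262144} − 2^{261907} exactly.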

These examples are given in bit *representation*, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_{16} = +0

8000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_{16} = −0

7fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_{16} = +infinity

ffff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_{16} = −infinity

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001_{16} = 2^{−262142} × 2^{−236} = 2^{−262378} ≈ 2.24800708647703657297018614776265182597360918266100276294348974547709294462 × 10^{−78984} (smallest positive subnormal number)

0000 0fff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff_{16} = 2^{−262142} × (1 − 2^{−236}) ≈ 2.4824279514643497882993282229138717236776877060796468692709532979137875392 × 10^{−78913} (largest subnormal number)

0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_{16} = 2^{−262142} ≈ 2.48242795146434978829932822291387172367768770607964686927095329791378756168 × 10^{−78913} (smallest positive normal number)

7fff efff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff_{16} = 2^{262143} × (2 − 2^{−236}) ≈ 1.61132571748576047361957211845200501064402387454966951747637125049607182699 × 10^{78913} (largest normal number)

3fff efff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff ffff_{16} = 1 − 2^{−237} ≈ 0.999999999999999999999999999999999999999999999999999999999999999999999995472 (largest number less than one)

3fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_{16} = 1 (one)

3fff f000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001_{16} = 1 + 2^{−236} ≈ 1.00000000000000000000000000000000000000000000000000000000000000000000000906 (smallest number larger than one)

By default, 1/3 rounds down, as it does in double precision, because the significand has an odd number of bits. The bits beyond the rounding point are `0101...`, which is less than 1/2 of a unit in the last place.
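The rounding of 1/3 can be reproduced with exact rational arithmetic. The sketch below assumes round-to-nearest, ties-to-even (the IEEE 754 default) and handles only positive values in the normal range; `encode_normal` and `pow2` are hypothetical helper names, not part of any library.

```python
from fractions import Fraction

BIAS = 262143        # exponent bias
FRACTION_BITS = 236  # explicitly stored significand bits


def pow2(e):
    """Return 2**e exactly as a Fraction."""
    return Fraction(1 << e) if e >= 0 else Fraction(1, 1 << -e)


def encode_normal(value):
    """Encode a positive rational as binary256 bits, nearest-even rounding.

    Returns (bits, direction) with direction in {"exact", "down", "up"}.
    """
    value = Fraction(value)
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    # Find e with 2**e <= value < 2**(e + 1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if value < pow2(e):
        e -= 1
    # Scale so the integer part holds the full 237-bit significand.
    scaled = value / pow2(e) * (1 << FRACTION_BITS)
    q, r = divmod(scaled.numerator, scaled.denominator)
    direction = "exact" if r == 0 else "down"
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q % 2 == 1):
        q += 1
        direction = "up"
    if q == 1 << (FRACTION_BITS + 1):  # rounding carried into a new binade
        q >>= 1
        e += 1
    if not -262142 <= e <= 262143:
        raise ValueError("value is outside the normal range")
    bits = ((e + BIAS) << FRACTION_BITS) | (q - (1 << FRACTION_BITS))
    return bits, direction


bits, direction = encode_normal(Fraction(1, 3))
pattern = format(bits, "064x")  # 64 hexadecimal digits
```

For 1/3 the discarded remainder is one third of a unit in the last place, so `direction` is `"down"`, and `pattern` consists of the digits `3fffd` followed by fifty-nine `5` digits.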

Octuple precision is rarely implemented because demand for it is extremely low. Apple Inc. had an implementation of addition, subtraction and multiplication of octuple-precision numbers with a 224-bit two's complement significand and a 32-bit exponent.^{[1]} One can use general arbitrary-precision arithmetic libraries to obtain octuple (or higher) precision, but specialized octuple-precision implementations may achieve higher performance.
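As one illustration of the arbitrary-precision route, Python's standard `decimal` module can be configured for about 72 significant decimal digits, which is comparable to (but not the same as) the 237-bit binary significand. This is only a sketch: `decimal` rounds in base 10, so results will not match binary256 bit for bit, and the helper name `with_octuple_like_precision` is made up for this example.

```python
from decimal import Decimal, localcontext


def with_octuple_like_precision(func, digits=72):
    """Run func() in a decimal context with roughly octuple-like precision."""
    with localcontext() as ctx:
        ctx.prec = digits      # significant decimal digits, not bits
        ctx.Emax = 999999      # comfortably wider than 10**78913
        ctx.Emin = -999999
        return func()


third = with_octuple_like_precision(lambda: Decimal(1) / Decimal(3))
big = with_octuple_like_precision(lambda: Decimal(2) ** 262143)
```

Here `third` carries 72 significant digits of 1/3, and `big` approximates 2^{262143}, a magnitude far beyond the range of double precision but well within the exponent range chosen for the context.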

There is no known hardware implementation of octuple precision.

- IEEE Standard for Floating-Point Arithmetic (IEEE 754)
- ISO/IEC 10967, Language-independent arithmetic
- Primitive data type


- [1] R. Crandall; J. Papadopoulos (8 May 2002). "Octuple-precision floating point on Apple G4" (PDF). Archived from the original on July 28, 2006.

- Beebe, Nelson H. F. (2017-08-22). *The Mathematical-Function Computation Handbook - Programming Using the MathCW Portable Software Library* (1 ed.). Salt Lake City, UT, USA: Springer International Publishing AG. doi:10.1007/978-3-319-64110-2. ISBN 978-3-319-64109-6. LCCN 2017947446.

This page is based on this Wikipedia article

Text is available under the CC BY-SA 4.0 license; additional terms may apply.

Images, videos and audio are available under their respective licenses.
