In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE-754 floating point standard, noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."
In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.
The IEEE 754 standard specifies a binary128 as having a 1-bit sign, a 15-bit exponent, and a significand precision of 113 bits (112 explicitly stored).
This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to IEEE 754 quadruple-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.
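The 33 and 36 digit figures follow from the 113-bit significand. A minimal sketch of the usual bounds, assuming the standard formulas floor((p − 1) · log10 2) and ceil(1 + p · log10 2) for a p-bit binary significand (the function name `decimal_digit_bounds` is illustrative, not part of any standard):

```python
import math


def decimal_digit_bounds(p: int) -> tuple:
    """Return (safe_digits, needed_digits) for a p-bit binary significand.

    safe_digits:   any decimal string with this many significant digits
                   survives decimal -> binary -> decimal unchanged.
    needed_digits: this many significant digits always suffice for
                   binary -> decimal -> binary to recover the value.
    """
    log10_2 = math.log10(2)
    safe_digits = math.floor((p - 1) * log10_2)
    needed_digits = math.ceil(1 + p * log10_2)
    return safe_digits, needed_digits


print(decimal_digit_bounds(113))  # binary128
print(decimal_digit_bounds(53))   # binary64 (double precision)
```

The same function gives the familiar 15 and 17 digits for double precision, which is a quick sanity check on the formulas.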
The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log10(2^113) ≈ 34.016). The bits are laid out as: sign (bit 127), exponent (bits 126 to 112), significand (bits 111 to 0).
A binary256 would have a significand precision of 237 bits (approximately 71 decimal digits) and exponent bias 262143.
The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.
Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.
The stored exponents 0000₁₆ and 7FFF₁₆ are interpreted specially.
|Exponent||Significand zero||Significand non-zero||Equation|
|0000₁₆||0, −0||subnormal numbers||(−1)^signbit × 2^−16382 × 0.significandbits₂|
|0001₁₆, ..., 7FFE₁₆||normalized value||(−1)^signbit × 2^(exponentbits₂ − 16383) × 1.significandbits₂|
|7FFF₁₆||±∞||NaN (quiet, signalling)|
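The decoding rules in the table can be sketched in a few lines. This is an illustrative decoder, not a reference implementation: the name `decode_binary128` and the choice to return exact `Fraction` values (and plain strings for infinities and NaN) are assumptions made for the example.

```python
from fractions import Fraction

EXP_BITS = 15      # exponent field width
FRAC_BITS = 112    # stored significand bits
BIAS = 16383       # exponent bias


def decode_binary128(bits: int):
    """Decode a 128-bit integer holding a binary128 bit pattern.

    Returns an exact Fraction for finite values, or one of the strings
    "inf", "-inf", "nan". The sign of zero is lost, since Fraction has
    no negative zero.
    """
    sign = bits >> 127
    exponent = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    fraction = bits & ((1 << FRAC_BITS) - 1)

    if exponent == 0x7FFF:                 # all ones: infinity or NaN
        if fraction != 0:
            return "nan"
        return "-inf" if sign else "inf"

    if exponent == 0:                      # zero or subnormal: 0.fraction
        magnitude = Fraction(fraction, 1 << FRAC_BITS) * Fraction(2) ** (1 - BIAS)
    else:                                  # normal: 1.fraction
        magnitude = (1 + Fraction(fraction, 1 << FRAC_BITS)) * Fraction(2) ** (exponent - BIAS)
    return -magnitude if sign else magnitude


print(decode_binary128(0x3FFF << 112))   # the bit pattern for one
print(decode_binary128(0xC000 << 112))   # the bit pattern for minus two
```

Working with exact rationals avoids any dependence on a native 128-bit float type, which most Python builds do not provide.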
The minimum strictly positive (subnormal) value is 2^−16494 ≈ 10^−4965 and has a precision of only one bit. The minimum positive normal value is 2^−16382 ≈ 3.3621 × 10^−4932 and has a precision of 113 bits, i.e. ±2^−16494 as well. The maximum representable value is 2^16384 − 2^16271 ≈ 1.1897 × 10^4932.
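These limits can be checked with exact integer arithmetic and Python's `decimal` module. The sketch below is only a verification aid; the helper name `sci` and the 40-digit working precision are choices made for the example.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # working precision for the divisions below


def sci(x: Decimal, digits: int = 4) -> str:
    """Format x in scientific notation with the given number of decimals."""
    return format(x, f".{digits}E")


# Exact values built from Python integers, then converted to Decimal.
largest = Decimal(2 ** 16384 - 2 ** 16271)
smallest_normal = Decimal(1) / Decimal(2 ** 16382)
smallest_subnormal = Decimal(1) / Decimal(2 ** 16494)

print(sci(largest))             # 1.1897E+4932
print(sci(smallest_normal))     # 3.3621E-4932
print(sci(smallest_subnormal))  # 6.4752E-4966
```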
These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.
0000 0000 0000 0000 0000 0000 0000 0001₁₆ = 2^−16382 × 2^−112 = 2^−16494 ≈ 6.4751751194380251109244389582276465525 × 10^−4966 (smallest positive subnormal number)
0000 ffff ffff ffff ffff ffff ffff ffff₁₆ = 2^−16382 × (1 − 2^−112) ≈ 3.3621031431120935062626778173217519551 × 10^−4932 (largest subnormal number)
0001 0000 0000 0000 0000 0000 0000 0000₁₆ = 2^−16382 ≈ 3.3621031431120935062626778173217526026 × 10^−4932 (smallest positive normal number)
7ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 2^16383 × (2 − 2^−112) ≈ 1.1897314953572317650857593266280070162 × 10^4932 (largest normal number)
3ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 1 − 2^−113 ≈ 0.9999999999999999999999999999999999037 (largest number less than one)
3fff 0000 0000 0000 0000 0000 0000 0000₁₆ = 1 (one)
3fff 0000 0000 0000 0000 0000 0000 0001₁₆ = 1 + 2^−112 ≈ 1.0000000000000000000000000000000001926 (smallest number larger than one)
c000 0000 0000 0000 0000 0000 0000 0000₁₆ = −2
0000 0000 0000 0000 0000 0000 0000 0000₁₆ = 0
8000 0000 0000 0000 0000 0000 0000 0000₁₆ = −0
7fff 0000 0000 0000 0000 0000 0000 0000₁₆ = infinity
ffff 0000 0000 0000 0000 0000 0000 0000₁₆ = −infinity
4000 921f b544 42d1 8469 898c c517 01b8₁₆ ≈ π
3ffd 5555 5555 5555 5555 5555 5555 5555₁₆ ≈ 1/3
By default, 1/3 rounds down like double precision, because of the odd number of bits in the significand. So the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.
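The rounding of 1/3 can be reproduced with exact rational arithmetic. The following is a minimal sketch of round-to-nearest, ties-to-even encoding, assuming a positive input in the normal range; the function name `encode_binary128` is hypothetical, and subnormals, zero, negative values, infinities and NaN are deliberately left out.

```python
from fractions import Fraction

FRAC_BITS = 112
BIAS = 16383


def encode_binary128(value: Fraction) -> int:
    """Round a positive Fraction to the nearest binary128 bit pattern.

    Sketch only: handles positive values in the normal range and uses
    round-to-nearest, ties-to-even.
    """
    if value <= 0:
        raise ValueError("this sketch handles positive values only")

    # Find e such that 2**e <= value < 2**(e + 1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1

    # Scale so the integer part q is the 113-bit significand.
    scaled = value * Fraction(2) ** (FRAC_BITS - e)
    q = scaled.numerator // scaled.denominator
    remainder = scaled - q            # discarded bits, in units of one ulp

    half = Fraction(1, 2)
    if remainder > half or (remainder == half and q % 2 == 1):
        q += 1
    if q == 1 << (FRAC_BITS + 1):     # rounding carried into a new bit
        q >>= 1
        e += 1

    if not -16382 <= e <= 16383:
        raise OverflowError("value is outside the normal range")
    # Drop the implicit leading 1 and attach the biased exponent.
    return ((e + BIAS) << FRAC_BITS) | (q - (1 << FRAC_BITS))


print(format(encode_binary128(Fraction(1, 3)), "032x"))
# 3ffd5555555555555555555555555555
```

For 1/3 the discarded remainder is exactly 1/3 of a unit in the last place, so the significand is left as 0101...01 and the result is rounded down.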
A common software technique to implement nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic. Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least 2 × 53 = 106 bits (actually 107 bits except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits, significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of 1.8 × 10^308 for double-double versus 1.2 × 10^4932 for binary128).
In particular, a double-double/quadruple-precision value q in the double-double technique is represented implicitly as a sum q = x + y of two double-precision values x and y, each of which supplies half of q's significand. That is, the pair (x, y) is stored in place of q, and operations on q values (+, −, ×, ...) are transformed into equivalent (but more complicated) operations on the x and y values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques.
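As an illustration of how an operation on q reduces to double-precision operations on (x, y), here is a minimal sketch of double-double addition. It assumes Python floats are IEEE double precision with round-to-nearest; the names `two_sum` and `dd_add` are chosen for the example, and the addition shown is a simple variant rather than the most accurate one known.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple:
    """Error-free addition: returns (s, err) with s = fl(a + b) and
    a + b == s + err exactly (barring overflow)."""
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err


def dd_add(x: tuple, y: tuple) -> tuple:
    """Add two double-double values, each given as a (high, low) pair."""
    s, e = two_sum(x[0], y[0])    # exact sum of the high parts
    e += x[1] + y[1]              # fold in the low parts
    return two_sum(s, e)          # renormalise into (high, low)


one = (1.0, 0.0)
tiny = (2.0 ** -80, 0.0)
total = dd_add(one, tiny)

print(1.0 + 2.0 ** -80 == 1.0)    # True: plain double loses the small term
print(Fraction(total[0]) + Fraction(total[1]) == 1 + Fraction(1, 2 ** 80))  # True
```

The pair keeps the term 2^−80 that a single double discards, which is the extra precision the technique is meant to provide.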
Note that double-double arithmetic has the following special characteristics:
In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively.
A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.
Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, as of 2016, less common (see "Hardware support" below). One can use general arbitrary-precision arithmetic libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.
A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming languages.
Quadruple precision is specified in Fortran by the real(real128) (module iso_fortran_env from Fortran 2008 must be used, the constant real128 is equal to 16 on most processors), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16. (Quadruple-precision REAL*16 is supported by the Intel Fortran Compiler and by the GNU Fortran compiler on x86, x86-64, and Itanium architectures, for example.)
For the C programming language, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format (binary128). Alternatively, in C/C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language (which only requires long double to be at least as precise as double), nor is it common.
On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision (e.g. the GNU C Compiler gcc and the Intel C++ compiler with a /Qlong-double switch) or simply as being synonymous with double precision (e.g. Microsoft Visual C++), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double corresponds to the IEEE 754 quadruple-precision format. On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on PowerPC (as double-double) and SPARC, or the Sun Studio compilers on SPARC. Even if long double is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs, and on PowerPC as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options; and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad.
... types, and includes a custom implementation of the standard math library.
IEEE quadruple precision was added to the IBM S/390 G5 in 1998, and is supported in hardware in subsequent z/Architecture processors. The IBM POWER9 CPU (Power ISA 3.0) has native 128-bit hardware support.
Native support of IEEE 128-bit floats is defined in PA-RISC 1.0, and in SPARC V8 and V9 architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware as of 2004.
Non-IEEE extended precision (128 bits of storage: 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the IBM System/370 series (1970s–1980s) and was available on some S/360 models in the 1960s (S/360-85, -195, and others by special request or simulated by OS software).
The VAX processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112 fraction bits; however, the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware; all the others emulated H Floating-point in software.
The RISC-V architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating point arithmetic. The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.
Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement SIMD instructions, such as Streaming SIMD Extensions or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.
SPARC is an instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit IEEE Standard 754 floating-point as its principal data types.
Floating-point: The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), 16 quad-precision (128-bit) registers, or a mixture thereof.
There are four situations, however, when the hardware will not successfully complete a floating-point instruction: ... The instruction is not implemented by the hardware (such as ... quad-precision instructions on any SPARC FPU).