This article needs additional citations for verification . (March 2007) (Learn how and when to remove this template message)
In computer science, the precision of a numerical quantity is a measure of the detail in which the quantity is expressed. This is usually measured in bits, but sometimes in decimal digits. It is related to precision in mathematics, which describes the number of digits that are used to express a value.
Some of the standardized precision formats are
Of these, octuple-precision format is rarely used. The single- and double-precision formats are most widely used and supported on nearly all platforms. The use of half-precision format has been increasing especially in the field of machine learning since many machine learning algorithms are inherently error-tolerant.
Precision is often the source of rounding errors in computation. The number of bits used to store a number will often cause some loss of accuracy. An example would be to store "sin(0.1)" in IEEE single precision floating point standard. The error is then often magnified as subsequent computations are made using the data (although it can also be reduced).
In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real numbers as an approximation to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times. A number is, in general, represented approximately to a fixed number of significant digits and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:
IEEE 754-1985 was an industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.
A computer number format is the internal representation of numeric values in digital computer and calculator hardware and software. Normally, numeric values are stored as groupings of bits, named for the number of bits that compose them. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the bit format used by the computer's instruction set generally requires conversion for external use such as printing and display. Different types of processors may have different internal representations of numerical values. Different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.
Double-precision floating-point format is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation. For example, replacing $23.4476 with $23.45, the fraction 312/937 with 1/3, or the expression √2 with 1.414.
In numerical analysis, the Kahan summation algorithm, also known as compensated summation, significantly reduces the numerical error in the total obtained by adding a sequence of finite-precision floating-point numbers, compared to the obvious approach. This is done by keeping a separate running compensation.
The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.
The significand is part of a number in scientific notation or a floating-point number, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fraction.
A roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error. When using approximation equations or algorithms, especially when using finitely many digits to represent real numbers, one of the goals of numerical analysis is to estimate computation errors. Computation errors, also called numerical errors, include both truncation errors and roundoff errors.
IBM System/360 computers, and subsequent machines based on that architecture (mainframes), support a hexadecimal floating-point format (HFP).
In computer science and numerical analysis, unit in the last place or unit of least precision (ULP) is the spacing between floating-point numbers, i.e., the value the least significant digit represents if it is 1. It is used as a measure of accuracy in numeric calculations.
Machine epsilon gives an upper bound on the relative error due to rounding in floating point arithmetic. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science. The quantity is also called macheps or unit roundoff, and it has the symbols Greek epsilon or bold Roman u, respectively.
Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.
In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory.
In computing, quadruple precision is a binary floating point–based computer number format that occupies 16 bytes with precision more than twice the 53-bit double precision.
Unums are an arithmetic and a binary representation format for real numbers analogous to floating point, proposed by John L. Gustafson as an alternative to the now ubiquitous IEEE 754 arithmetic. The first version of unums, now officially known as Type I unum, was introduced in his book The End of Error. Gustafson has since created two newer revisions of the unum format, Type II and Type III, in late 2016. Type III unum is also known as posits and valids; posits present arithmetic for single real values and valids present the interval arithmetic version. This data type can serve as a replacement for IEEE 754 floats for programs which do not depend on specific features of IEEE 754. Details of valids have yet to be officially articulated by Gustafson.
In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.
Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating-point error cannot be eliminated, and, at best, can only be managed.
The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.