IEEE 754-2008 (previously known as IEEE 754r) was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 floating point standard. The revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854 (the radix-independent floating-point standard).
In a few cases, where stricter definitions of binary floating-point arithmetic might be performance-incompatible with some existing implementation, they were made optional.
The standard had been under revision since 2000, with a target completion date of December 2006. The revision of an IEEE standard broadly follows three phases:
On 11 June 2008, it was approved unanimously by the IEEE Revision Committee (RevCom), and it was formally approved by the IEEE-SA Standards Board on 12 June 2008. It was published on 29 August 2008.
Participation in drafting the standard was open to people with a solid knowledge of floating-point arithmetic. More than 90 people attended at least one of the monthly meetings, which were held in Silicon Valley, and many more participated through the mailing list.
Progress at times was slow, leading the chairman to declare at the 15 September 2005 meeting that "no progress is being made, I am suspending these meetings until further notice on those grounds". In December 2005, the committee reorganized under new rules with a target completion date of December 2006.
New policies and procedures were adopted in February 2006. In September 2006, a working draft was approved to be sent to the parent sponsoring committee (the IEEE Microprocessor Standards Committee, or MSC) for editing and to be sent to sponsor ballot.
The last version of the draft submitted to the MSC, version 1.2.5, was dated 4 October 2006. The MSC accepted the draft on 9 October 2006. The draft was changed significantly in detail during the balloting process.
The first sponsor ballot took place from 29 November 2006 through 28 December 2006. Of the 84 members of the voting body, 85.7% responded, and 78.6% voted approval. There were negative votes (and over 400 comments), so there was a recirculation ballot in March 2007; this received an 84% approval. There were sufficient comments (over 130) from that ballot that a third draft was prepared for a second, 15-day recirculation ballot, which started in mid-April 2007. For a technical reason, the ballot process was restarted with the 4th ballot in October 2007; there were also substantial changes in the draft resulting from 650 voters' comments and from requests from the sponsor (the IEEE MSC); this ballot just failed to reach the required 75% approval. The 5th ballot had a 98.0% response rate with 91.0% approval, with comments leading to relatively small changes. The 6th, 7th, and 8th ballots sustained approval ratings of over 90% with progressively fewer comments on each draft; the 8th (which had no in-scope comments: 9 were repeats of previous comments and one referred to material not in the draft) was submitted to the IEEE Standards Revision Committee ('RevCom') for approval as an IEEE standard.
The IEEE Standards Revision Committee (RevCom) considered and unanimously approved the IEEE 754r draft at its June 2008 meeting, and it was approved by the IEEE-SA Standards Board on 12 June 2008. Final editing was then completed and the document was forwarded to the IEEE Standards Publications Department for publication.
The new IEEE 754 (formally IEEE Std 754-2008, the IEEE Standard for Floating-Point Arithmetic) was published by the IEEE Computer Society on 29 August 2008, and is available from the IEEE Xplore website.
This standard replaces IEEE 754-1985. IEEE 854, the radix-independent floating-point standard, was withdrawn in December 2008.
The most obvious enhancements to the standard are the addition of a 16-bit and a 128-bit binary type and three decimal types, some new operations, and many recommended functions. However, there have been significant clarifications in terminology throughout. This summary highlights the main differences in each major clause of the standard.
The scope (determined by the sponsor of the standard) has been widened to include decimal formats and arithmetic, and adds extendable formats.
Many of the definitions have been rewritten for clarification and consistency. A few terms have been renamed for clarity (for example, denormalized has been renamed to subnormal).
The description of formats has been made more regular, with a distinction between arithmetic formats (in which arithmetic may be carried out) and interchange formats (which have a standard encoding). Conformance to the standard is now defined in these terms.
The specification levels of a floating-point format have been enumerated, to clarify the distinction between:
The sets of representable entities are then explained in detail, showing that they can be treated with the significand being considered either as a fraction or an integer. The particular sets known as basic formats are defined, and the encodings used for interchange of binary and decimal formats are explained.
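The two views of the significand can be illustrated with a minimal Python sketch. It assumes Python floats are binary64 (precision 53 bits, as on common platforms) and that the input is finite and nonzero; the helper names `fraction_view` and `integer_view` are hypothetical and not taken from the standard.

```python
import math

P = 53  # assumed precision of a Python float (binary64), in bits

def fraction_view(x):
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2."""
    m, e = math.frexp(x)          # frexp yields 0.5 <= |m| < 1
    return m * 2, e - 1

def integer_view(x):
    """Return (M, q) with x == M * 2**q and M an integer of at most P bits."""
    m, e = math.frexp(x)
    return int(m * 2**P), e - P   # scale the fraction up to an integer

m, e = fraction_view(6.5)         # significand as a fraction: 1.625 * 2**2
M, q = integer_view(6.5)          # significand as an integer: M * 2**-50
```

Both pairs describe the same value; only the position of the radix point in the significand, and therefore the exponent, differs.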
The binary interchange formats have the "half precision" (16-bit storage format) and "quad precision" (128-bit format) added, together with generalized formulae for some wider formats; the basic formats have 32-bit, 64-bit, and 128-bit encodings.
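As an illustration of the 16-bit format, the sketch below splits a binary16 encoding into its fields. It assumes the usual layout of 1 sign bit, 5 exponent bits (bias 15) and 10 trailing significand bits, and relies on the `"e"` format code of Python's `struct` module; the helper name `binary16_fields` is hypothetical.

```python
import struct

def binary16_fields(x):
    """Encode x as binary16 and return (sign, biased_exponent, fraction)."""
    bits = int.from_bytes(struct.pack(">e", x), "big")  # 16-bit big-endian encoding
    sign = bits >> 15                # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, biased by 15
    fraction = bits & 0x3FF          # 10 trailing significand bits
    return sign, exponent, fraction

one = binary16_fields(1.0)           # (0, 15, 0)
largest = binary16_fields(65504.0)   # (0, 30, 1023), the largest finite binary16 value
```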
Three new decimal formats are described, matching the lengths of the 32–128-bit binary formats. These give decimal interchange formats with 7, 16, and 34-digit significands, which may be normalized or unnormalized. For maximum range and precision, the formats merge part of the exponent and significand into a combination field, and compress the remainder of the significand using either a decimal integer encoding (which uses Densely Packed Decimal, or DPD, a compressed form of BCD) or a conventional binary integer encoding. The basic formats are the two larger sizes, which have 64-bit and 128-bit encodings. Generalized formulae for some other interchange formats are also specified.
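The effect of the three significand lengths can be approximated with Python's `decimal` module. This is a sketch of precision and exponent range only, not of the DPD or binary integer encodings. The digit counts come from the text above; the exponent limits (Emax of 96, 384 and 6144) are the values usually quoted for decimal32, decimal64 and decimal128 and should be treated as an assumption here. The names `DECIMAL_FORMATS` and `one_third` are hypothetical.

```python
from decimal import Context, Decimal

# Precision (digits) and exponent range per format; Emin is taken as 1 - Emax.
DECIMAL_FORMATS = {
    "decimal32": Context(prec=7, Emax=96, Emin=-95),
    "decimal64": Context(prec=16, Emax=384, Emin=-383),
    "decimal128": Context(prec=34, Emax=6144, Emin=-6143),
}

def one_third(fmt):
    """Divide 1 by 3, rounding to the precision of the named format."""
    return DECIMAL_FORMATS[fmt].divide(Decimal(1), Decimal(3))

print(one_third("decimal32"))   # 0.3333333
```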
Extended and extendable formats allow for arithmetic at other precisions and ranges.
This clause has been changed to encourage the use of static attributes for controlling floating-point operations, and (in addition to required rounding attributes) allow for alternate exception handling, widening of intermediate results, value-changing optimizations, and reproducibility.
The round-to-nearest, ties away from zero rounding attribute has been added (required for decimal operations only).
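The difference between the two round-to-nearest attributes shows up only on exact ties. The sketch below uses Python's `decimal` module, where `ROUND_HALF_UP` rounds ties away from zero and `ROUND_HALF_EVEN` rounds ties to the even neighbour; the helper `round_to_integer` is a hypothetical name used for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_to_integer(text, rounding):
    """Round a decimal string to an integer under the given rounding mode."""
    return Decimal(text).quantize(Decimal("1"), rounding=rounding)

ties_away = round_to_integer("2.5", ROUND_HALF_UP)     # Decimal('3')
ties_even = round_to_integer("2.5", ROUND_HALF_EVEN)   # Decimal('2')
negative = round_to_integer("-2.5", ROUND_HALF_UP)     # Decimal('-3')
```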
This section has numerous clarifications (notably in the area of comparisons), and several previously recommended operations (such as copy, negate, abs, and class) are now required.
New operations include fused multiply–add (FMA), explicit conversions, classification predicates (isNaN(x), etc.), various min and max functions, a total ordering predicate, and two decimal-specific operations (sameQuantum and quantize).
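What distinguishes a fused multiply–add from a separate multiply and add is that the result is rounded only once. The sketch below emulates that behaviour for finite inputs with exact rational arithmetic; it is not a hardware FMA, it does not handle infinities or NaNs, and the name `fma_sketch` is hypothetical. It assumes Python floats are binary64.

```python
from fractions import Fraction

def fma_sketch(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

two_roundings = a * b + c          # the product rounds to 1.0, so the sum is 0.0
one_rounding = fma_sketch(a, b, c) # the exact result -2**-60 survives
```

The exact product is 1 − 2⁻⁶⁰, which is not representable in binary64 and rounds to 1.0 when the multiplication is performed on its own.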
The min and max operations are defined but leave some leeway for the case where the inputs are equal in value but differ in representation. In particular:
min(−0,+0) must produce something with a value of zero but may always return the first argument.
In order to support operations such as windowing in which a NaN input should be quietly replaced with one of the end points, min and max are defined to select a number, x, in preference to a quiet NaN:
min(x,NaN) = min(NaN,x) = x
max(x,NaN) = max(NaN,x) = x
In the standard, these functions are called minNum and maxNum to indicate their preference for a number over a quiet NaN.
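The number-preferring behaviour described above can be sketched in Python as follows. The names `min_num` and `max_num` are hypothetical stand-ins for the standard's operations, signaling NaNs are not modelled, and for zeros of opposite sign this sketch simply returns the first argument, which is one of the permitted choices.

```python
import math

def min_num(x, y):
    """Smaller of x and y, preferring a number over a quiet NaN."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return x if x <= y else y

def max_num(x, y):
    """Larger of x and y, preferring a number over a quiet NaN."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return x if x >= y else y

nan = float("nan")
clamped = max_num(0.0, min_num(nan, 1.0))   # windowing to [0, 1]: NaN becomes an end point
```

If both inputs are NaN, the result is still NaN, since there is no number to prefer.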
Decimal arithmetic, compatible with that used in Java, C#, PL/I, COBOL, Python, REXX, etc., is also defined in this section. In general, decimal arithmetic follows the same rules as binary arithmetic (results are correctly rounded, and so on), with additional rules that define the exponent of a result (more than one is possible in many cases).
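The rule that a decimal result carries a defined exponent can be observed in Python's `decimal` module, which the text above lists as compatible. The sketch shows how addition and multiplication choose the exponent, and how `quantize` and `same_quantum` (Python's spelling of the two decimal-specific operations) work with it. The variable names are illustrative only.

```python
from decimal import Decimal

total = Decimal("1.20") + Decimal("1.3")      # exponent is the smaller of the two: 2.50
product = Decimal("1.20") * Decimal("2.0")    # exponents add: 2.400

# Equal in value, but different members of the same cohort.
same_value = Decimal("2.50") == Decimal("2.5")                 # True
same_exponent = Decimal("2.50").same_quantum(Decimal("2.5"))   # False

# quantize forces a result to the exponent of its second operand.
cents = Decimal("2.5").quantize(Decimal("0.01"))               # 2.50
```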
Unlike in 854, 754r requires correctly rounded base conversion between decimal and binary floating point within a range which depends on the format.
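A correctly rounded decimal-to-binary conversion returns the binary value nearest to the exact decimal input. The sketch below checks that property for a few strings using exact rational arithmetic. It assumes CPython 3.9 or later (for `math.nextafter`) on a platform where `float()` performs correctly rounded conversion, and it ignores the tie-breaking rule; `is_nearest_double` is a hypothetical helper.

```python
import math
from fractions import Fraction

def is_nearest_double(text, x):
    """True if no neighbouring float is closer than x to the exact decimal value."""
    exact = Fraction(text)                      # exact value of the decimal string
    err = abs(Fraction(x) - exact)
    below = math.nextafter(x, -math.inf)
    above = math.nextafter(x, math.inf)
    return (err <= abs(Fraction(below) - exact)
            and err <= abs(Fraction(above) - exact))

samples = ["0.1", "2.675", "1e23"]
results = [is_nearest_double(s, float(s)) for s in samples]
```

The same machinery shows why `0.1` is not stored exactly: `Fraction(0.1)` differs from `Fraction("0.1")`, yet no other binary64 value is closer.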
This clause has been revised and clarified, but with no major additions.
This clause has been revised and considerably clarified, but with no major additions.
This clause has been extended from the previous Clause 8 ('Traps') to allow optional exception handling in various forms, including traps and other models such as try/catch. Traps and other exception mechanisms remain optional, as they were in IEEE 754-1985.
This clause is new; it recommends fifty operations, including log, power, and trigonometric functions, that language standards should define. These are all optional (none are required in order to conform to the standard). The operations include some on dynamic modes for attributes, and also a set of reduction operations (sum, scaled product, etc.).
This clause is new; it recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result.
This clause is new; it recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.
This annex is new; it lists some useful references.
This annex is new; it provides guidance to debugger developers for features that are desired for supporting the debugging of floating point code.
This is a new index, which lists all the operations described in the standard (required or optional).
Due to changes in CPU design and development, the 2008 IEEE floating-point standard could be viewed as being as historical or outdated as the 1985 standard it replaced. There were many outside discussions and items not covered in the standardization process; the items below are the ones that became public knowledge:
In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real numbers as an approximation to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times. A number is, in general, represented approximately to a fixed number of significant digits and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:
A computer number format is the internal representation of numeric values in digital computer and calculator hardware and software. Normally, numeric values are stored as groupings of bits, named for the number of bits that compose them. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the bit format used by the computer's instruction set generally requires conversion for external use such as printing and display. Different types of processors may have different internal representations of numerical values. Different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.
In computing, NaN, standing for not a number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.
Double-precision floating-point format is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
IBM System/360 computers, and subsequent machines based on that architecture (mainframes), support a hexadecimal floating-point format (HFP).
Signed zero is zero with an associated sign. In ordinary arithmetic, the number 0 does not have a sign, so that −0, +0 and 0 are identical. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 and +0, regarded as equal by the numerical comparison operations but with possible different behaviors in particular operations. This occurs in the sign and magnitude and ones' complement signed number representations for integers, and in most floating-point number representations. The number 0 is usually encoded as +0, but can be represented by either +0 or −0.
Extended precision refers to floating point number formats that provide greater precision than the basic floating point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.
Decimal computers are computers which can represent numbers and addresses in decimal as well as providing instructions to operate on those numbers and addresses directly in decimal, without conversion to a pure binary representation. Some also had a variable wordlength, which enabled operations on numbers with a large number of digits.
The IEEE 754-2008 standard includes an encoding format for decimal floating point numbers in which the significand and the exponent can be encoded in two ways, referred to in the draft as binary encoding and decimal encoding.
In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory.
In computing, quadruple precision is a binary floating point–based computer number format that occupies 16 bytes with precision more than twice the 53-bit double precision.
Single-precision floating-point format is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. Like the binary16 format, it is intended for memory saving storage.
In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.
In computing, decimal128 is a decimal floating-point computer numbering format that occupies 16 bytes in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.
In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.
The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use.