Tapered floating point

In computing, tapered floating point (TFP) is a format similar to floating point, but with variable-sized entries for the significand and exponent instead of the fixed-length entries found in normal floating-point formats. In addition, tapered floating-point formats provide a fixed-size pointer entry indicating the number of digits in the exponent entry. The number of digits in the significand entry (including the sign) is then the fixed total length minus the lengths of the exponent and pointer entries. [1]

Thus numbers with a small exponent, i.e. numbers whose order of magnitude is close to 1, have a higher relative precision than those with a large exponent.
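
A minimal decoding sketch in Python is shown below; the field order, the widths, and the unbiased power-of-two interpretation are illustrative assumptions for a hypothetical 16-bit word, not the exact encoding used by Morris or the later schemes.

    # Hypothetical 16-bit tapered layout (low to high bits):
    #   [ pointer (3 bits) | exponent (pointer bits) | significand (rest) | sign (1 bit) ]
    # The pointer gives the exponent length; the significand gets whatever bits remain.
    def decode_tapered(word: int, total_bits: int = 16, pointer_bits: int = 3) -> float:
        exp_bits = word & ((1 << pointer_bits) - 1)
        exponent = (word >> pointer_bits) & ((1 << exp_bits) - 1) if exp_bits else 0
        sig_bits = total_bits - 1 - pointer_bits - exp_bits      # 1 bit reserved for the sign
        significand = (word >> (pointer_bits + exp_bits)) & ((1 << sig_bits) - 1)
        sign = -1.0 if (word >> (total_bits - 1)) & 1 else 1.0
        # Interpret the significand as a fraction in [0, 1) scaled by 2**exponent.
        return sign * (significand / (1 << sig_bits)) * 2.0 ** exponent

With such a layout, a short exponent field leaves more bits for the significand, which is the tapering effect described above.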

History

The tapered floating-point scheme was first proposed by Robert Morris of Bell Laboratories in 1971, [2] and refined with leveling by Masao Iri and Shouichi Matsui of the University of Tokyo in 1981, [3] [4] [1] and by Hozumi Hamada of Hitachi, Ltd. [5] [6] [7]

Alan Feldstein of Arizona State University and Peter Turner [8] of Clarkson University described a tapered scheme resembling a conventional floating-point system except for the overflow or underflow conditions. [7]

In 2013, John Gustafson proposed the Unum number system, a variant of tapered floating-point arithmetic with an exact bit added to the representation and an interval interpretation for the non-exact values. [9] [10]

Related Research Articles

Floating-point arithmetic: computer approximation for real numbers

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision: 12.345 = 12345 × 10⁻³.
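
A short Python check of this decomposition (exact rational arithmetic is used here only to make the equality explicit):

    from fractions import Fraction

    significand, base, exponent = 12345, 10, -3
    value = Fraction(significand) * Fraction(base) ** exponent
    print(value)         # 2469/200, i.e. exactly 12.345
    print(float(value))  # 12.345, the nearest binary floating-point number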

IEEE 754-1985 is a historic industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.

A computer number format is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators. Numerical values are stored as groupings of bits, such as bytes and words. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the encoding used by the computer's instruction set generally requires conversion for external use, such as for printing and display. Different types of processors may have different internal representations of numerical values and different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.

Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computer science, subnormal numbers are the subset of denormalized numbers that fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest positive normal number is subnormal, while denormal can also refer to numbers outside that range.
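
The gap-filling behaviour can be observed directly in IEEE 754 double precision, as in this small Python sketch:

    import sys

    smallest_normal = sys.float_info.min   # about 2.2e-308 for binary64
    subnormal = smallest_normal / 4        # below the normal range, yet still nonzero
    print(0.0 < subnormal < smallest_normal)   # True: a subnormal (denormalized) value
    print(smallest_normal / 1e300)             # 0.0: far enough below, it underflows to zero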

The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

The significand is part of a number in scientific notation or in floating-point representation, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fraction.

Hexadecimal floating point is a format for encoding floating-point numbers first introduced on the IBM System/360 computers, and supported on subsequent machines based on that architecture, as well as machines which were intended to be application-compatible with System/360.

Arithmetic underflow is a condition in a computer program where the result of a calculation is a number of smaller absolute value than the computer can actually represent in memory on its central processing unit (CPU).

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

A logarithmic number system (LNS) is an arithmetic system used for representing real numbers in computer and digital hardware, especially for digital signal processing.
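
A minimal sketch of the idea in Python (names are illustrative; a real LNS stores the logarithm in fixed point and handles signs and zero separately):

    import math

    def lns_encode(x: float) -> float:
        return math.log2(x)            # store the value as its base-2 logarithm

    def lns_multiply(a_log: float, b_log: float) -> float:
        return a_log + b_log           # multiplication becomes addition of the logs

    product_log = lns_multiply(lns_encode(3.0), lns_encode(7.0))
    print(2.0 ** product_log)          # approximately 21.0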

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.
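
Python's decimal module provides such an arithmetic, and the usual binary-conversion artifact disappears, as in this short example:

    from decimal import Decimal

    print(0.1 + 0.2)                        # 0.30000000000000004 in binary floating point
    print(Decimal("0.1") + Decimal("0.2"))  # 0.3 in decimal floating point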

IEEE 754-2008 is a revision of the IEEE 754 standard for floating-point arithmetic. It was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 standard. The 2008 revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854. In a few cases, where stricter definitions of binary floating-point arithmetic might be performance-incompatible with some existing implementation, they were made optional. In 2019, it was updated with a minor revision IEEE 754-2019.

In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

In computing, quadruple precision is a binary floating-point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.

In software engineering and numerical analysis, a binade is a set of numbers in a binary floating-point format that all have the same sign and exponent. In other words, a binade is the interval [2^n, 2^(n+1)) or (−2^(n+1), −2^n] for some integer value n, that is, the set of real numbers or floating-point numbers x of the same sign such that 2^n ≤ |x| < 2^(n+1).
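
In Python, math.frexp can be used to find the binade containing a given value, as in this small sketch:

    import math

    def binade_exponent(x: float) -> int:
        # Return n such that 2**n <= abs(x) < 2**(n + 1).
        mantissa, exp = math.frexp(abs(x))   # abs(x) == mantissa * 2**exp, with 0.5 <= mantissa < 1
        return exp - 1

    print(binade_exponent(6.5))    # 2, since 6.5 lies in [4, 8)
    print(binade_exponent(-0.75))  # -1, since 0.75 lies in [0.5, 1)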

Unums are a family of number formats and arithmetic for implementing real numbers on a computer, proposed by John L. Gustafson in 2015. They are designed as an alternative to the ubiquitous IEEE 754 floating-point standard. The latest version is known as posits.

In computing, octuple precision is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.

Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating-point error cannot be eliminated, and, at best, can only be managed.

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.
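
A bfloat16 value can be produced from a binary32 value by keeping only the upper 16 bits of its bit pattern; the Python sketch below truncates for simplicity, whereas hardware conversions normally round to nearest:

    import struct

    def float_to_bfloat16_bits(x: float) -> int:
        # Reinterpret x as its IEEE 754 binary32 pattern and keep the top 16 bits
        # (sign, 8 exponent bits, 7 significand bits).
        bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
        return bits32 >> 16

    def bfloat16_bits_to_float(bits16: int) -> float:
        # Re-expand to a binary32 pattern by appending 16 zero bits.
        return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

    print(bfloat16_bits_to_float(float_to_bfloat16_bits(3.14159265)))   # 3.140625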

References

  1. Zehendner, Eberhard (Summer 2008). "Rechnerarithmetik: Logarithmische Zahlensysteme" [Computer arithmetic: Logarithmic number systems] (PDF) (Lecture script) (in German). Friedrich-Schiller-Universität Jena. pp. 15–19. Archived (PDF) from the original on 2018-07-09. Retrieved 2018-07-09.
  2. Morris, Sr., Robert H. (December 1971). "Tapered Floating Point: A New Floating-Point Representation". IEEE Transactions on Computers. IEEE. C-20 (12): 1578–1579. doi:10.1109/T-C.1971.223174. ISSN 0018-9340. S2CID 206618406.
  3. Matsui, Shouichi; Iri, Masao (1981-11-05) [January 1981]. "An Overflow/Underflow-Free Floating-Point Representation of Numbers". Journal of Information Processing. Information Processing Society of Japan (IPSJ). 4 (3): 123–133. ISSN 1882-6652. NAID 110002673298. NCID AA00700121. Retrieved 2018-07-09. Also reprinted in: Swartzlander, Jr., Earl E., ed. (1990). Computer Arithmetic. Vol. II. IEEE Computer Society Press. pp. 357–.
  4. Higham, Nicholas John (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics (SIAM). p. 49. ISBN 978-0-89871-521-7, 0-89871-355-2.
  5. Hamada, Hozumi (June 1983). "URR: Universal representation of real numbers". New Generation Computing. 1 (2): 205–209. doi:10.1007/BF03037427. ISSN 0288-3635. S2CID 12806462. Retrieved 2018-07-09. (NB. The URR representation coincides with Elias delta (δ) coding.)
  6. Hamada, Hozumi (1987-05-18). "A new real number representation and its operation". In Irwin, Mary Jane; Stefanelli, Renato (eds.). 1987 IEEE 8th Symposium on Computer Arithmetic (ARITH). Washington, D.C., USA: IEEE Computer Society Press. pp. 153–157. doi:10.1109/ARITH.1987.6158698. ISBN 0-8186-0774-2. S2CID 15189621.
  7. Hayes, Brian (September–October 2009). "The Higher Arithmetic". American Scientist. 97 (5): 364–368. doi:10.1511/2009.80.364. S2CID 121337883. Also reprinted in: Hayes, Brian (2017). "Chapter 8: Higher Arithmetic". Foolproof, and Other Mathematical Meditations (1st ed.). The MIT Press. pp. 113–126. ISBN 978-0-26203686-3.
  8. Feldstein, Alan; Turner, Peter R. (March–April 2006). "Gradual and tapered overflow and underflow: A functional differential equation and its approximation". Journal of Applied Numerical Mathematics. Amsterdam, Netherlands: International Association for Mathematics and Computers in Simulation (IMACS) / Elsevier Science Publishers B. V. 56 (3–4): 517–532. doi:10.1016/j.apnum.2005.04.018. ISSN 0168-9274. Retrieved 2018-07-09.
  9. Gustafson, John Leroy (March 2013). "Right-Sizing Precision: Unleashed Computing: The need to right-size precision to save energy, bandwidth, storage, and electrical power" (PDF). Archived (PDF) from the original on 2016-06-06. Retrieved 2016-06-06.
  10. Muller, Jean-Michel (2016-12-12). "Chapter 2.2.6. The Future of Floating Point Arithmetic". Elementary Functions: Algorithms and Implementation (3rd ed.). Boston, Massachusetts, USA: Birkhäuser. pp. 29–30. ISBN 978-1-4899-7981-0.
