IBM hexadecimal floating-point

Hexadecimal floating point (now called HFP by IBM) is a format for encoding floating-point numbers first introduced on the IBM System/360 computers, and supported on subsequent machines based on that architecture, [1] [2] [3] as well as machines which were intended to be application-compatible with System/360. [4] [5]

In comparison to IEEE 754 floating point, the HFP format has a longer significand and a shorter exponent. All HFP formats have 7 bits of exponent with a bias of 64. The normalized range of representable numbers is from 16^−65 to 16^63 (approx. 5.39761 × 10^−79 to 7.237005 × 10^75).

A number is represented by the following formula: (−1)^sign × 0.significand × 16^(exponent−64).
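
To make the formula concrete, here is a minimal Python sketch (the function name decode_hfp_single is ours, not IBM's) that decodes a 32-bit HFP bit pattern according to the layout described in the next section:

```python
def decode_hfp_single(word: int) -> float:
    """Decode a 32-bit IBM HFP ("short") value from its integer bit pattern.

    Layout: 1 sign bit, 7-bit biased exponent, 24-bit fraction.
    Value = (-1)**sign * (fraction / 16**6) * 16**(exponent - 64).
    """
    sign = (word >> 31) & 0x1
    exponent = (word >> 24) & 0x7F          # biased by 64
    fraction = word & 0xFFFFFF              # six hexadecimal digits
    value = (fraction / float(1 << 24)) * 16.0 ** (exponent - 64)
    return -value if sign else value
```

Since the fraction has at most 24 significant bits and the scale factor is a power of two, the result is exact in an IEEE double.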

Single-precision 32-bit

A single-precision HFP number (called "short" by IBM) is stored in a 32-bit word:

1  | 7      | 24        (width in bits)
S  | Exp    | Fraction
31 | 30...24 | 23...0    (bit index)*
* IBM documentation numbers the bits from left to right, so that the most significant bit is designated as bit number 0.

In this format the leading bit is not suppressed (there is no implicit leading bit as in IEEE 754), and the radix (hexadecimal) point is placed to the left of the significand (called the fraction in IBM documentation and in the figures).

Since the base is 16, the exponent in this form is about twice as large as the equivalent in IEEE 754; to have a similar exponent range in binary, 9 exponent bits would be required.

Example

Consider encoding the value −118.625 as an HFP single-precision floating-point value.

The value is negative, so the sign bit is 1.

The value 118.625₁₀ in binary is 1110110.101₂. This value is normalized by moving the radix point left four bits (one hexadecimal digit) at a time until no nonzero digit remains to the left of the point, yielding 0.01110110101₂. The remaining rightmost digits are padded with zeros, yielding the 24-bit fraction 0.0111 0110 1010 0000 0000 0000₂.

Normalization moved the radix point two hexadecimal digits to the left, yielding a multiplier of 16^+2. A bias of +64 is added to the exponent (+2), yielding +66, which is 100 0010₂.

Combining the sign, exponent plus bias, and normalized fraction produces this encoding:

S | Exp      | Fraction
1 | 100 0010 | 0111 0110 1010 0000 0000 0000

In other words, the number represented is −0.76A000₁₆ × 16^(66−64) = −0.4633789… × 16^+2 = −118.625.
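
The same steps can be mirrored in a short Python sketch (a simplified illustration that truncates rather than rounds and performs no overflow or underflow checks):

```python
def encode_hfp_single(x: float) -> int:
    """Encode a float as a 32-bit IBM HFP ("short") bit pattern, truncating."""
    if x == 0.0:
        return 0                        # normalized zero: all bits zero
    sign = 1 if x < 0 else 0
    m, e = abs(x), 0
    while m >= 1.0:                     # move the radix point left, one hex digit at a time
        m /= 16.0
        e += 1
    while m < 1.0 / 16.0:               # move right until the leading hex digit is nonzero
        m *= 16.0
        e -= 1
    fraction = int(m * (1 << 24))       # six hexadecimal digits, truncated
    return (sign << 31) | ((e + 64) << 24) | fraction

assert hex(encode_hfp_single(-118.625)) == "0xc276a000"   # S=1, Exp=0x42, Fraction=0x76A000
```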

Largest representable number

S | Exp      | Fraction
0 | 111 1111 | 1111 1111 1111 1111 1111 1111

The number represented is +0.FFFFFF₁₆ × 16^(127−64) = (1 − 16^−6) × 16^63 ≈ +7.2370051 × 10^75.
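
This can be checked with the decoder sketched earlier:

```python
print(decode_hfp_single(0x7FFFFFFF))   # approximately 7.237005e+75
```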

Smallest positive normalized number

S | Exp      | Fraction
0 | 000 0000 | 0001 0000 0000 0000 0000 0000

The number represented is +0.1₁₆ × 16^(0−64) = 16^−1 × 16^−64 = 16^−65 ≈ +5.397605 × 10^−79.
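
Again using the decoder sketched earlier:

```python
print(decode_hfp_single(0x00100000))   # approximately 5.397605e-79
```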

Zero

S | Exp      | Fraction
0 | 000 0000 | 0000 0000 0000 0000 0000 0000

Zero (0.0) is represented in normalized form as all zero bits, which is arithmetically the value +0.0₁₆ × 16^(0−64) = +0 × 16^−64 = 0. Given a fraction of all-bits zero, any combination of positive or negative sign bit and a non-zero biased exponent will also yield a value arithmetically equal to zero. However, the normalized form generated for zero by CPU hardware is all-bits zero. This is true for all three floating-point precision formats. Addition or subtraction with other exponent values can lose precision in the result.
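
Using the decoder sketched earlier, this can be checked directly (the bit patterns below are illustrative):

```python
# Fraction all zero: the value is zero for any sign bit and biased exponent.
assert decode_hfp_single(0x42000000) == 0.0   # +, biased exponent 66, zero fraction
assert decode_hfp_single(0xBF000000) == 0.0   # -, biased exponent 63, zero fraction
assert decode_hfp_single(0x00000000) == 0.0   # the normalized form generated by hardware
```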

Precision issues

Since the base is 16, there can be up to three leading zero bits in the binary significand, so when a number is converted into binary it can have as few as 21 bits of precision. This "wobbling precision" effect can make some calculations very inaccurate, and it has drawn considerable criticism. [6]

A good example of the inaccuracy is the representation of the decimal value 0.1, which has no exact binary or hexadecimal representation. In hexadecimal format, it is represented as 0.19999999…₁₆ or 0.0001 1001 1001 1001 1001 1001 1001…₂, that is:

S | Exp      | Fraction
0 | 100 0000 | 0001 1001 1001 1001 1001 1010

This has only 21 significant bits, whereas the binary (IEEE 754 single-precision) version has 24 bits of precision.

Six hexadecimal digits of precision is roughly equivalent to six decimal digits (i.e. (6 − 1) log₁₀(16) ≈ 6.02). A conversion of a single-precision hexadecimal float to a decimal string requires at least 9 significant digits (i.e. 6 log₁₀(16) + 1 ≈ 8.22) in order to convert back to the same hexadecimal float value.
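
A small sketch using the decoder above makes both the 0.1 error and the digit-count estimates concrete:

```python
import math

# The single-precision HFP pattern for 0.1 shown above: 0x4019999A.
tenth = decode_hfp_single(0x4019999A)
print(tenth - 0.1)                    # ~2.4e-8, i.e. about 21 bits of accuracy

# Digit-equivalence estimates quoted in the text:
print((6 - 1) * math.log10(16))       # ~6.02 guaranteed decimal digits
print(6 * math.log10(16) + 1)         # ~8.22, so 9 significant digits round-trip
```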

Double-precision 64-bit

The double-precision HFP format (called "long" by IBM) is the same as the "short" format except that the fraction field is wider and the double-precision number is stored in a double word (8 bytes):

1  | 7      | 56        (width in bits)
S  | Exp    | Fraction
63 | 62...56 | 55...0    (bit index)*
* IBM documentation numbers the bits from left to right, so that the most significant bit is designated as bit number 0.

The exponent for this format covers only about a quarter of the range of the corresponding IEEE binary format.

14 hexadecimal digits of precision is roughly equivalent to 17 decimal digits. A conversion of double precision hexadecimal float to decimal string would require at least 18 significant digits in order to convert back to the same hexadecimal float value.
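
A decoder for the long format differs from the short one only in the field widths; here is a hedged Python sketch (the helper name is ours) working on the 8-byte big-endian representation:

```python
import struct

def decode_hfp_long(data: bytes) -> float:
    """Decode an 8-byte big-endian IBM HFP "long" value: 1 + 7 + 56 bits.

    Note: a 56-bit fraction cannot always be held exactly in an IEEE double,
    so the result may be rounded.
    """
    (word,) = struct.unpack(">Q", data)
    sign = (word >> 63) & 0x1
    exponent = (word >> 56) & 0x7F
    fraction = word & ((1 << 56) - 1)       # fourteen hexadecimal digits
    value = (fraction / float(1 << 56)) * 16.0 ** (exponent - 64)
    return -value if sign else value

assert decode_hfp_long(bytes.fromhex("c276a00000000000")) == -118.625
```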

Extended-precision 128-bit

Called extended-precision by IBM, a quadruple-precision HFP format was added to the System/370 series and was available on some S/360 models (S/360-85, -195, and others by special request or simulated by OS software). The extended-precision fraction field is wider, and the extended-precision number is stored as two double words (16 bytes):

High-order part
1   | 7         | 56         (width in bits)
S   | Exp       | Fraction (high-order 14 digits)
127 | 126...120 | 119...64   (bit index)*

Low-order part
8      | 56       (width in bits)
Unused | Fraction (low-order 14 digits)
63...56 | 55...0   (bit index)*
* IBM documentation numbers the bits from left to right, so that the most significant bit is designated as bit number 0.

28 hexadecimal digits of precision is roughly equivalent to 32 decimal digits. A conversion of an extended-precision HFP value to a decimal string requires at least 35 significant digits in order to convert back to the same HFP value. The stored exponent in the low-order part is 14 less than that of the high-order part, unless this would be less than zero.
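
A sketch of how the two parts combine into one value (per the format description above, the first byte of the low-order part does not participate; the 28-digit fraction takes the high-order exponent). Converting to a Python float is necessarily lossy, since 28 hexadecimal digits exceed double precision:

```python
def decode_hfp_extended(data: bytes) -> float:
    """Decode a 16-byte big-endian IBM HFP extended value (rounded to a double)."""
    high, low = data[:8], data[8:]
    sign = (high[0] >> 7) & 0x1
    exponent = high[0] & 0x7F                     # biased by 64
    frac_high = int.from_bytes(high[1:], "big")   # high-order 14 digits
    frac_low = int.from_bytes(low[1:], "big")     # low-order 14 digits; the low
    fraction = (frac_high << 56) | frac_low       # part's sign/exponent byte is ignored
    value = (fraction / float(1 << 112)) * 16.0 ** (exponent - 64)
    return -value if sign else value
```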

Arithmetic operations

Available arithmetic operations are addition and subtraction, both normalized and unnormalized, and comparison. Prenormalization is done based on the exponent difference. Multiplication and division prenormalize unnormalized values and truncate the result after one guard digit. There is a halve operation to simplify dividing by two. Starting in ESA/390, there is a square root operation. All operations have one hexadecimal guard digit to avoid precision loss, but most arithmetic operations truncate like simple pocket calculators. Therefore, 1 − 16^−8 = 1; in this case the result is rounded away from zero. [7]
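
The 1 − 16^−8 case can be modelled in a few lines (a toy model of the alignment and truncation, not the hardware algorithm): aligning 16^−8 to the exponent of 1 shifts all of its digits past the single guard position, so the subtraction returns exactly 1.

```python
def hfp_sub_same_sign(exp_a: int, frac_a: int, exp_b: int, frac_b: int):
    """Toy model of short-format subtraction a - b (assumes a >= b > 0).

    Fractions are 6-hex-digit integers; one guard digit; truncating arithmetic.
    """
    GUARD = 4                                         # one hex digit = 4 bits
    a = frac_a << GUARD                               # six digits plus guard digit
    b = (frac_b << GUARD) >> (4 * (exp_a - exp_b))    # align; shifted-out digits are lost
    diff = a - b
    while diff and (diff >> 24) == 0:                 # normalize one hex digit at a time
        diff <<= 4
        exp_a -= 1
    return exp_a, diff >> GUARD                       # truncate: drop the guard digit

# 1 = 0.100000 * 16**1 and 16**-8 = 0.100000 * 16**-7 (biased exponents 65 and 57)
exp, frac = hfp_sub_same_sign(65, 0x100000, 57, 0x100000)
print(exp, hex(frac))                                 # 65 0x100000, i.e. exactly 1 again
```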

IEEE 754 on IBM mainframes

Starting with the S/390 G5 in 1998, [8] IBM mainframes have also included IEEE binary floating-point units which conform to the IEEE 754 Standard for Floating-Point Arithmetic. IEEE decimal floating-point was added to IBM System z9 GA2 [9] in 2007 using millicode [10] and in 2008 to the IBM System z10 in hardware. [11]

Modern IBM mainframes support three floating-point radices: three hexadecimal (HFP) formats, three binary (BFP) formats, and three decimal (DFP) formats. There are two floating-point units per core: one supporting HFP and BFP, and one supporting DFP. A single register file of floating-point registers (FPRs) holds all three formats. Starting with the z13 in 2015, processors include a vector facility with 32 vector registers, each 128 bits wide; a vector register can contain two 64-bit or four 32-bit floating-point numbers. [12] The traditional 16 floating-point registers are overlaid on the vector registers, so some data can be manipulated with either traditional floating-point instructions or the newer vector instructions.

Special uses

The IBM HFP format is used in SAS 5 Transport (XPORT) files [13] and in the SEG Y seismic data exchange format. [14]

As IBM is the only remaining provider of hardware using the HFP format, and as the only IBM machines that support that format are their mainframes, few file formats require it. One exception is the SAS 5 Transport file format, which the FDA requires; in that format, "All floating-point numbers in the file are stored using the IBM mainframe representation. [...] Most platforms use the IEEE representation for floating-point numbers. [...] To assist you in reading and/or writing transport files, we are providing routines to convert from IEEE representation (either big endian or little endian) to transport representation and back again." [13] Code for IBM's format is also available under LGPLv2.1. [15]
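
Such a conversion routine is easy to sketch. The following Python function (our illustration, not the SAS-provided code) converts an IEEE double to the 8-byte big-endian IBM long representation, truncating excess bits and ignoring overflow, underflow, and special values:

```python
def ieee_to_ibm_long(x: float) -> bytes:
    """Convert a Python float (IEEE binary64) to big-endian IBM HFP "long"."""
    if x == 0.0:
        return b"\x00" * 8                  # normalized zero
    sign = 0x80 if x < 0 else 0x00
    m, e = abs(x), 0
    while m >= 1.0:                         # scale into [1/16, 1), counting hex digits
        m /= 16.0
        e += 1
    while m < 1.0 / 16.0:
        m *= 16.0
        e -= 1
    fraction = int(m * (1 << 56))           # fourteen hexadecimal digits, truncated
    return bytes([sign | (e + 64)]) + fraction.to_bytes(7, "big")

print(ieee_to_ibm_long(-118.625).hex())     # c276a00000000000
```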

Systems that use the IBM floating-point format

The decision for hexadecimal floating-point

The article "Architecture of the IBM System/360" explains the choice as being because "the frequency of pre-shift, overflow, and precision-loss post-shift on floating-point addition are substantially reduced by this choice." [16] This allowed higher performance for the large System/360 models, and reduced cost for the small ones. The authors were aware of the potential for precision loss, but assumed that this would not be significant for 64-bit floating-point variables. Unfortunately, the designers seem not to have been aware of Benford's Law which means that a large proportion of numbers will suffer reduced precision.

The book "Computer Architecture" by two of the System/360 architects quotes Sweeney's study of 1958-65 which showed that using a base greater than 2 greatly reduced the number of shifts required for alignment and normalisation, in particular the number of different shifts needed. They used a larger base to make the implementations run faster, and the choice of base 16 was natural given 8-bit bytes. The intention was that 32-bit floats would only be used for calculations that would not propagate rounding errors, and 64-bit double precision would be used for all scientific and engineering calculations. The initial implementation of double precision lacked a guard digit to allow proper rounding, but this was changed soon after the first customer deliveries. [17]

References

  1. IBM System/360 Principles of Operation, IBM Publication A22-6821-6, Seventh Edition (January 13, 1967), pp. 41–50
  2. IBM System/370 Principles of Operation, IBM Publication GA22-7000-4, Fifth Edition (September 1, 1975), pp. 157–170
  3. z/Architecture Principles of Operation, IBM Publication SA22-7832-01, Second Edition (October 2001), chapter 9 ff.
  4. Xerox Data Systems (Oct 1973). Xerox SIGMA 7 Computer Reference Manual. p. 48. Retrieved Nov 13, 2020.
  5. RCA (Mar 1966). Spectra 70 processors: 35 45 55 (PDF). p. 184. Retrieved Nov 13, 2020.
  6. Warren Jr., Henry S. (2013) [2002]. "The Distribution of Leading Digits". Hacker's Delight (2nd ed.). Addison Wesley - Pearson Education, Inc. pp. 385–387. ISBN 978-0-321-84268-8, 0-321-84268-5.
  7. ESA/390 Enhanced Floating Point Support: An Overview
  8. Schwarz, E. M.; Krygowski, C. A. (September 1999). "The S/390 G5 floating-point unit". IBM Journal of Research and Development. 43 (5.6): 707–721. doi:10.1147/rd.435.0707.
  9. Duale, A. Y.; Decker, M. H.; Zipperer, H.-G.; Aharoni, M.; Bohizic, T. J. (January 2007). "Decimal floating-point in z9: An implementation and testing perspective". IBM Journal of Research and Development. 51 (1.2): 217–227. CiteSeerX 10.1.1.123.9055. doi:10.1147/rd.511.0217.
  10. Heller, L. C.; Farrell, M. S. (May 2004). "Millicode in an IBM zSeries processor". IBM Journal of Research and Development. 48 (3.4): 425–434. CiteSeerX 10.1.1.641.1164. doi:10.1147/rd.483.0425.
  11. Schwarz, E. M.; Kapernick, J. S.; Cowlishaw, M. F. (January 2009). "Decimal floating-point support on the IBM System z10 processor". IBM Journal of Research and Development. 53 (1): 4:1–4:10. doi:10.1147/JRD.2009.5388585.
  12. z/Architecture Principles of Operation
  13. "The Record Layout of a Data Set in SAS Transport (XPORT) Format" (PDF). Retrieved September 18, 2014.
  14. "SEG Y rev 1 Data Exchange format, Release 1.0" (PDF). May 2002.
  15. "Package 'SASxport'" (PDF). March 10, 2020.
  16. Amdahl, Gene; Blaauw, Gerrit; Brooks, Frederick, Jr. (1964). "Architecture of the IBM System/360". IBM Journal of Research and Development (1964): 87. Retrieved 4 September 2023.
  17. Blaauw, Gerrit A.; Brooks, Frederick P. (1997). Computer Architecture (1st ed.). Reading, Massachusetts: Addison-Wesley. ISBN 0-201-10557-8.
