The **bfloat16** (**Brain Floating Point**)^{ [1] }^{ [2] } floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. The format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32), intended to accelerate machine learning and near-sensor computing.^{ [3] } It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit significand precision rather than the 24-bit significand of the binary32 format. Bfloat16 numbers are even less suitable for integer calculations than single-precision 32-bit floats, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.^{ [4] }


The bfloat16 format is utilized in Intel AI processors, such as Nervana NNP-L1000, Xeon processors (AVX-512 BF16 extensions), and Intel FPGAs,^{ [5] }^{ [6] }^{ [7] } Google Cloud TPUs,^{ [8] }^{ [9] }^{ [10] } and TensorFlow.^{ [10] }^{ [11] } ARMv8.6-A,^{ [12] } AMD ROCm,^{ [13] } and CUDA ^{ [14] } also support the bfloat16 format. On these platforms, bfloat16 may also be used in mixed-precision arithmetic, where bfloat16 numbers may be operated on and expanded to wider data types.

**bfloat16** has the following format:

- Sign bit: 1 bit
- Exponent width: 8 bits
- Significand precision: 8 bits (7 explicitly stored), as opposed to 24 bits in a classical single-precision floating-point format

The bfloat16 format, being a truncated IEEE 754 single-precision 32-bit float, allows for fast conversion to and from an IEEE 754 single-precision 32-bit float; in conversion to the bfloat16 format, the exponent bits are preserved while the significand field can be reduced by truncation (thus corresponding to round toward 0), ignoring the NaN special case. Preserving the exponent bits maintains the 32-bit float's range of ≈ 10^{−38} to ≈ 3 × 10^{38}.^{ [15] }
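The truncation round-trip described above can be sketched in a few lines of Python using the standard `struct` module; the helper names here are illustrative, not part of any library API:

```python
import struct

def float_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to its upper 16 bits (round-toward-zero bfloat16)."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 fraction bits

def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand 16 bfloat16 bits back to a float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

print(hex(float_to_bfloat16_bits(1.0)))  # 0x3f80
print(bfloat16_bits_to_float(0xC000))    # -2.0
```

Note that a plain shift implements round toward zero; hardware converters often round to nearest even instead, which requires inspecting the discarded low 16 bits.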

The bits are laid out as follows. For comparison, the same value, 0.15625, is shown encoded in each format:

- IEEE 754 half precision (binary16), with 1 sign bit, 5 exponent bits, and 10 fraction bits (bit 15 down to bit 0): `0 01100 0100000000`
- IEEE 754 single precision (binary32), with 1 sign bit, 8 exponent bits, and 23 fraction bits (bit 31 down to bit 0): `0 01111100 01000000000000000000000`
- bfloat16, with 1 sign bit, 8 exponent bits, and 7 fraction bits (bit 15 down to bit 0): `0 01111100 0100000`
- NVIDIA's TensorFloat, with 1 sign bit, 8 exponent bits, and 10 fraction bits (bit 18 down to bit 0): `0 01111100 0100000000`
- AMD's fp24 format, with 1 sign bit, 7 exponent bits, and 16 fraction bits (bit 23 down to bit 0): `0 0111100 0100000000000000`
- Pixar's PXR24 format, with 1 sign bit, 8 exponent bits, and 15 fraction bits (bit 23 down to bit 0): `0 01111100 010000000000000`

In other words, a bfloat16 number is the upper half of the corresponding binary32 number:

`S EEEEEEEE FFFFFFF ffffffffffffffff`

where S is the sign bit, E marks the eight shared exponent bits, F marks the seven fraction bits retained by bfloat16, and f marks the sixteen low-order binary32 fraction bits dropped by truncation.

The bfloat16 binary floating-point exponent is encoded using an offset-binary representation with a zero offset of 127, known as the exponent bias in the IEEE 754 standard.

- E_{min} = 01_{H} − 7F_{H} = −126
- E_{max} = FE_{H} − 7F_{H} = 127
- Exponent bias = 7F_{H} = 127

Thus, in order to get the true exponent as defined by the offset-binary representation, the offset of 127 has to be subtracted from the value of the exponent field.

The minimum and maximum values of the exponent field (00_{H} and FF_{H}) are interpreted specially, like in the IEEE 754 standard formats.

Exponent | Significand zero | Significand non-zero | Equation |
---|---|---|---|
00_{H} | zero, −0 | subnormal numbers | (−1)^{signbit} × 2^{−126} × 0.significandbits |
01_{H}, ..., FE_{H} | normalized value | normalized value | (−1)^{signbit} × 2^{exponentbits−127} × 1.significandbits |
FF_{H} | ±infinity | NaN (quiet, signaling) | |

The minimum positive normal value is 2^{−126} ≈ 1.18 × 10^{−38} and the minimum positive (subnormal) value is 2^{−126−7} = 2^{−133} ≈ 9.2 × 10^{−41}.
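The encoding rules above (bias of 127, subnormals at exponent 00_{H}, specials at FF_{H}) can be collected into a small decoder; this is a sketch, and the function name is illustrative:

```python
def decode_bfloat16(bits: int) -> float:
    """Decode a 16-bit bfloat16 pattern into a Python float."""
    sign = -1.0 if bits >> 15 else 1.0
    exp_field = (bits >> 7) & 0xFF  # 8 stored exponent bits
    frac = bits & 0x7F              # 7 stored fraction bits
    if exp_field == 0x00:           # zeros and subnormals: exponent fixed at -126
        return sign * (frac / 128) * 2.0 ** -126
    if exp_field == 0xFF:           # infinities and NaNs
        return sign * float("inf") if frac == 0 else float("nan")
    # normal numbers: implicit leading 1, bias of 127
    return sign * (1 + frac / 128) * 2.0 ** (exp_field - 127)

print(decode_bfloat16(0x4049))  # 3.140625 (the bfloat16 nearest pi)
print(decode_bfloat16(0x0001))  # 2**-133, the smallest positive subnormal
```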

Just as in IEEE 754, positive and negative infinity are represented with their corresponding sign bits, all 8 exponent bits set (FF_{hex}) and all significand bits zero. Explicitly,

```
val    s_exponent_signcnd
+inf = 0_11111111_0000000
-inf = 1_11111111_0000000
```

Just as in IEEE 754, NaN values are represented with either sign bit, all 8 exponent bits set (FF_{hex}) and not all significand bits zero. Explicitly,

```
val    s_exponent_signcnd
+NaN = 0_11111111_klmnopq
-NaN = 1_11111111_klmnopq
```

where at least one of *k, l, m, n, o, p,* or *q* is 1. As with IEEE 754, NaN values can be quiet or signaling, although there are no known uses of signaling bfloat16 NaNs as of September 2018.
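Following the encodings above, infinities and NaNs can be classified with plain bit masks; a minimal sketch with illustrative names:

```python
EXP_MASK = 0x7F80   # the eight exponent bits of a bfloat16
FRAC_MASK = 0x007F  # the seven fraction bits

def bf16_is_inf(bits: int) -> bool:
    # exponent all ones and fraction zero (sign bit ignored)
    return (bits & 0x7FFF) == EXP_MASK

def bf16_is_nan(bits: int) -> bool:
    # exponent all ones and fraction non-zero
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0

print(bf16_is_inf(0xFF80))  # True  (-infinity)
print(bf16_is_nan(0xFFC1))  # True  (a NaN pattern)
```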

Bfloat16 is designed to maintain the number range from the 32-bit IEEE 754 single-precision floating-point format (binary32), while reducing the precision from 24 bits to 8 bits. This means that the precision is between two and three decimal digits, and bfloat16 can represent finite values up to about 3.4 × 10^{38}.
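The two-to-three-digit figure follows from the 8-bit significand: adjacent bfloat16 values near 1.0 differ by 2^{−7} ≈ 0.008. A quick check using Python's `struct` module (the helper name is illustrative):

```python
import struct

def bf16_to_float(bits16: int) -> float:
    """Expand 16 bfloat16 bits to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# 0x3F80 encodes exactly 1.0; the next representable value is 2**-7 away,
# so values near 1.0 carry roughly 2-3 significant decimal digits.
print(bf16_to_float(0x3F80))      # 1.0
print(bf16_to_float(0x3F80 + 1))  # 1.0078125
```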

These examples are given in bit *representation*, in hexadecimal and binary, of the floating-point value. This includes the sign, (biased) exponent, and significand.

3f80 = 0 01111111 0000000 = 1

c000 = 1 10000000 0000000 = −2

7f7f = 0 11111110 1111111 = (2^{8} − 1) × 2^{−7} × 2^{127} ≈ 3.38953139 × 10^{38} (max finite positive value in bfloat16 precision)

0080 = 0 00000001 0000000 = 2^{−126} ≈ 1.175494351 × 10^{−38} (min normalized positive value in bfloat16 precision and single-precision floating point)

The maximum positive finite value of a normal bfloat16 number is 3.38953139 × 10^{38}, slightly below (2^{24} − 1) × 2^{−23} × 2^{127} = 3.402823466 × 10^{38}, the max finite positive value representable in single precision.
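Both maxima can be verified directly from the formulas above:

```python
# Largest finite bfloat16: exponent FE (2**127), fraction all ones (255/128)
max_bf16 = (2**8 - 1) * 2.0**-7 * 2.0**127
# Largest finite binary32: 24 significand bits instead of 8
max_f32 = (2**24 - 1) * 2.0**-23 * 2.0**127

print(max_bf16)  # ≈ 3.3895314e+38
print(max_f32)   # ≈ 3.4028235e+38
```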

0000 = 0 00000000 0000000 = 0

8000 = 1 00000000 0000000 = −0

7f80 = 0 11111111 0000000 = infinity

ff80 = 1 11111111 0000000 = −infinity

4049 = 0 10000000 1001001 = 3.140625 ≈ π (pi)

3eab = 0 01111101 0101011 = 0.333984375 ≈ 1/3

ffc1 = x 11111111 1000001 => qNaN

ff81 = x 11111111 0000001 => sNaN

- Half-precision floating-point format: 16-bit float with a 1-bit sign, 5-bit exponent, and 11-bit significand, as defined by IEEE 754
- ISO/IEC 10967, Language Independent Arithmetic
- Primitive data type
- Minifloat
- Google Brain

In computing, **floating-point arithmetic** (**FP**) is arithmetic using a formulaic representation of real numbers as an approximation, trading off range against precision. For this reason, floating-point computation is often used in systems that handle very small and very large real numbers and require fast processing times. In general, a floating-point number is represented approximately with a fixed number of significant digits and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the form significand × base^{exponent}, where the significand and exponent are integers and the base is the fixed radix.

**IEEE 754-1985** was an industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.

A **computer number format** is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators. Numerical values are stored as groupings of bits, such as bytes and words. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the encoding used by the computer's instruction set generally requires conversion for external use, such as for printing and display. Different types of processors may have different internal representations of numerical values and different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.

In computing, **NaN**, standing for **Not a Number**, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities.

**Double-precision floating-point format** is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computer science, **denormal numbers** or **denormalized numbers** fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest normal number is *subnormal*.

The **IEEE Standard for Floating-Point Arithmetic** is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

**Hexadecimal floating point** is a format for encoding floating-point numbers first introduced on the IBM System/360 computers, and supported on subsequent machines based on that architecture, as well as machines which were intended to be application-compatible with System/360.

In computing, **minifloats** are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

**Extended precision** refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to *extended precision*, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

**Decimal floating-point** (**DFP**) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary (base-2) fractions.

The IEEE 754-2008 standard includes decimal floating-point number formats in which the significand and the exponent can be encoded in two ways, referred to as **binary encoding** and *decimal encoding*.

**IEEE 754-2008** was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 floating-point standard, while in 2019 it was updated with a minor revision IEEE 754-2019. The 2008 revision extended the previous standard where it was necessary, added decimal arithmetic and formats, tightened up certain areas of the original standard which were left undefined, and merged in IEEE 854.

In computing, **half precision** is a binary floating-point computer number format that occupies 16 bits in computer memory.

In computing, **quadruple precision** is a binary floating point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.

**Single-precision floating-point format** is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

In computing, **decimal32** is a decimal floating-point computer numbering format that occupies 4 bytes (32 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. Like the binary16 format, it is intended for memory saving storage.

In computing, **decimal128** is a decimal floating-point computer numbering format that occupies 16 bytes (128 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations.

In computing, **Microsoft Binary Format** (MBF) is a format for floating-point numbers which was used in Microsoft's BASIC language products, including MBASIC, GW-BASIC and QuickBASIC prior to version 4.00.

In computing, **octuple precision** is a binary floating-point-based computer number format that occupies 32 bytes in computer memory. This 256-bit octuple precision is for applications requiring results in higher than quadruple precision. This format is rarely used and very few environments support it.

- ↑ Teich, Paul (2018-05-10). "Tearing Apart Google's TPU 3.0 AI Coprocessor". *The Next Platform*. Retrieved 2020-08-11. "Google invented its own internal floating point format called “bfloat” for “brain floating point” (after Google Brain)."

- ↑ Wang, Shibo; Kanwar, Pankaj (2019-08-23). "BFloat16: The secret to high performance on Cloud TPUs". *Google Cloud*. Retrieved 2020-08-11. "This custom floating point format is called “Brain Floating Point Format,” or “bfloat16” for short. The name flows from “Google Brain”, which is an artificial intelligence research group at Google where the idea for this format was conceived."

- ↑ Tagliavini, Giuseppe; Mach, Stefan; Rossi, Davide; Marongiu, Andrea; Benini, Luca (2018). "A transprecision floating-point platform for ultra-low power computing". *2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. pp. 1051–1056. arXiv:1711.10374. doi:10.23919/DATE.2018.8342167. ISBN 978-3-9819263-0-9. S2CID 5067903.

- ↑ Cutress, Ian (2020-03-17). "Intel's Cooper Lake Plans: Why is BF16 Important?". Retrieved 2020-05-12. "The bfloat16 standard is a targeted way of representing numbers that give the range of a full 32-bit number, but in the data size of a 16-bit number, keeping the accuracy close to zero but being a bit more loose with the accuracy near the limits of the standard. The bfloat16 standard has a lot of uses inside machine learning algorithms, by offering better accuracy of values inside the algorithm while affording double the data in any given dataset (or doubling the speed in those calculation sections)."

- ↑ Johnson, Khari (2018-05-23). "Intel unveils Nervana Neural Net L-1000 for accelerated AI training". *VentureBeat*. Retrieved 2018-05-23. "...Intel will be extending bfloat16 support across our AI product lines, including Intel Xeon processors and Intel FPGAs."

- ↑ Feldman, Michael (2018-05-23). "Intel Lays Out New Roadmap for AI Portfolio". *TOP500 Supercomputer Sites*. Retrieved 2018-05-23. "Intel plans to support this format across all their AI products, including the Xeon and FPGA lines."

- ↑ Armasu, Lucian (2018-05-23). "Intel To Launch Spring Crest, Its First Neural Network Processor, In 2019". *Tom's Hardware*. Retrieved 2018-05-23. "Intel said that the NNP-L1000 would also support bfloat16, a numerical format that's being adopted by all the ML industry players for neural networks. The company will also support bfloat16 in its FPGAs, Xeons, and other ML products. The Nervana NNP-L1000 is scheduled for release in 2019."

- ↑ "Available TensorFlow Ops | Cloud TPU | Google Cloud". *Google Cloud*. Retrieved 2018-05-23. "This page lists the TensorFlow Python APIs and graph operators available on Cloud TPU."

- ↑ Haußmann, Elmar (2018-04-26). "Comparing Google's TPUv2 against Nvidia's V100 on ResNet-50". *RiseML Blog*. Archived from the original on 2018-04-26. Retrieved 2018-05-23. "For the Cloud TPU, Google recommended we use the bfloat16 implementation from the official TPU repository with TensorFlow 1.7.0. Both the TPU and GPU implementations make use of mixed-precision computation on the respective architecture and store most tensors with half-precision."

- ↑ TensorFlow Authors (2018-07-23). "ResNet-50 using BFloat16 on TPU". *Google*. Retrieved 2018-11-06.

- ↑ Dillon, Joshua V.; Langmore, Ian; Tran, Dustin; Brevdo, Eugene; Vasudevan, Srinivas; Moore, Dave; Patton, Brian; Alemi, Alex; Hoffman, Matt; Saurous, Rif A. (2017-11-28). TensorFlow Distributions (Report). arXiv:1711.10604. Bibcode:2017arXiv171110604D. Accessed 2018-05-23. "All operations in TensorFlow Distributions are numerically stable across half, single, and double floating-point precisions (as TensorFlow dtypes: tf.bfloat16 (truncated floating point), tf.float16, tf.float32, tf.float64). Class constructors have a validate_args flag for numerical asserts."

- ↑ "BFloat16 extensions for Armv8-A". *community.arm.com*. Retrieved 2019-08-30.

- ↑ "ROCm version history". *github.com*. Retrieved 2019-10-23.

- ↑ "CUDA Library bfloat16 Intrinsics".

- ↑ "Livestream Day 1: Stage 8 (Google I/O '18) - YouTube". *Google*. 2018-05-08. Retrieved 2018-05-23. "In many models this is a drop-in replacement for float-32."

This page is based on this Wikipedia article

Text is available under the CC BY-SA 4.0 license; additional terms may apply.

Images, videos and audio are available under their respective licenses.
