Block floating point

Block floating point (BFP) is a method that provides arithmetic approaching floating point while using fixed-point hardware. BFP assigns a single exponent to a group of significands (the non-exponent part of a floating-point number), rather than giving each significand its own exponent. Because the exponent is shared, BFP can perform the same functions as floating-point algorithms with less hardware, and some operations across the values of a block can be carried out with less computation. [1]

The common exponent is determined by the element with the largest magnitude in the block. The number of leading zeros of that element (found with a count-leading-zeros operation) gives the number of left shifts needed to normalize the data to the dynamic range of the processor, which in turn fixes the shared exponent. Some processors can determine this directly, for example with exponent-detection and normalization instructions. [2] [3]
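The procedure can be sketched in Python as follows; the 16-bit word size and function names are illustrative assumptions, not tied to any particular processor or library:

    WORD_BITS = 16  # assumed dynamic range of the fixed-point processor

    def leading_zeros(value, bits=WORD_BITS):
        """Count leading zeros of a non-negative value in a bits-wide word."""
        return bits if value == 0 else bits - value.bit_length()

    def bfp_encode(block):
        """Return (shifted significands, shared exponent) for one block.

        The element with the largest magnitude determines how far every
        element can be shifted left without overflow; that shift count
        becomes the (negated) shared exponent.
        """
        max_mag = max(abs(x) for x in block)
        shift = leading_zeros(max_mag) - 1   # one bit is reserved for the sign
        significands = [x << shift for x in block]
        return significands, -shift          # element i equals significands[i] * 2**exponent

    # Every element in the block is scaled by the same power of two.
    sig, exp = bfp_encode([300, -1200, 75, 4095])
    assert all(s * 2 ** exp == x for s, x in zip(sig, [300, -1200, 75, 4095]))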

Block floating-point algorithms were extensively studied by James Hardy Wilkinson. [4] [5] [6]

BFP can also be implemented in software, although with smaller performance gains.

Microscaling (MX) formats

Microscaling (MX) formats are a type of block floating point (BFP) data format designed specifically for AI and machine-learning workloads. Very small floating-point numbers (minifloats) are used in machine learning for performance, but like fixed-point numbers they suffer from a reduced representable range. A shared exponent extends the range of representable values at very little space and performance cost. [7] [8] The MX format has been endorsed and standardized as a narrow-precision data format for AI by major industry players including AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. [9]

The MX format contains a block of k (usually set to 32) elements, each being d bits long. These elements share a scaling factor of w bits, so that the entire block is w + kd bits in size. Standard MX data types include: [10]

Name           Element data type   d   k    Scale data type   w   Bits per block
MXFP8 (E5M2)   FP8 (E5M2)          8   32   E8M0              8   264
MXFP8 (E4M3)   FP8 (E4M3)          8   32   E8M0              8   264
MXFP6 (E3M2)   FP6 (E3M2)          6   32   E8M0              8   200
MXFP6 (E2M3)   FP6 (E2M3)          6   32   E8M0              8   200
MXFP4          FP4 (E2M1)          4   32   E8M0              8   136
MXINT8         INT8                8   32   E8M0              8   264

Here E8M0 is effectively the exponent field of a single-precision floating-point number, able to represent powers of two between 2^−127 and 2^127. A single encoding is reserved for NaN. For descriptions of data types such as FP8 (E5M2), see Minifloat § Machine learning.
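A minimal, non-normative Python sketch of how a shared E8M0 scale for one MX block might be chosen and applied follows; the OCP specification defines the exact scale-selection and rounding rules, whereas the choice below simply rounds the ratio up to a power of two so that no element overflows the element format, and the per-element rounding itself is omitted:

    import math

    E8M0_BIAS = 127   # E8M0 stores only an exponent: value = 2**(code - 127)
    E8M0_NAN = 0xFF   # the single encoding reserved for NaN

    def e8m0_decode(code):
        """Decode an E8M0 scale code to its power-of-two value."""
        return float("nan") if code == E8M0_NAN else 2.0 ** (code - E8M0_BIAS)

    def shared_scale(block, elem_max):
        """Pick an E8M0 code so the largest block element fits the element format.

        elem_max is the largest finite magnitude of the element data type,
        e.g. 6.0 for FP4 (E2M1) or 448.0 for FP8 (E4M3).
        """
        amax = max(abs(x) for x in block)
        if amax == 0.0:
            return E8M0_BIAS                               # scale of 1.0 for an all-zero block
        exp = math.ceil(math.log2(amax / elem_max))        # round up to a power of two
        return min(max(exp + E8M0_BIAS, 0), 0xFE)          # clamp to valid E8M0 codes

    block = [0.02, -1.5, 0.7, 3.9]
    code = shared_scale(block, elem_max=6.0)               # assuming FP4 (E2M1) elements
    scaled = [x / e8m0_decode(code) for x in block]        # these would then be rounded to FP4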

MX formats have been demonstrated to be effective in a variety of AI tasks, including large language models (LLMs), image classification, speech recognition and recommendation systems. [11] For instance, MXFP6 closely matches FP32 for inference tasks after quantization-aware fine-tuning, and MXFP4 can be used for training generative language models with only a minor accuracy penalty.

The MX format has been standardized through the Open Compute Project (OCP) as Microscaling Formats (MX) Specification v1.0. [9] [10] An emulation library has also been published to provide details of the data-science approach and selected results of MX in action. [12]

Further development

The MXFP4 format groups together 32 4-bit minifloats that have very low dynamic range. To reduce quantization artifacts, Nvidia has introduced NVFP4, which instead groups only 16 FP4 (E2M1) numbers per block and changes the scaling factor to E4M3 for more precision. To regain dynamic range, the blocks of a tensor are additionally subject to a shared FP32 (E8M23) scaling factor, giving a two-level setup. Since no power-of-two (M0-style) scale factors are used, all scaling requires an actual multiplication rather than a bit shift or a simple manipulation of the exponent field of a floating-point value. [13]
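As an illustrative Python sketch of the two-level scaling described above (not NVIDIA's reference implementation; the function and parameter names are hypothetical), dequantization applies both scales by multiplication:

    BLOCK_SIZE = 16   # NVFP4 groups 16 FP4 (E2M1) elements per block

    def nvfp4_dequantize(elements, block_scales, tensor_scale):
        """Reconstruct real values from decoded FP4 elements and the two scales.

        elements     -- FP4 (E2M1) element values, already decoded to floats
        block_scales -- one E4M3 scale (as a float) per 16-element block
        tensor_scale -- the single FP32 scale shared by the whole tensor
        """
        out = []
        for i, x in enumerate(elements):
            block_scale = block_scales[i // BLOCK_SIZE]
            out.append(x * block_scale * tensor_scale)   # two multiplications, no bit shifting
        return out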

Hardware support

Hardware support for BFP exists in two layers: support for the underlying data type of the elements (fixed-point integers or minifloats) and faster implementation of the scaling operation.

BFP with fixed-point elements

BFP with minifloat elements

Parallel handling of minifloat numbers is more complex to emulate in software than handling of packed integers. As a result, hardware support for the underlying minifloat type goes a long way toward providing BFP-with-minifloat support.
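To illustrate the point, the following Python sketch decodes a packed block of FP4 (E2M1) elements entirely in software; the packing order (low nibble first) is an assumption, and a block of packed INT8 elements would need no comparable per-element decoding:

    def decode_fp4_e2m1(nibble):
        """Decode one 4-bit FP4 (E2M1) element: 1 sign, 2 exponent, 1 mantissa bit."""
        sign = -1.0 if nibble & 0b1000 else 1.0
        exp = (nibble >> 1) & 0b11
        man = nibble & 0b1
        if exp == 0:                                    # subnormal: 0 or 0.5
            return sign * man * 0.5
        return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

    def unpack_fp4_block(packed):
        """Unpack two FP4 elements from every byte of a packed block (low nibble first)."""
        values = []
        for byte in packed:
            values.append(decode_fp4_e2m1(byte & 0x0F))
            values.append(decode_fp4_e2m1((byte >> 4) & 0x0F))
        return values

    assert unpack_fp4_block(bytes([0x2F])) == [-6.0, 1.0]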

Other types of BFP

References

  1. "Block floating point". BDTI DSP Dictionary. Berkeley Design Technology, Inc. (BDTI). Archived from the original on 2018-07-11. Retrieved 2015-11-01.
  2. Chhabra, Arun; Iyer, Ramesh (December 1999). "TMS320C55x A Block Floating Point Implementation on the TMS320C54x DSP" (PDF) (Application report). Digital Signal Processing Solutions. Texas Instruments. SPRA610. Archived (PDF) from the original on 2018-07-11. Retrieved 2018-07-11.
  3. Elam, David; Iovescu, Cesar (September 2003). "A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP" (PDF) (Application report). TMS320C5000 Software Applications. Texas Instruments. SPRA948. Archived (PDF) from the original on 2018-07-11. Retrieved 2015-11-01.
  4. Wilkinson, James Hardy (1994) [1st Pub. 1963]. Rounding Errors in Algebraic Processes (1 ed.). Englewood Cliffs, NJ, USA: Prentice-Hall, Inc. ISBN   978-0-486-67999-0. MR   0161456.
  5. Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1 ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN   978-0-8176-4704-9. LCCN   2009939668.
  6. Overton, Michael L. (2001). Numerical Computing with IEEE Floating Point Arithmetic - Including One Theorem, One Rule of Thumb and One Hundred and One Exercises (1 ed.). Society for Industrial and Applied Mathematics (SIAM). ISBN   0-89871-482-6.
  7. Rouhani, Bita Darvish; Zhao, Ritchie; More, Ankit; Hall, Mathew; Khodamoradi, Alireza; Deng, Summer; Choudhary, Dhruv; Cornea, Marius; Dellinger, Eric (2023-10-19). "Microscaling Data Formats for Deep Learning". arXiv: 2310.10537 [cs.LG].
  8. D'Sa, Reynold; Borkar, Rani (2023-10-17). "Fostering AI infrastructure advancements through standardization". Microsoft Azure Blog. Retrieved 2024-06-03.
  9. "AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI". Open Compute Project. Retrieved 2024-06-03.
  10. "OCP Microscaling Formats (MX) Specification Version 1.0". Open Compute Project. Archived from the original on 2024-02-24. Retrieved 2025-02-21.
  11. Rouhani, Bita; Zhao, Ritchie; Elango, Venmugil; Shafipour, Rasoul; Hall, Mathew; Mesmakhosroshahi, Maral; More, Ankit; Melnick, Levi; Golub, Maximilian (2023-04-12). "With Shared Microexponents, A Little Shifting Goes a Long Way". arXiv: 2302.08007 [cs.LG].
  12. microsoft/microxcaling, Microsoft, 2024-05-29, retrieved 2024-06-03
  13. "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference". NVIDIA Technical Blog. 2025-06-24.
  14. Shilov, Anton (2023-09-19). "D-Matrix's Jayhawk II Addresses Edge and Cloud AI Workloads". EE Times.
  15. "Accurate Block Quantization in LLMS with Outliers".
  16. "Tenstorrent AI Accelerators" (PDF).
  17. "Data Formats and Math Fidelity — TT Buda documentation". docs.tenstorrent.com.
  18. Bonshor, Gavin. "AMD Announces The Ryzen AI 300 Series For Mobile: Zen 5 With RDNA 3.5, and XDNA2 NPU With 50 TOPS". www.anandtech.com. Archived from the original on 2024-06-03. Retrieved 2024-06-03.
  19. "AMD Extends AI and High-Performance Leadership in Data Center and PCs with New AMD Instinct, Ryzen and EPYC Processors at Computex 2024". Advanced Micro Devices, Inc. 2024-06-02. Retrieved 2024-06-03.
  20. "BFP16 (Block floating point) Quantization — AMD Quark 0.10 documentation". quark.docs.amd.com.
  21. "Intel Advanced Vector Extensions 10.2 (Intel AVX10.2) Architecture Specification". Intel. 2024-10-16. p. 39. 361050-002US. Retrieved 2024-12-27.
  22. "Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities". arxiv.org.
  23. "Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training". NVIDIA Technical Blog. 2025-07-01.
  24. "An Investigation of FP8 Across Accelerators for LLM Inference".
  25. "Two Level Quantization Formats (MX4, MX6, MX9: shared Microexponents) — AMD Quark 0.10 documentation". quark.docs.amd.com.
  26. "AMD Versal™ AI Edge Series Gen 2".
