LEB128

Last updated

LEB128 or Little Endian Base 128 is a variable-length code compression used to store arbitrarily large integers in a small number of bytes. LEB128 is used in the DWARF debug file format [1] [2] and the WebAssembly binary encoding for all integer literals. [3]

Contents

Encoding format

LEB128 format is very similar to variable-length quantity (VLQ) format; the primary difference is that LEB128 is little-endian whereas variable-length quantities are big-endian. Both allow small numbers to be stored in a single byte, while also allowing encoding of arbitrarily long numbers. There are two versions of LEB128: unsigned LEB128 and signed LEB128. The decoder must know whether the encoded value is unsigned LEB128 or signed LEB128.

Unsigned LEB128

To encode an unsigned number using unsigned LEB128 (ULEB128) first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that if the number is non-zero, the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group. Each byte will have the group in its 7 least significant bits. Set the most significant bit on each byte except the last byte. The number zero is encoded as a single byte 0x00.

As an example, here is how the unsigned number 624485 gets encoded:

MSB ------------------ LSB       10011000011101100101  In raw binary      010011000011101100101  Padded to a multiple of 7 bits  0100110  0001110  1100101  Split into 7-bit groups 00100110 10001110 11100101  Add high 1 bits on all but last (most significant) group to form bytes     0x26     0x8E     0xE5  In hexadecimal  → 0xE5 0x8E 0x26            Output stream (LSB to MSB)

Unsigned LEB128 and VLQ (variable-length quantity) both compress any given integer into not only the same number of bits, but exactly the same bits—the two formats differ only in exactly how those bits are arranged.

Signed LEB128

A signed number is represented similarly: Starting with an -bit two's complement representation, where is a multiple of 7, the number is broken into groups as for the unsigned encoding.

For example, the signed number -123456 is encoded as 0xC0 0xBB 0x78:

MSB ------------------ LSB          11110001001000000  Binary encoding of 123456      000011110001001000000  As a 21-bit number      111100001110110111111  Negating all bits (ones' complement)      111100001110111000000  Adding one (two's complement)  1111000  0111011  1000000  Split into 7-bit groups 01111000 10111011 11000000  Add high 1 bits on all but last (most significant) group to form bytes     0x78     0xBB     0xC0  In hexadecimal  → 0xC0 0xBB 0x78            Output stream (LSB to MSB)

Fast decoding

A straightforward scalar implementation of LEB128 decoding is fairly slow, even more so on modern hardware where branch misprediction is relatively expensive. A series of papers presents SIMD techniques for accelerating decoding (it is called VByte in these papers, but is another name for the same encoding). The "Vectorized VByte Decoding" paper [4] presented "Masked VByte", which demonstrated speeds of 650–2700 million integers per second on commodity Haswell hardware, depending on encoding density. A followup paper presented a variant encoding, "Stream VByte: Faster Byte Oriented Integer Compression", [5] which increased speeds to over 4 billion integers per second. This stream encoding separates the control stream from the encoded data, so is not binary compatible with LEB128.

C-like pseudocode

Encode unsigned integer

do{byte=low-order7bitsofvalue;value>>=7;if(value!=0)/* more bytes to come */sethigh-orderbitofbyte;emitbyte;}while(value!=0);

Encode signed integer

more=1;negative=(value<0);/* the size in bits of the variable value, e.g., 64 if value's type is int64_t */size=no.ofbitsinsignedinteger;while(more){byte=low-order7bitsofvalue;value>>=7;/* the following is only necessary if the implementation of >>= uses a     logical shift rather than an arithmetic shift for a signed left operand */if(negative)value|=(~0<<(size-7));/* sign extend *//* sign bit of byte is second high-order bit (0x40) */if((value==0&&signbitofbyteisclear)||(value==-1&&signbitofbyteisset))more=0;elsesethigh-orderbitofbyte;emitbyte;}

Decode unsigned integer

result=0;shift=0;while(true){byte=nextbyteininput;result|=(low-order7bitsofbyte)<<shift;if(high-orderbitofbyte==0)break;shift+=7;}

Decode signed integer

result=0;shift=0;/* the size in bits of the result variable, e.g., 64 if result's type is int64_t */size=numberofbitsinsignedinteger;do{byte=nextbyteininput;result|=(low-order7bitsofbyte<<shift);shift+=7;}while(high-orderbitofbyte!=0);/* sign bit of byte is second high-order bit (0x40) */if((shift<size)&&(signbitofbyteisset))/* sign extend */result|=(~0<<shift);

JavaScript code

Encode signed 32-bit integer

constencodeSignedLeb128FromInt32=(value)=>{value|=0;constresult=[];while(true){constbyte_=value&0x7f;value>>=7;if((value===0&&(byte_&0x40)===0)||(value===-1&&(byte_&0x40)!==0)){result.push(byte_);returnresult;}result.push(byte_|0x80);}};

Decode signed 32-bit integer

constdecodeSignedLeb128=(input)=>{letresult=0;letshift=0;while(true){constbyte=input.shift();result|=(byte&0x7f)<<shift;shift+=7;if((0x80&byte)===0){if(shift<32&&(byte&0x40)!==0){returnresult|(~0<<shift);}returnresult;}}};

Uses

Related Research Articles

In computer science, an integer is a datum of integral data type, a data type that represents some range of mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values. Integers are commonly represented in a computer as a group of binary digits (bits). The size of the grouping varies so the set of integer sizes available varies between different types of computers. Computer hardware nearly always provides a way to represent a processor register or memory address as an integer.

<span class="mw-page-title-main">UTF-16</span> Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as "UCS-2" (for 2-byte Universal Character Set), once it became clear that more than 216 (65,536) code points were needed, including most emoji and important CJK characters such as for personal and place names.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

<span class="mw-page-title-main">G.711</span> ITU-T recommendation

G.711 is a narrowband audio codec originally designed for use in telephony that provides toll-quality audio at 64 kbit/s. It is an ITU-T standard (Recommendation) for audio encoding, titled Pulse code modulation (PCM) of voice frequencies released for use in 1972.

In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operations and directly supported by the processor. Most bitwise operations are presented as two-operand instructions where the result replaces one of the input operands.

Two's complement is the most common method of representing signed integers on computers, and more generally, fixed point binary values. Two's complement uses the binary digit with the greatest place value as the sign to indicate whether the binary number is positive or negative. When the most significant bit is 1, the number is signed as negative; and when the most significant bit is 0 the number is signed as positive.

In computer programming, a magic number is any of the following:

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

External Data Representation (XDR) is a standard data serialization format, for uses such as computer network protocols. It allows data to be transferred between different kinds of computer systems. Converting from the local representation to XDR is called encoding. Converting from XDR to the local representation is called decoding. XDR is implemented as a software library of functions which is portable between different operating systems and is also independent of the transport layer.

A Java class file is a file containing Java bytecode that can be executed on the Java Virtual Machine (JVM). A Java class file is usually produced by a Java compiler from Java programming language source files containing Java classes. If a source file has more than one class, each class is compiled into a separate class file.

In computer science, a logical shift is a bitwise operation that shifts all the bits of its operand. The two base variants are the logical left shift and the logical right shift. This is further modulated by the number of bit positions a given value shall be shifted, such as shift left by 1 or shift right by n. Unlike an arithmetic shift, a logical shift does not preserve a number's sign bit or distinguish a number's exponent from its significand (mantissa); every bit in the operand is simply moved a given number of bit positions, and the vacant bit-positions are filled, usually with zeros, and possibly ones.

MIDI Machine Control, or MMC, a subset of the MIDI specification, provides specific commands for controlling recording equipment such as multi-track recorders. MMC messages can be sent along a standard MIDI cable for remote control of such functions as Play, Fast Forward, Rewind, Stop, Pause, and Record. These are "System Exclusive" (SysEx) messages, specifically Real Time Universal SysEx messages.

<span class="mw-page-title-main">SREC (file format)</span> File format developed by Motorola

Motorola S-record is a file format, created by Motorola in the mid-1970s, that conveys binary information as hex values in ASCII text form. This file format may also be known as SRECORD, SREC, S19, S28, S37. It is commonly used for programming flash memory in microcontrollers, EPROMs, EEPROMs, and other types of programmable logic devices. In a typical application, a compiler or assembler converts a program's source code to machine code and outputs it into a HEX file. The HEX file is then imported by a programmer to "burn" the machine code into non-volatile memory, or is transferred to the target system for loading and execution.

<span class="mw-page-title-main">ADX (file format)</span> File format family developed by CRI Middleware

CRI ADX is a proprietary audio container and compression format developed by CRI Middleware specifically for use in video games; it is derived from ADPCM but with lossy compression. Its most notable feature is a looping function that has proved useful for background sounds in various games that have adopted the format, including many games for the Sega Dreamcast as well as some PlayStation 2, GameCube and Wii games. One of the first games to use ADX was Burning Rangers, on the Sega Saturn. Notably, the Sonic the Hedgehog series since the Dreamcast generation and the majority of Sega games for home video consoles and PCs since the Dreamcast continue to use this format for sound and voice recordings. Jet Set Radio Future for original Xbox also used this format.

In computing, bit numbering is the convention used to identify the bit positions in a binary number.

A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. VLQ is identical to LEB128 except in endianness. See the example below.

<span class="mw-page-title-main">Computation of cyclic redundancy checks</span> Overview of the computation of cyclic redundancy checks

Computation of a cyclic redundancy check is derived from the mathematics of polynomial division, modulo two. In practice, it resembles long division of the binary message string, with a fixed number of zeroes appended, by the "generator polynomial" string except that exclusive or operations replace subtractions. Division of this type is efficiently realised in hardware by a modified shift register, and in software by a series of equivalent algorithms, starting with simple code close to the mathematics and becoming faster through byte-wise parallelism and space–time tradeoffs.

In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes (32 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. Like the binary16 format, it is intended for memory saving storage.

In the C programming language, operations can be performed on a bit level using bitwise operators.

The Esri TIN format is a popular yet proprietary geospatial vector data format for geographic information system (GIS) software for storing elevation data as a triangulated irregular network. It is developed and regulated by Esri, US. The Esri TIN format can spatially describe elevation information including breaking edge features. Each points and triangle can carry a tag information. A TIN stored in this file format can have any shape, cover multiple regions and contain holes.

References

  1. UNIX International (July 1993), "7.8", DWARF Debugging Information Format Specification Version 2.0, Draft (PDF), retrieved 2009-07-19
  2. 1 2 Free Standards Group (December 2005). "DWARF Debugging Information Format Specification Version 3.0" (PDF). p. 70. Retrieved 2009-07-19.
  3. 1 2 WebAssembly Community Group (2020-11-12). "Values — Binary Format — WebAssembly 1.1" . Retrieved 2020-12-31.
  4. Plaisance, Jeff; Kurz, Nathan; Lemire, Daniel (2015). "Vectorized VByte Decoding". arXiv: 1503.07387 .{{cite journal}}: Cite journal requires |journal= (help)
  5. Lemire, Daniel; Kurz, Nathan; Rupp, Christoph (February 1028). "Stream VByte: Faster Byte-Oriented Integer Compression". Information Processing Letters. 130. First International Symposium on Web Algorithms (published June 2015): 1–6. arXiv: 1709.08990 . doi:10.1016/j.ipl.2017.09.011. S2CID   8265597.
  6. "Dalvik Executable Format" . Retrieved 2021-05-18.
  7. Christophe de Dinechin (October 2000). "C++ Exception Handling for IA-64" . Retrieved 2009-07-19.
  8. LLVM Project (2016). "LLVM Code Coverage Mapping Format" . Retrieved 2016-10-20.
  9. LLVM Project (2019). "LLVM LEB128 encoding and decoding" . Retrieved 2019-11-02.
  10. System.IO.BinaryWriter.Write7BitEncodedInt(int) method and System.IO.BinaryReader.Read7BitEncodedInt() method.
  11. "Minecraft Modern Varint & Varlong". wiki.vg. 2020. Retrieved 2020-11-29.
  12. "MPatrol documentation". December 2008. Retrieved 2009-07-19.
  13. "Osr (file format) - osu!wiki". osu.ppy.sh. Retrieved 2017-03-18.
  14. "Efficient XML Interchange (EXI) Format 1.0". www.w3.org (Second ed.). World Wide Web Consortium. 2014-02-11. Retrieved 2020-12-31.
  15. "The .xz File Format". tukaani.org. 2009. Retrieved 2017-10-30.
  16. "LLVM Bitcode File Format — LLVM 13 documentation".

See also