Variable-length quantity

A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. VLQ is identical to LEB128 except in endianness. See the example below.

Applications and history

Base-128 compression is known by many names: VB (Variable Byte), VByte, Varint, VInt, EncInt, etc. [1]

A variable-length quantity (VLQ) was defined for use in the standard MIDI file format [2] to save additional space for a resource-constrained system, and is also used in the later Extensible Music Format (XMF).

Base-128 is also used in ASN.1 BER encoding to encode tag numbers and object identifiers. [3] It is also used in the WAP environment, where it is called variable length unsigned integer or uintvar. The DWARF debugging format [4] defines a variant called LEB128 (or ULEB128 for unsigned numbers), where the least significant group of 7 bits is encoded in the first byte, and the most significant bits are in the last byte (so effectively it is the little-endian analog of a VLQ). Google Protocol Buffers use a similar format to have compact representation of integer values, [5] as does Oracle Portable Object Format (POF) [6] and the Microsoft .NET Framework "7-bit encoded int" in the BinaryReader and BinaryWriter classes. [7]

It is also used extensively in web browsers for source maps, which contain many integer line and column number mappings, to keep the size of the map to a minimum. [8]

Variable-width integers in LLVM use a similar principle. The encoding chunks are little-endian and need not be 8 bits in size. The LLVM documentation describes a field that uses 4-bit chunks, with each chunk consisting of 1 continuation bit and 3 payload bits. [9]
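The same principle can be sketched for an arbitrary chunk width: each chunk keeps its top bit as a continuation flag and the remaining bits as payload, with chunks emitted least significant first. The following is a minimal Python illustration under that assumption; encode_chunked_le is a hypothetical name and is not part of LLVM's bitstream API.

```python
def encode_chunked_le(value: int, chunk_bits: int = 4) -> list[int]:
    """Split an unsigned integer into little-endian chunks of chunk_bits each,
    where the top bit of a chunk flags continuation and the remaining
    chunk_bits - 1 bits carry payload (1 + 3 bits for 4-bit chunks)."""
    payload_bits = chunk_bits - 1
    mask = (1 << payload_bits) - 1
    chunks = []
    while True:
        chunk = value & mask
        value >>= payload_bits
        if value:
            chunk |= 1 << payload_bits   # more chunks follow
        chunks.append(chunk)
        if not value:
            return chunks


# 27 = 0b11011 splits into payload groups 011, 011 (least significant first),
# giving the chunks 1011 (continuation set) and 0011 (final chunk).
assert encode_chunked_le(27) == [0b1011, 0b0011]
```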

Benefits

  1. Compactness: One of the primary benefits of VLQ encoding is its compactness. Since it uses a variable number of bytes to encode an integer, smaller integers can be represented using fewer bytes, resulting in a smaller overall file size. This is particularly useful in scenarios where storage space is at a premium, such as in embedded systems or mobile devices.
  2. Efficiency: VLQ encoding is an efficient way to store and transmit data. Since smaller integers are represented using fewer bytes, this reduces the amount of data that needs to be transmitted, which in turn reduces the time and bandwidth required to transmit the data.
  3. Flexibility: Another advantage of VLQ encoding is its flexibility. Since the number of bytes used to represent an integer is based on its magnitude, VLQ encoding can handle integers of different sizes. This means that VLQ encoding can be used to represent integers of any size, from small 8-bit integers to large 64-bit integers.

General structure

The encoding assumes an octet (an 8-bit byte) where the most significant bit (MSB), also commonly known as the sign bit, is reserved to indicate whether another VLQ octet follows.

VLQ octet
  Bit      7      6 5 4 3 2 1 0
  Weight   2^7    2^6 ... 2^0
  Field    A      Bn

If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.

B is a 7-bit number [0x00, 0x7F], and n is the position of the VLQ octet where B0 is the least significant. The VLQ octets are arranged most significant first in a stream.
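To make the layout concrete, here is a minimal Python sketch of an encoder and decoder for this basic unsigned, most-significant-group-first scheme; the names encode_vlq and decode_vlq are illustrative and not taken from any particular library.

```python
def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as a VLQ, most significant group first."""
    if value < 0:
        raise ValueError("the basic VLQ form is defined for unsigned integers only")
    groups = [value & 0x7F]          # collect 7-bit groups, least significant first
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()                 # emit most significant group first
    # Set the continuation bit (0x80) on every octet except the last one.
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])


def decode_vlq(data: bytes) -> int:
    """Decode a single VLQ value from the start of a byte sequence."""
    value = 0
    for octet in data:
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:         # continuation bit clear: this was the last octet
            break
    return value


assert encode_vlq(137) == bytes([0x81, 0x09])
assert decode_vlq(bytes([0xFF, 0x7F])) == 16383
```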

Variants

The general VLQ encoding is simple, but in basic form is only defined for unsigned integers (nonnegative, positive or zero), and is somewhat redundant, since prepending 0x80 octets corresponds to zero padding. There are various signed number representations to handle negative numbers, and techniques to remove the redundancy.

Group Varint Encoding

Google developed Group Varint Encoding (GVE) after observing that traditional VLQ encoding incurs many CPU branches during decompression. GVE uses a single byte as a header for four variable-length uint32 values. The header byte holds four 2-bit numbers giving the storage length of each of the following four uint32s. Such a layout eliminates the need to check and remove VLQ continuation bits; data bytes can be copied directly to their destination. This reduces CPU branches, making GVE faster than VLQ on modern pipelined CPUs. [10]
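The following Python sketch illustrates this layout for one group of four values; the function names are hypothetical, and details such as the exact bit order of the header are assumptions made for illustration rather than a statement of Google's implementation.

```python
import struct


def gve_encode_group(values: list[int]) -> bytes:
    """Pack exactly four uint32 values: one header byte holding four 2-bit
    length codes (stored length minus one), followed by the value bytes."""
    assert len(values) == 4
    header = 0
    body = bytearray()
    for i, v in enumerate(values):
        raw = struct.pack("<I", v)                  # little-endian uint32
        length = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes actually needed
        header |= (length - 1) << (2 * i)
        body += raw[:length]
    return bytes([header]) + bytes(body)


def gve_decode_group(data: bytes) -> list[int]:
    """Unpack one group produced by gve_encode_group; no per-byte branching
    on continuation bits is needed, only the four lengths from the header."""
    header, pos, out = data[0], 1, []
    for i in range(4):
        length = ((header >> (2 * i)) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return out


assert gve_decode_group(gve_encode_group([1, 300, 70000, 0])) == [1, 300, 70000, 0]
```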

PrefixVarint is a similar design but with a uint64 maximum. It is said to have "been invented multiple times independently". [11] It can also be extended into a chained version with arbitrarily many continuations.

Signed numbers

Sign bit

Negative numbers can be handled using a sign bit, which only needs to be present in the first octet.

In the data format for Unreal Packages used by the Unreal Engine, a variable-length quantity scheme called Compact Indices [12] is used. The only difference in this encoding is that the first VLQ octet reserves its sixth bit to indicate whether the encoded integer is positive or negative. Any subsequent VLQ octets follow the general structure.

Unreal Signed VLQ
             First VLQ octet               Other VLQ octets
  Bit        7      6      5 4 3 2 1 0     7      6 5 4 3 2 1 0
  Weight     2^7    2^6    2^5 ... 2^0     2^7    2^6 ... 2^0
  Field      A      B      C0              A      Cn (n > 0)

If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.

If B is 0, then the VLQ represents a positive integer. If B is 1, then the VLQ represents a negative number.

C is the number chunk being encoded, and n is the position of the VLQ octet where C0 is the least significant. The VLQ octets are arranged least significant first in a stream.
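A minimal Python sketch of an encoder for this layout is shown below; encode_compact_index is an illustrative name and the code is not taken from the Unreal Engine sources.

```python
def encode_compact_index(value: int) -> bytes:
    """Encode a signed integer with a sign bit in the first octet:
    first octet  = continuation bit, sign bit, 6 payload bits (C0);
    later octets = continuation bit, 7 payload bits (Cn), least significant first."""
    sign = 0x40 if value < 0 else 0
    magnitude = abs(value)
    first = (magnitude & 0x3F) | sign
    magnitude >>= 6
    if magnitude:
        first |= 0x80                    # more octets follow
    out = bytearray([first])
    while magnitude:
        octet = magnitude & 0x7F
        magnitude >>= 7
        if magnitude:
            octet |= 0x80
        out.append(octet)
    return bytes(out)


# -100: magnitude 100 = 0b1100100 -> low 6 bits 100100, remaining bits 1
assert encode_compact_index(-100) == bytes([0b11100100, 0b00000001])
assert encode_compact_index(37) == bytes([0b00100101])
```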

Zigzag encoding

An alternative way to encode negative numbers is to use the least significant bit for sign. This is notably done for Google Protocol Buffers, and is known as a zigzag encoding for signed integers. [13] One can encode the numbers so that the encoded value 0 corresponds to 0, 1 to −1, 10 to 1, 11 to −2, 100 to 2, and so on: counting up through the encoded values alternates between nonnegative (starting at 0) and negative (since each step changes the least significant bit, and hence the sign), whence the name "zigzag encoding". Concretely, a fixed k-bit integer n is transformed as (n << 1) ^ (n >> (k − 1)), where >> is an arithmetic (sign-extending) shift.
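Below is a small Python sketch of the transform and its inverse, assuming Python's arbitrary-precision integers with arithmetic right shift (the mask emulates a fixed k-bit result); the function names are illustrative.

```python
def zigzag_encode(n: int, k: int = 32) -> int:
    """Map a signed k-bit integer onto an unsigned one: 0->0, -1->1, 1->2, -2->3, ..."""
    # n >> (k - 1) is an arithmetic shift: all ones for negative n, zero otherwise.
    return ((n << 1) ^ (n >> (k - 1))) & ((1 << k) - 1)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))
```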

Two's complement

LEB128 uses two's complement to represent signed numbers. In this scheme of representation, n bits encode a range from −2^(n−1) to 2^(n−1) − 1, and all negative numbers start with a 1 in the most significant bit. In Signed LEB128, the input is sign-extended so that its length is a multiple of 7 bits. From there the encoding proceeds as usual. [14]

In LEB128, the stream is arranged least significant first. [14]
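Here is a minimal Python sketch of a signed LEB128 encoder under these rules, relying on Python's arithmetic right shift for the sign extension; encode_sleb128 is an illustrative name rather than a library function.

```python
def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as LEB128: 7-bit groups, least significant first,
    stopping once the remaining bits are pure sign extension of what was emitted."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                      # arithmetic shift preserves the sign
        done = (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


assert encode_sleb128(-2) == bytes([0x7E])
assert encode_sleb128(127) == bytes([0xFF, 0x00])
assert encode_sleb128(-128) == bytes([0x80, 0x7F])
```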

Removing redundancy

With the VLQ encoding described above, any number that can be encoded with N octets can also be encoded with more than N octets simply by prepending additional 0x80 octets as zero-padding. For example, the decimal number 358 can be encoded as the 2-octet VLQ 0x8266, or the number 0358 can be encoded as 3-octet VLQ 0x808266, or 00358 as the 4-octet VLQ 0x80808266 and so forth.

However, the VLQ format used in Git [15] removes this prepending redundancy and extends the representable range of shorter VLQs by adding an offset to VLQs of 2 or more octets in such a way that the lowest possible value for such an (N + 1)-octet VLQ becomes exactly one more than the maximum possible value for an N-octet VLQ. In particular, since a 1-octet VLQ can store a maximum value of 127, the minimum 2-octet VLQ (0x8000) is assigned the value 128 instead of 0. Conversely, the maximum value of such a 2-octet VLQ (0xFF7F) is 16511 instead of just 16383. Similarly, the minimum 3-octet VLQ (0x808000) has a value of 16512 instead of zero, which means that the maximum 3-octet VLQ (0xFFFF7F) is 2113663 instead of just 2097151.

In this way, there is one and only one encoding of each integer, making this a base-128 bijective numeration.
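The arithmetic can be sketched as follows in Python; Git's own implementation is in C, so the function names here are illustrative only.

```python
def encode_offset_vlq(value: int) -> bytes:
    """Encode with the redundancy-free offset scheme: every octet added in
    front implicitly subtracts one from the remaining high-order value,
    so each integer has exactly one representation."""
    out = bytearray([value & 0x7F])      # last octet, continuation bit clear
    value >>= 7
    while value:
        value -= 1                       # remove the implicit offset
        out.insert(0, 0x80 | (value & 0x7F))
        value >>= 7
    return bytes(out)


def decode_offset_vlq(data: bytes) -> int:
    """Decode a single offset-VLQ value from the start of a byte sequence."""
    it = iter(data)
    octet = next(it)
    value = octet & 0x7F
    while octet & 0x80:
        octet = next(it)
        value = ((value + 1) << 7) | (octet & 0x7F)
    return value


assert decode_offset_vlq(bytes([0x80, 0x00])) == 128      # not 0 as in plain VLQ
assert encode_offset_vlq(16511) == bytes([0xFF, 0x7F])    # maximum 2-octet value
assert encode_offset_vlq(16512) == bytes([0x80, 0x80, 0x00])
```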

Examples

[Diagram: converting 106903 from decimal to uintvar representation]

Here is a worked-out example for the decimal number 137:

  1. Represent the value in binary notation: 137 = 10001001.
  2. Break it into groups of 7 bits, starting from the least significant bit: 0000001 0001001.
  3. The lowest 7 bits form the final octet, 00001001; every earlier group has its most significant bit set to 1, giving 10000001.

The result is the two-octet VLQ 0x81 0x09, with 0x81 transmitted first.

Another way to look at this is to represent the value in base-128 and then set the MSB of all but the last base-128 digit to 1.
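As a quick cross-check, the illustrative encode_vlq sketch from the General structure section reproduces this result:

```python
assert encode_vlq(137) == bytes([0x81, 0x09])         # 137 -> 81 09
assert (0x81 & 0x7F) * 128 + (0x09 & 0x7F) == 137     # base-128 digits 1 and 9
```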

The Standard MIDI File format specification gives more examples: [2] [16]

Integer (decimal)   Integer (binary)                      Variable-length quantity (binary)     Integer (hexadecimal)   Variable-length quantity (hexadecimal)
0                   00000000 00000000 00000000 00000000   00000000                              00000000                00
127                 00000000 00000000 00000000 01111111   01111111                              0000007F                7F
128                 00000000 00000000 00000000 10000000   10000001 00000000                     00000080                81 00
8192                00000000 00000000 00100000 00000000   11000000 00000000                     00002000                C0 00
16383               00000000 00000000 00111111 11111111   11111111 01111111                     00003FFF                FF 7F
16384               00000000 00000000 01000000 00000000   10000001 10000000 00000000            00004000                81 80 00
2097151             00000000 00011111 11111111 11111111   11111111 11111111 01111111            001FFFFF                FF FF 7F
2097152             00000000 00100000 00000000 00000000   10000001 10000000 10000000 00000000   00200000                81 80 80 00
134217728           00001000 00000000 00000000 00000000   11000000 10000000 10000000 00000000   08000000                C0 80 80 00
268435455           00001111 11111111 11111111 11111111   11111111 11111111 11111111 01111111   0FFFFFFF                FF FF FF 7F

References

  1. Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression". 2017. doi:10.1145/3035918.3064007. Archived 2019-12-07 at the Wayback Machine.
  2. MIDI File Format: Variable Quantities.
  3. "ITU-T Recommendation X.690" (PDF). International Telecommunication Union. 2002.
  4. DWARF Standard.
  5. Google Protocol Buffers.
  6. Oracle Portable Object Format (POF).
  7. System.IO.BinaryWriter.Write7BitEncodedInt(int) method and System.IO.BinaryReader.Read7BitEncodedInt() method.
  8. Introduction to javascript source maps.
  9. "LLVM Bitcode File Format", section "Variable Width Integers". Accessed 2019-10-01.
  10. Jeff Dean. "Challenges in Building Large-Scale Information Retrieval Systems" (PDF). p. 58. Retrieved 2020-05-30.
  11. Olesen, Jakob Stoklund (31 May 2020). "stoklund/varint". Archived from the original on 19 November 2020. Retrieved 9 July 2020.
  12. "Unreal Packages". 1999-07-21. Archived from the original on 2010-08-20. Retrieved 2021-08-29.
  13. Protocol Buffers: Encoding: Signed Integers.
  14. Free Standards Group (December 2005). "DWARF Debugging Information Format Specification Version 3.0" (PDF). p. 70. Retrieved 2009-07-19.
  15. "Git – fast, scalable, distributed revision control system". 28 October 2021.
  16. Standard MIDI-File Format Spec. 1.1.