Truncated binary encoding

Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet of total size n. It is a slightly more general form of binary encoding when n is not a power of two.

If n is a power of two, then the coded value for 0 ≤ x < n is the simple binary code for x of length log2(n). Otherwise let k = floor(log2(n)), such that 2^k < n < 2^(k+1), and let u = 2^(k+1) − n.

Truncated binary encoding assigns the first u symbols codewords of length k and then assigns the remaining n − u symbols the last n − u codewords of length k + 1. Because all the codewords of length k + 1 consist of an unassigned codeword of length k with a "0" or "1" appended, the resulting code is a prefix code.
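
As an illustration of this construction, the following Python sketch (illustrative only; the function name truncated_binary_codebook is not from the article) builds the full codebook for an alphabet of size n and checks that the result is a prefix code:

# Build the truncated binary codebook for an alphabet of size n (n > 0).
def truncated_binary_codebook(n):
    k = n.bit_length() - 1                 # k = floor(log2(n))
    u = (1 << (k + 1)) - n                 # number of unused (k+1)-bit codewords
    codes = [format(x, f"0{k}b") for x in range(u)]               # first u symbols: k bits
    codes += [format(x + u, f"0{k + 1}b") for x in range(u, n)]   # remaining symbols: k + 1 bits
    return codes

codes = truncated_binary_codebook(5)
assert codes == ["00", "01", "10", "110", "111"]
# Prefix property: no codeword is a prefix of any other.
assert not any(a != b and b.startswith(a) for a in codes for b in codes)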

History

Used since at least 1984, truncated binary encoding is also known as phase-in codes or economy codes. [1] [2] [3]

Example with n = 5

For example, for the alphabet {0, 1, 2, 3, 4}, n = 5 and 2^2 < 5 < 2^3, hence k = 2 and u = 2^3 − 5 = 3. Truncated binary encoding assigns the first u symbols the codewords 00, 01, and 10, all of length 2, then assigns the last n − u symbols the codewords 110 and 111, the last two codewords of length 3.

With n = 5, plain binary encoding and truncated binary encoding allocate the following codewords. Digits shown in parentheses are not transmitted in truncated binary.

Truncated binary    Encoding    Standard binary
0                   (0)00       0
1                   (0)01       1
2                   (0)10       2
UNUSED              011         3
UNUSED              100         4
UNUSED              101         5/UNUSED
3                   110         6/UNUSED
4                   111         7/UNUSED

It takes 3 bits to encode n = 5 symbols using straightforward binary encoding, hence 2^3 − n = 8 − 5 = 3 codewords are unused.

In numerical terms, to send a value x, where 0 ≤ x < n, and where there are 2^k ≤ n < 2^(k+1) symbols, there are u = 2^(k+1) − n unused entries when the alphabet size is rounded up to the nearest power of two. The process to encode the number x in truncated binary is: if x is less than u, encode it in k binary bits; if x is greater than or equal to u, encode the value x + u in k + 1 binary bits.
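
A minimal Python sketch of this rule follows (the function name is illustrative; the pseudocode given under "Simple algorithm" below is the article's own version):

# Encode x (0 <= x < n) in truncated binary; when n is a power of two,
# u equals n and every value simply gets a plain k-bit code.
def truncated_binary_encode(x, n):
    k = n.bit_length() - 1            # k such that 2^k <= n < 2^(k+1)
    u = (1 << (k + 1)) - n            # number of unused codewords
    if x < u:
        return format(x, f"0{k}b")            # short codeword: k bits
    return format(x + u, f"0{k + 1}b")        # long codeword: k + 1 bits

assert truncated_binary_encode(3, 5) == "110"    # matches the n = 5 table above
assert truncated_binary_encode(7, 10) == "1101"  # matches the n = 10 table below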

Example with n = 10

As another example, encoding an alphabet of size 10 (values 0 through 9) requires 4 bits, but there are 2^4 − 10 = 6 unused codewords, so input values less than 6 have the first bit discarded, while input values greater than or equal to 6 are offset by 6 to the end of the binary space. (Unused bit patterns are not shown in this table.)

Input value    Offset    Offset value    Standard binary    Truncated binary
0              0         0               0000               000
1              0         1               0001               001
2              0         2               0010               010
3              0         3               0011               011
4              0         4               0100               100
5              0         5               0101               101
6              6         12              0110               1100
7              6         13              0111               1101
8              6         14              1000               1110
9              6         15              1001               1111

To decode, read the first k bits. If they encode a value less than u, decoding is complete. Otherwise, read an additional bit and subtract u from the result.
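
A matching Python sketch of this decoding rule (illustrative; bits is any iterator over the characters '0' and '1', and the alphabet size n must be known to the decoder):

# Decode one symbol from a bit stream using truncated binary with alphabet size n.
def truncated_binary_decode(bits, n):
    k = n.bit_length() - 1            # k such that 2^k <= n < 2^(k+1)
    u = (1 << (k + 1)) - n            # number of unused codewords
    x = 0
    for _ in range(k):                # read the first k bits
        x = (x << 1) | int(next(bits))
    if x < u:                         # short codeword: decoding is complete
        return x
    x = (x << 1) | int(next(bits))    # otherwise read one more bit ...
    return x - u                      # ... and subtract u

stream = iter("10110")                # the n = 5 codewords for 2 and 3, concatenated
assert truncated_binary_decode(stream, 5) == 2
assert truncated_binary_decode(stream, 5) == 3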

Example with n = 7

Here is a more extreme case: with n = 7 the next power of 2 is 8, so k = 2 and u = 2^3 − 7 = 1:

Input value    Offset    Offset value    Standard binary    Truncated binary
0              0         0               000                00
1              1         2               001                010
2              1         3               010                011
3              1         4               011                100
4              1         5               100                101
5              1         6               101                110
6              1         7               110                111

This last example demonstrates that a leading zero bit does not always indicate a short code; if u < 2^(k−1), some long codes will begin with a zero bit.

Simple algorithm

Generate the truncated binary encoding for a value x, 0 ≤ x < n, where n > 0 is the size of the alphabet containing x. n need not be a power of two.

string TruncatedBinary(int x, int n)
{
    // Set k = floor(log2(n)), i.e., k such that 2^k <= n < 2^(k+1).
    int k = 0, t = n;
    while (t > 1) {
        k++;
        t >>= 1;
    }
    // Set u to the number of unused codewords = 2^(k+1) - n.
    int u = (1 << (k + 1)) - n;
    if (x < u)
        return Binary(x, k);
    else
        return Binary(x + u, k + 1);
}

The routine Binary is expository; usually just the rightmost len bits of the variable x are desired. Here we simply output the binary code for x using len bits, padding with high-order 0s if necessary.

string Binary(int x, int len)
{
    // Build the binary representation of x from the low-order bit up;
    // even(x) is true when the low-order bit of x is 0.
    string s = "";
    while (x != 0) {
        if (even(x))
            s = '0' + s;
        else
            s = '1' + s;
        x >>= 1;
    }
    // Pad with high-order 0s to the requested length.
    while (s.Length < len)
        s = '0' + s;
    return s;
}
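
As a check against the n = 5 example above, TruncatedBinary(0, 5) through TruncatedBinary(4, 5) return "00", "01", "10", "110" and "111" respectively.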

On efficiency

If n is not a power of two, and k-bit symbols are observed with probability p, then (k + 1)-bit symbols are observed with probability 1 − p. We can calculate the expected number of bits per symbol as

    p·k + (1 − p)·(k + 1) = k + 1 − p.

Raw encoding of the symbol has k + 1 bits. Then the relative space saving s (see Data compression ratio) of the encoding can be defined as

    s = 1 − (k + 1 − p) / (k + 1).

When simplified, this expression leads to

    s = p / (k + 1).

This indicates that the relative efficiency of truncated binary encoding increases as the probability p of k-bit symbols increases and as the raw-encoding symbol bit-length k + 1 decreases.
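
For example, with n = 5 and a uniformly distributed input, p = 3/5 of the symbols receive k = 2-bit codewords, so the expected length is 3 − 0.6 = 2.4 bits per symbol against 3 bits for plain binary, a relative saving of s = 0.6/3 = 20%.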

References

  1. Eastman, Willard L., et al. (Aug. 1984). Apparatus and Method for Compressing Data Signals and Restoring the Compressed Data Signals, US Patent 4,464,650.
  2. Acharya, Tinku and JáJá, Joseph F. (Oct. 1996). An on-line variable-length binary encoding of text, Information Sciences, vol. 94, no. 1-4, pp. 1-22.
  3. Job van der Zwan. "Phase-in Codes".