Truncated binary encoding

Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet of total size n. It is a slightly more general form of binary encoding when n is not a power of two.

If n is a power of two, then the coded value for 0 ≤ x < n is the simple binary code for x of length log2(n). Otherwise let k = floor(log2(n)), such that 2^k < n < 2^(k+1), and let u = 2^(k+1) − n.

Truncated binary encoding assigns the first u symbols codewords of length k and then assigns the remaining n − u symbols the last n − u codewords of length k + 1. Because all the codewords of length k + 1 consist of an unassigned codeword of length k with a "0" or "1" appended, the resulting code is a prefix code.
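
As an illustration of this construction, the following Python sketch (illustrative only; the function name truncated_binary_codebook is not from the article) builds the full codebook for an alphabet of size n and checks that the result is a prefix code:

# Build the truncated binary codebook for an alphabet of size n (n > 0).
def truncated_binary_codebook(n):
    k = n.bit_length() - 1                 # k = floor(log2(n))
    u = (1 << (k + 1)) - n                 # number of unused (k+1)-bit codewords
    codes = [format(x, f"0{k}b") for x in range(u)]               # first u symbols: k bits
    codes += [format(x + u, f"0{k + 1}b") for x in range(u, n)]   # remaining symbols: k + 1 bits
    return codes

codes = truncated_binary_codebook(5)
assert codes == ["00", "01", "10", "110", "111"]
# Prefix property: no codeword is a prefix of any other.
assert not any(a != b and b.startswith(a) for a in codes for b in codes)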

History

Used since at least 1984, truncated binary encoding is also known as phase-in codes or economy codes. [1] [2] [3]

Example with n = 5

For example, for the alphabet {0, 1, 2, 3, 4}, n = 5 and 2^2 < 5 < 2^3, hence k = 2 and u = 2^3 − 5 = 3. Truncated binary encoding assigns the first u symbols the codewords 00, 01, and 10, all of length 2, then assigns the last n − u symbols the codewords 110 and 111, the last two codewords of length 3.

With n = 5, plain binary encoding and truncated binary encoding allocate the following codewords. Digits shown in parentheses are not transmitted in truncated binary.

Truncated binary    Encoding    Standard binary
0                   (0)00       0
1                   (0)01       1
2                   (0)10       2
UNUSED              011         3
UNUSED              100         4
UNUSED              101         5/UNUSED
3                   110         6/UNUSED
4                   111         7/UNUSED

It takes 3 bits to encode n = 5 symbols using straightforward binary encoding, hence 2^3 − n = 8 − 5 = 3 codewords are unused.

In numerical terms, to send a value x, where 0 ≤ x < n, and where there are 2^k ≤ n < 2^(k+1) symbols, there are u = 2^(k+1) − n unused entries when the alphabet size is rounded up to the nearest power of two. The process to encode the number x in truncated binary is: if x is less than u, encode it in k binary bits; if x is greater than or equal to u, encode the value x + u in k + 1 binary bits.
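
A minimal Python sketch of this rule follows (the function name is illustrative; the pseudocode given under "Simple algorithm" below is the article's own version):

# Encode x (0 <= x < n) in truncated binary; when n is a power of two,
# u equals n and every value simply gets a plain k-bit code.
def truncated_binary_encode(x, n):
    k = n.bit_length() - 1            # k such that 2^k <= n < 2^(k+1)
    u = (1 << (k + 1)) - n            # number of unused codewords
    if x < u:
        return format(x, f"0{k}b")            # short codeword: k bits
    return format(x + u, f"0{k + 1}b")        # long codeword: k + 1 bits

assert truncated_binary_encode(3, 5) == "110"    # matches the n = 5 table above
assert truncated_binary_encode(7, 10) == "1101"  # matches the n = 10 table below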

Example with n = 10

As another example, encoding an alphabet of size 10 (values 0 through 9) requires 4 bits, but there are 2^4 − 10 = 6 unused codewords, so input values less than 6 have the first bit discarded, while input values greater than or equal to 6 are offset by 6 to the end of the binary space. (Unused bit patterns are not shown in this table.)

Input value    Offset    Offset value    Standard binary    Truncated binary
0              0         0               0000               000
1              0         1               0001               001
2              0         2               0010               010
3              0         3               0011               011
4              0         4               0100               100
5              0         5               0101               101
6              6         12              0110               1100
7              6         13              0111               1101
8              6         14              1000               1110
9              6         15              1001               1111

To decode, read the first k bits. If they encode a value less than u, decoding is complete. Otherwise, read an additional bit and subtract u from the result.
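
A matching Python sketch of this decoding rule (illustrative; bits is any iterator over the characters '0' and '1', and the alphabet size n must be known to the decoder):

# Decode one symbol from a bit stream using truncated binary with alphabet size n.
def truncated_binary_decode(bits, n):
    k = n.bit_length() - 1            # k such that 2^k <= n < 2^(k+1)
    u = (1 << (k + 1)) - n            # number of unused codewords
    x = 0
    for _ in range(k):                # read the first k bits
        x = (x << 1) | int(next(bits))
    if x < u:                         # short codeword: decoding is complete
        return x
    x = (x << 1) | int(next(bits))    # otherwise read one more bit ...
    return x - u                      # ... and subtract u

stream = iter("10110")                # the n = 5 codewords for 2 and 3, concatenated
assert truncated_binary_decode(stream, 5) == 2
assert truncated_binary_decode(stream, 5) == 3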

Example with n = 7

Here is a more extreme case: with n = 7 the next power of 2 is 8, so k = 2 and u = 2^3 − 7 = 1:

Input value    Offset    Offset value    Standard binary    Truncated binary
0              0         0               000                00
1              1         2               001                010
2              1         3               010                011
3              1         4               011                100
4              1         5               100                101
5              1         6               101                110
6              1         7               110                111

This last example demonstrates that a leading zero bit does not always indicate a short code; if u < 2^(k−1), some long codes will begin with a zero bit.

Simple algorithm

Generate the truncated binary encoding for a value x, 0 ≤ x < n, where n > 0 is the size of the alphabet containing x. n need not be a power of two.

string TruncatedBinary(int x, int n)
{
    // Set k = floor(log2(n)), i.e., k such that 2^k <= n < 2^(k+1).
    int k = 0, t = n;
    while (t > 1) {
        k++;
        t >>= 1;
    }
    // Set u to the number of unused codewords = 2^(k+1) - n.
    int u = (1 << (k + 1)) - n;
    if (x < u)
        return Binary(x, k);
    else
        return Binary(x + u, k + 1);
}

The routine Binary is expository; usually just the rightmost len bits of the variable x are desired. Here we simply output the binary code for x using len bits, padding with high-order 0s if necessary.

string Binary(int x, int len)
{
    // Build the binary representation of x from the low-order bit up;
    // even(x) is true when the low-order bit of x is 0.
    string s = "";
    while (x != 0) {
        if (even(x))
            s = '0' + s;
        else
            s = '1' + s;
        x >>= 1;
    }
    // Pad with high-order 0s to the requested length.
    while (s.Length < len)
        s = '0' + s;
    return s;
}
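
As a check against the n = 5 example above, TruncatedBinary(0, 5) through TruncatedBinary(4, 5) return "00", "01", "10", "110" and "111" respectively.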

On efficiency

If n is not a power of two, and k-bit symbols are observed with probability p, then (k + 1)-bit symbols are observed with probability 1 − p. We can calculate the expected number of bits per symbol as

    p·k + (1 − p)·(k + 1) = k + 1 − p.

Raw encoding of the symbol has k + 1 bits. Then the relative space saving s (see Data compression ratio) of the encoding can be defined as

    s = 1 − (k + 1 − p) / (k + 1).

When simplified, this expression leads to

    s = p / (k + 1).

This indicates that the relative efficiency of truncated binary encoding increases as the probability p of k-bit symbols increases and as the raw-encoding symbol bit-length k + 1 decreases.
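
For example, with n = 5 and a uniformly distributed input, p = 3/5 of the symbols receive k = 2-bit codewords, so the expected length is 3 − 0.6 = 2.4 bits per symbol against 3 bits for plain binary, a relative saving of s = 0.6/3 = 20%.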

References

  1. Eastman, Willard L., et al. (Aug. 1984). Apparatus and Method for Compressing Data Signals and Restoring the Compressed Data Signals, US Patent 4,464,650.
  2. Acharya, Tinku and JáJá, Joseph F. (Oct. 1996). An on-line variable-length binary encoding of text, Information Sciences, vol. 94, no. 1-4, pp. 1-22.
  3. Job van der Zwan. "Phase-in Codes".