Tunstall coding

In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression.

History

Tunstall coding was developed in Brian Parker Tunstall's 1967 PhD thesis at the Georgia Institute of Technology, "Synthesis of noiseless compression codes". [1]

Its design is a precursor to Lempel–Ziv.

Properties

Unlike variable-length codes, which include Huffman and Lempel–Ziv coding, Tunstall coding maps variable-length strings of source symbols to codewords with a fixed number of bits. [2]

Both Tunstall codes and Lempel–Ziv codes represent variable-length words by fixed-length codes. [3]

Unlike typical set encoding, Tunstall coding parses a stochastic source into dictionary words of variable length.

It can be shown [4] that, for a large enough dictionary, the number of bits per source letter can be arbitrarily close to H(U), the entropy of the source.
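For illustration, this entropy bound can be computed directly from the letter probabilities. The following Python snippet is a minimal sketch (the example distribution is arbitrary, not taken from the text):

import math

def entropy(probs):
    # H(U) = -sum p * log2(p), the lower bound on bits per source letter
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

print(entropy({'a': 0.7, 'b': 0.3}))  # about 0.881 bits per letter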

Algorithm

The algorithm requires as input an alphabet U, along with a probability distribution over the letters of U. It also requires an arbitrary constant C, which is an upper bound on the size of the dictionary that it will compute. The dictionary in question, D, is constructed as a tree of probabilities, in which each edge is associated with a letter from the input alphabet. The algorithm goes like this:

D := tree of |U| leaves, one for each letter in U.
While |D| < C:
    Convert most probable leaf to tree with |U| leaves.
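For illustration, the construction can be sketched in Python using a heap to track the most probable leaf. The function name tunstall_dictionary and the choice to stop before the dictionary would exceed C are details of this sketch, not of Tunstall's original description:

import heapq
import math

def tunstall_dictionary(probs, C):
    """Return the (word, probability) leaves of the Tunstall tree."""
    # Min-heap keyed on negative probability, so the most probable
    # leaf is popped first.
    heap = [(-p, letter) for letter, p in probs.items()]
    heapq.heapify(heap)
    # Each expansion replaces one leaf by len(probs) leaves, i.e. adds
    # len(probs) - 1 words; expand while the result still fits in C.
    while len(heap) + len(probs) - 1 <= C:
        neg_p, word = heapq.heappop(heap)          # most probable leaf
        for letter, p in probs.items():
            heapq.heappush(heap, (neg_p * p, word + letter))
    return [(word, -neg_p) for neg_p, word in heap]

# Example: a biased binary source with a dictionary of at most 8 words.
leaves = tunstall_dictionary({'a': 0.7, 'b': 0.3}, C=8)
print(len(leaves), "words, each coded on",
      math.ceil(math.log2(len(leaves))), "bits")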

Example

Let's imagine that we wish to encode the string "hello, world". Let's further assume (somewhat unrealistically) that the input alphabet contains only characters from the string "hello, world" — that is, 'h', 'e', 'l', ',', ' ', 'w', 'o', 'r', 'd'. We can therefore compute the probability of each character based on its statistical appearance in the input string. For instance, the letter 'l' appears thrice in a string of 12 characters: its probability is 3/12 = 0.25.
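For illustration, this frequency estimate can be written directly; a minimal Python sketch (variable names are arbitrary):

from collections import Counter

text = "hello, world"
counts = Counter(text)
probs = {ch: n / len(text) for ch, n in counts.items()}
print(probs['l'])  # 3/12 = 0.25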

We initialize the tree, starting with a tree of |U| = 9 leaves. Each word is therefore directly associated with a letter of the alphabet. The 9 words that we thus obtain can be encoded into a fixed-sized output of ⌈log2(9)⌉ = 4 bits.

We then take the leaf of highest probability (here, the leaf for 'l', with probability 3/12), and convert it into yet another tree of 9 leaves, one for each character. We re-compute the probabilities of those leaves. For instance, the sequence of the two letters 'll' happens once. Given that there are three occurrences of an 'l' followed by another character, of which one is an 'l', the resulting probability is (3/12) · (1/3) = 1/12.

We obtain 17 words, which can each be encoded into a fixed-sized output of ⌈log2(17)⌉ = 5 bits.

Note that we could iterate further, increasing the number of words by |U| − 1 = 8 every time.
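Since each iteration adds 8 words, the dictionary size and the corresponding fixed code length can be tabulated directly; the following Python snippet is a small illustrative calculation:

import math

# Dictionary size and fixed code length after k expansions of a
# 9-letter alphabet (each expansion adds 9 - 1 = 8 words).
for k in range(4):
    words = 9 + 8 * k
    bits = math.ceil(math.log2(words))
    print(k, "expansions:", words, "words,", bits, "bits per codeword")
# 0 expansions: 9 words, 4 bits per codeword
# 1 expansions: 17 words, 5 bits per codeword
# 2 expansions: 25 words, 5 bits per codeword
# 3 expansions: 33 words, 6 bits per codeword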

Limitations

Tunstall coding requires the algorithm to know, prior to the parsing operation, the probability distribution of the letters of the alphabet. This issue is shared with Huffman coding.

Its fixed-length block output makes it less flexible than Lempel–Ziv coding, which has a similar dictionary-based design but produces variable-sized output blocks.

Implied Read for base modification

[Figure: Ternary Tunstall tree]

This is an example of a Tunstall code being used to read (and hence transmit) data that is scrambled, e.g. by polynomial scrambling. This particular example converts the base of the data from 2 to 3 as the stream is read, thereby avoiding expensive base-conversion routines. With base conversion the main concern is the efficiency of the reads: ideally, an average of close to log2(3) ≈ 1.585 bits should be consumed per base-3 symbol produced. Since a base-3 symbol can carry at most log2(3) bits, this ensures that the reads do not lose the margin of transmission efficiency for which the base conversion is being employed in the first place. The read-to-convert-base mechanism can then be used to transmit data efficiently across channels that use a different base, e.g. transmitting binary data across MLT-3 channels with greater efficiency than a direct mapping of codes (which leaves a large number of codes unused).

Symbol  Code
AA      010
AB      011
AC      100
B       00
CA      101
CB      110
CC      111

We are essentially reading perfectly scrambled binary data ("implied data") for the purpose of transmitting it over base-3 channels; see the leaf nodes of the ternary Tunstall tree. Because its codeword is only 2 bits long, the read will produce 'B' as the first symbol 25% of the time (its implied probability when reading uniformly random bits is 1/4). A read of 'B' stops there, but with 75% probability we read 'A' or 'C', which requires a second base-3 symbol. The efficiency of the read is therefore 2.75 (the average length, in bits, of the size-7 Huffman-style code) divided by 1.75 (the average length of the 1- or 2-digit base-3 Tunstall word), i.e. about 1.571 bits per base-3 symbol, which, as required, is very close to log2(3) ≈ 1.585, an efficiency of about 99.1%. We can then transmit the symbols over base-3 channels efficiently.
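These figures can be checked by simulation. The Python sketch below feeds uniformly random bits through the prefix code of the table above and measures how many bits are consumed per base-3 symbol produced (the simulation itself is only an illustration, not part of the original example):

import math
import random

# Prefix code from the table above: binary codeword -> base-3 symbol(s).
CODE = {'00': 'B', '010': 'AA', '011': 'AB', '100': 'AC',
        '101': 'CA', '110': 'CB', '111': 'CC'}

random.seed(0)
bits_read = 0
symbols_out = 0
for _ in range(100_000):
    word = random.choice('01') + random.choice('01')
    if word not in CODE:               # '01', '10' and '11' need a third bit
        word += random.choice('01')
    bits_read += len(word)
    symbols_out += len(CODE[word])

print(bits_read / symbols_out)         # about 1.571 bits per base-3 symbol
print(math.log2(3))                    # 1.585..., the ideal figure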

References

  1. Tunstall, Brian Parker (September 1967). Synthesis of noiseless compression codes. Georgia Institute of Technology.
  2. Study of Tunstall's algorithm at MIT. http://www.rle.mit.edu/rgallager/documents/notes1.pdf
  3. "Variable to fixed length adaptive source coding - Lempel-Ziv coding".
  4. Study of Tunstall's algorithm from EPFL's Information Theory department.