Dynamic Markov compression


Dynamic Markov compression (DMC) is a lossless data compression algorithm developed by Gordon Cormack and Nigel Horspool.[1] It uses predictive arithmetic coding similar to prediction by partial matching (PPM), except that the input is predicted one bit at a time (rather than one byte at a time). DMC has a good compression ratio and moderate speed, similar to PPM, but requires somewhat more memory and is not widely implemented. Recent implementations include the experimental compression programs hook by Nania Francesco Antonio and ocamyd by Frank Schwellinger, as well as a submodel in paq8l by Matt Mahoney. These are based on the 1993 implementation in C by Gordon Cormack.


Algorithm

DMC predicts and codes one bit at a time. It differs from PPM in that it codes bits rather than bytes, and from context mixing algorithms such as PAQ in that there is only one context per prediction. The predicted bit is then coded using arithmetic coding.

Arithmetic coding

A bitwise arithmetic coder such as DMC has two components, a predictor and an arithmetic coder. The predictor accepts an n-bit input string x = x1x2...xn and assigns it a probability p(x), expressed as a product of a series of predictions, p(x1)p(x2|x1)p(x3|x1x2) ... p(xn|x1x2...xn−1). The arithmetic coder maintains two high-precision binary numbers, plow and phigh, representing the possible range for the total probability that the model would assign to all strings lexicographically less than x, given the bits of x seen so far. The compressed code for x is px, the shortest bit string representing a number between plow and phigh. It is always possible to find a number in this range no more than one bit longer than the Shannon limit, log2(1/p(x)). One such number can be obtained from phigh by dropping all of the trailing bits after the first bit that differs from plow.

Compression proceeds as follows. The initial range is set to plow = 0, phigh = 1. For each bit, the predictor estimates p0 = p(xi = 0|x1x2...xi−1) and p1 = 1 − p0, the probabilities of a 0 and a 1, respectively. The arithmetic coder then divides the current range (plow, phigh) into two parts in proportion to p0 and p1. The subrange corresponding to the actual next bit xi then becomes the new range.
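
As an illustration only, the following sketch narrows the range for a short bit string using made-up probabilities and double precision; a real coder uses finite-precision integer arithmetic (see the next sketch), and the probabilities would come from the DMC predictor.

    /* Illustrative interval narrowing for a 3-bit string.  The convention here is
       that the 0-subrange is the lower part, matching the lexicographic ordering
       described above.  Probabilities are example values, not DMC predictions. */
    #include <stdio.h>

    int main(void) {
        double plow = 0.0, phigh = 1.0;
        int    bits[] = {0, 1, 1};          /* the string x to be coded */
        double p0[]   = {0.7, 0.4, 0.2};    /* predictor's P(x_i = 0)   */

        for (int i = 0; i < 3; i++) {
            double mid = plow + (phigh - plow) * p0[i];  /* split in proportion p0 : p1 */
            if (bits[i] == 0) phigh = mid;               /* keep the subrange of the bit seen */
            else              plow  = mid;
            printf("bit %d: range [%f, %f)\n", bits[i], plow, phigh);
        }
        /* any number in [plow, phigh) identifies x; about -log2(phigh - plow) bits suffice */
        return 0;
    }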

For decompression, the predictor makes an identical series of predictions, given the bits decompressed so far. The arithmetic coder makes an identical series of range splits, then selects the range containing px and outputs the bit xi corresponding to that subrange.

In practice, it is not necessary to keep plow and phigh in memory to high precision. As the range narrows, the leading bits of both numbers become the same and can be output immediately.
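
One common way to realize this with fixed-width registers is byte-wise renormalization, similar in spirit to the coders used in paq-family programs; the sketch below is in that style and is not Cormack's original code. Here p0 is the predicted probability of a 0 bit scaled to the range 0..65535, and the decoder mirrors the same range arithmetic, reading bytes instead of writing them and comparing the coded value with mid to recover each bit.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of a binary arithmetic encoder with 32-bit range registers. */
    typedef struct { uint32_t lo, hi; FILE *out; } Encoder;

    void encoder_init(Encoder *e, FILE *out) { e->lo = 0; e->hi = 0xFFFFFFFF; e->out = out; }

    void encode_bit(Encoder *e, int bit, unsigned p0 /* P(bit = 0) scaled to 0..65535 */) {
        /* split [lo, hi] in proportion p0 : p1; the 0-subrange is the lower part */
        uint32_t mid = e->lo + (uint32_t)(((uint64_t)(e->hi - e->lo) * p0) >> 16);
        if (bit == 0) e->hi = mid;
        else          e->lo = mid + 1;
        /* renormalize: whenever the top byte of lo and hi agree it can never
           change again, so emit it and shift both registers left by one byte */
        while (((e->lo ^ e->hi) & 0xFF000000) == 0) {
            fputc(e->lo >> 24, e->out);
            e->lo <<= 8;
            e->hi = (e->hi << 8) | 0xFF;
        }
    }

    void encoder_flush(Encoder *e) {      /* emit enough bytes to pin down the final range */
        for (int i = 0; i < 4; i++) { fputc(e->lo >> 24, e->out); e->lo <<= 8; }
    }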

DMC model

The DMC predictor is a table which maps (bitwise) contexts to a pair of counts, n0 and n1, representing the number of zeros and ones previously observed in this context. Thus, it predicts that the next bit will be a 0 with probability p0 = n0/n = n0/(n0 + n1) and 1 with probability p1 = 1 − p0 = n1/n. In addition, each table entry has a pair of pointers to the contexts obtained by appending either a 0 or a 1 to the right of the current context (and possibly dropping bits on the left). Thus, it is never necessary to look up the current context in the table; it is sufficient to maintain a pointer to the current context and follow the links.
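
A table entry can be sketched as follows; the field and function names are illustrative and not taken from any particular implementation.

    /* One DMC state: its observation counts and its two successor links. */
    typedef struct Node {
        float        count[2];   /* n0, n1: zeros and ones seen in this context */
        struct Node *next[2];    /* state reached after appending a 0 or a 1 bit */
    } Node;

    /* p0 = n0 / (n0 + n1); the counts start at a small nonzero value, so the
       denominator is never zero. */
    static float predict_p0(const Node *s) {
        return s->count[0] / (s->count[0] + s->count[1]);
    }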

In the original DMC implementation, the initial table is the set of all contexts of length 8 to 15 bits that begin on a byte boundary. The initial state is any of the 8 bit contexts. The counts are floating point numbers initialized to a small nonzero constant such as 0.2. The counts are not initialized to zero in order to allow values to be coded even if they have not been seen before in the current context.
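
A minimal sketch of this initialization follows, using the Node type from above; the array layout, heap-style indexing, and names here are illustrative rather than copied from Cormack's dmc.c. There is one tree per value of the preceding byte, each holding the 255 contexts of 8 to 15 bits that start at that byte boundary; the leaves of each tree complete a byte and link to the root of that byte's tree.

    #define TREES      256   /* one tree per value of the preceding byte        */
    #define TREE_NODES 255   /* contexts of 8..15 bits: 1+2+4+...+128 per tree  */

    static Node table[TREES][TREE_NODES];

    /* Node i of a tree uses heap numbering: the children of node i are 2i+1 and
       2i+2.  Nodes 127..254 have seen 7 bits of the current byte, so their
       outgoing links complete the byte and jump to the root of that byte's tree. */
    static void init_model(void) {
        for (int ctx = 0; ctx < TREES; ctx++) {
            for (int i = 0; i < TREE_NODES; i++) {
                table[ctx][i].count[0] = 0.2f;   /* small nonzero priors */
                table[ctx][i].count[1] = 0.2f;
                if (i < 127) {                   /* still inside the current byte */
                    table[ctx][i].next[0] = &table[ctx][2 * i + 1];
                    table[ctx][i].next[1] = &table[ctx][2 * i + 2];
                } else {                         /* the next bit completes a byte */
                    int byte_if_zero = 2 * (i - 127);
                    table[ctx][i].next[0] = &table[byte_if_zero][0];
                    table[ctx][i].next[1] = &table[byte_if_zero + 1][0];
                }
            }
        }
    }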

Modeling is the same for compression and decompression. For each bit, p0 and p1 are computed, bit xi is coded or decoded, the model is updated by adding 1 to the count corresponding to xi, and the next context is found by traversing the link corresponding to xi.
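    
Continuing the sketches above, one modeling step might look like this; during compression `bit` is the next input bit, and during decompression it is the bit just decoded. A full implementation would also check the cloning condition described in the next section before following the link.

    static Node *state;                      /* pointer to the current context */

    static void model_start(void) { init_model(); state = &table[0][0]; }

    static float model_p0(void)   { return predict_p0(state); }

    static void model_update(int bit) {
        state->count[bit] += 1.0f;           /* record the observed bit       */
        state = state->next[bit];            /* traverse the link for that bit */
    }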


Adding new contexts

DMC as described above is equivalent to an order-1 context model. However, it is normal to add longer contexts to improve compression. If the current context is A, and the next context B would drop bits on the left, then DMC may add (clone) a new context C from B. C represents the same context as A after appending one bit on the right as with B, but without dropping any bits on the left. The link from A will thus be moved from B to point to C. B and C will both make the same prediction, and both will point to the same pair of next states. The total count, n = n0 + n1 for C will be equal to the count nx for A (for input bit x), and that count will be subtracted from B.

For example, suppose that state A represents the context 11111. On input bit 0, it transitions to state B representing context 110, obtained by dropping 3 bits on the left. In context A, there have been 4 zero bits and some number of one bits. In context B, there have been 3 zeros and 7 ones (n = 10), which predicts p1 = 0.7.

State         n0    n1    next0   next1
A = 11111     4           B
B = 110       3     7     E       F

C is cloned from B. It represents context 111110. Both B and C predict p1 = 0.7, and both go to the same next states, E and F. The count for C is n = 4, equal to n0 for A. This leaves n = 6 for B.

State          n0     n1     next0   next1
A = 11111      4              C
B = 110        1.8    4.2     E       F
C = 111110     1.2    2.8     E       F

States are cloned just prior to transitioning to them. In the original DMC, the condition for cloning a state is that the count on the transition from A to B is at least 2, and the total count for B is at least 2 more than that. (When the second threshold is greater than 0, it guarantees that other states will still transition to B after cloning.) Some implementations, such as hook, allow these thresholds to be set as parameters. In paq8l, these thresholds increase as memory is used up, in order to slow the growth of new states. In most implementations, when memory is exhausted the model is discarded and reinitialized back to the original bytewise order-1 model.
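
A sketch of the cloning step under those thresholds, continuing the earlier sketches; alloc_node is a hypothetical allocator, and the names and constants are illustrative rather than taken from any particular implementation. With the counts from the tables above (a transition count of 4 out of B's total of 10), it reproduces the split shown: C receives (1.2, 2.8) and B keeps (1.8, 4.2).

    #define MIN_EDGE   2.0f   /* traffic A -> B needed before cloning              */
    #define MIN_REMAIN 2.0f   /* traffic B must keep from its other predecessors   */

    extern Node *alloc_node(void);   /* hypothetical allocator; on failure a real
                                        implementation would reinitialize the model */

    /* Called just before moving from state a along bit x; returns the state to enter. */
    static Node *maybe_clone(Node *a, int x) {
        Node *b    = a->next[x];
        float edge = a->count[x];                       /* n_x of A: traffic A -> B */
        float nb   = b->count[0] + b->count[1];         /* total traffic into B     */

        if (edge >= MIN_EDGE && nb - edge >= MIN_REMAIN) {
            Node *c = alloc_node();
            float r = edge / nb;                        /* share of B's traffic from A */
            c->count[0] = b->count[0] * r;  b->count[0] -= c->count[0];
            c->count[1] = b->count[1] * r;  b->count[1] -= c->count[1];
            c->next[0]  = b->next[0];                   /* C predicts like B and shares */
            c->next[1]  = b->next[1];                   /*   B's successor states       */
            a->next[x]  = c;                            /* move A's link from B to C    */
            return c;
        }
        return b;
    }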

Related Research Articles

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.


In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.


A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now." A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). It is named after the Russian mathematician Andrey Markov.


Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters will be stored with fewer bits and not-so-frequently occurring characters will be stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q, where 0.0 ≤ q < 1.0. It represents the current information as a range, defined by two numbers. A recent family of entropy coders called asymmetric numeral systems allows for faster implementations thanks to directly operating on a single natural number representing the current information.

Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Alphabets following a geometric distribution will have a Golomb code as an optimal prefix code, making Golomb coding highly suitable for situations in which the occurrence of small values in the input stream is significantly more likely than large values.

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream. PPM algorithms can also be used to cluster data into predicted groupings in cluster analysis.

In statistical inference, specifically predictive inference, a prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.

PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio. Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge. PAQ is free software distributed under the GNU General Public License.

Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield a prediction that is often more accurate than any of the individual predictions. For example, one simple method is to average the probabilities assigned by each model. The random forest is another method: it outputs the prediction that is the mode of the predictions output by individual models. Combining models is an active area of research in machine learning.

In bioinformatics, GLIMMER is used to find genes in prokaryotic DNA. "It is effective at finding genes in bacteria, archaea, viruses, typically finding 98-99% of all relatively long protein coding genes". GLIMMER was the first system that used the interpolated Markov model to identify coding regions. The GLIMMER software is open source and is maintained by Steven Salzberg, Art Delcher, and their colleagues at the Center for Computational Biology at Johns Hopkins University. The original GLIMMER algorithms and software were designed by Art Delcher, Simon Kasif and Steven Salzberg and applied to bacterial genome annotation in collaboration with Owen White.

Lossless JPEG is a 1993 addition to the JPEG standard by the Joint Photographic Experts Group to enable lossless compression. However, the term may also be used to refer to all lossless compression schemes developed by the group, including JPEG 2000 and JPEG-LS.

In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.

In the mathematical theory of stochastic processes, variable-order Markov (VOM) models are an important class of models that extend the well known Markov chain models. In contrast to the Markov chain models, where each random variable in a sequence with a Markov property depends on a fixed number of random variables, in VOM models this number of conditioning random variables may vary based on the specific observed realization.

Context-adaptive binary arithmetic coding (CABAC) is a form of entropy encoding used in the H.264/MPEG-4 AVC and High Efficiency Video Coding (HEVC) standards. It is a lossless compression technique, although the video coding standards in which it is used are typically for lossy compression applications. CABAC is notable for providing much better compression than most other entropy encoding algorithms used in video encoding, and it is one of the key elements that provides the H.264/AVC encoding scheme with better compression capability than its predecessors.

ZPAQ is an open source command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents.

Stochastic chains with memory of variable length are a family of stochastic chains of finite order in a finite alphabet, such that, at each time step, only a finite suffix of the past, called the context, is necessary to predict the next symbol. These models were introduced in the information theory literature by Jorma Rissanen in 1983 as a universal tool for data compression, but have more recently been used to model data in different areas such as biology, linguistics and music.

Asymmetric numeral systems (ANS) is a family of entropy encoding methods introduced by Jarosław (Jarek) Duda from Jagiellonian University, used in data compression since 2014 due to improved performance compared to previous methods. ANS combines the compression ratio of arithmetic coding, with a processing cost similar to that of Huffman coding. In the tabled ANS (tANS) variant, this is achieved by constructing a finite-state machine to operate on a large alphabet without using multiplication.

References

  1. Gordon Cormack and Nigel Horspool, "Data Compression using Dynamic Markov Modelling", Computer Journal 30:6 (December 1987)