Move-to-front transform

The move-to-front (MTF) transform is an encoding of data (typically a stream of bytes) designed to improve the performance of entropy encoding techniques of compression. When efficiently implemented, it is fast enough that its benefits usually justify including it as an extra step in a data compression algorithm.

This algorithm was first published by Boris Ryabko under the name of "book stack" in 1980. [1] It was subsequently rediscovered by J. L. Bentley et al. in 1986, [2] as attested in the explanatory note. [3]

The transform

The main idea is that each symbol in the data is replaced by its index in a stack of "recently used symbols". For example, long runs of identical symbols are replaced by as many zeros, whereas a symbol that has not occurred for a long time is replaced by a large number. The data is thus transformed into a sequence of integers; if the data exhibits many local correlations, these integers tend to be small.

Let us give a precise description. Assume for simplicity that the symbols in the data are bytes. Each byte value is encoded by its index in a list of bytes, which changes over the course of the algorithm. The list is initially in order by byte value (0, 1, 2, 3, ..., 255). Therefore, the first byte is always encoded by its own value. However, after encoding a byte, that value is moved to the front of the list before continuing to the next byte.

An example will shed some light on how the transform works. Imagine that, instead of bytes, we are encoding values in a–z. We wish to transform the following sequence:

bananaaa

By convention, the list is initially (abcdefghijklmnopqrstuvwxyz). The first letter in the sequence is b, which appears at index 1 (the list is indexed from 0 to 25). We write a 1 to the output stream:

1

The b moves to the front of the list, producing (bacdefghijklmnopqrstuvwxyz). The next letter is a, which now appears at index 1. So we add a 1 to the output stream. We have:

1,1

and we move the letter a back to the front of the list. Continuing this way, we find that the sequence is encoded by:

1,1,13,1,1,1,0,0

The full run is tabulated below; the bracketed letter is the one being encoded, and the list shown is the state in which it is looked up:

Iteration     Sequence            List
[b]ananaaa    1                   (abcdefghijklmnopqrstuvwxyz)
b[a]nanaaa    1,1                 (bacdefghijklmnopqrstuvwxyz)
ba[n]anaaa    1,1,13              (abcdefghijklmnopqrstuvwxyz)
ban[a]naaa    1,1,13,1            (nabcdefghijklmopqrstuvwxyz)
bana[n]aaa    1,1,13,1,1          (anbcdefghijklmopqrstuvwxyz)
banan[a]aa    1,1,13,1,1,1        (nabcdefghijklmopqrstuvwxyz)
banana[a]a    1,1,13,1,1,1,0      (anbcdefghijklmopqrstuvwxyz)
bananaa[a]    1,1,13,1,1,1,0,0    (anbcdefghijklmopqrstuvwxyz)
Final         1,1,13,1,1,1,0,0    (anbcdefghijklmopqrstuvwxyz)
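
These steps are easy to reproduce in code. A minimal sketch, using a plain Python list as the symbol table (the Implementation section below gives a fuller version):

def mtf_encode(sequence: str, alphabet: str) -> list[int]:
    symbols = list(alphabet)
    output = []
    for ch in sequence:
        rank = symbols.index(ch)              # position of the symbol in the list
        output.append(rank)
        symbols.insert(0, symbols.pop(rank))  # move it to the front
    return output

print(mtf_encode("bananaaa", "abcdefghijklmnopqrstuvwxyz"))
# [1, 1, 13, 1, 1, 1, 0, 0]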

It is easy to see that the transform is reversible. Simply maintain the same list and decode by replacing each index in the encoded stream with the letter at that index in the list. Note the difference from the encoding method: the index is used directly to find the letter in the list, instead of searching the list for each letter to find its index.

That is, you start again with (abcdefghijklmnopqrstuvwxyz). You take the first "1" of the encoded block and look up index 1 in the list, which yields "b". Move "b" to the front, giving (bacdef...). Then take the next "1", look it up to get "a", move "a" to the front, and so on.
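
A minimal sketch of the inverse, mirroring mtf_encode above:

def mtf_decode(ranks: list[int], alphabet: str) -> str:
    symbols = list(alphabet)
    output = []
    for rank in ranks:
        ch = symbols.pop(rank)    # the index selects the symbol directly
        output.append(ch)
        symbols.insert(0, ch)     # same move-to-front update as encoding
    return "".join(output)

print(mtf_decode([1, 1, 13, 1, 1, 1, 0, 0], "abcdefghijklmnopqrstuvwxyz"))
# bananaaa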

Implementation

Details of implementation are important for performance, particularly for decoding. For encoding, no clear advantage is gained by using a linked list, so using an array to store the list is acceptable, with worst-case performance O(nk), where n is the length of the data to be encoded and k is the number of values (generally a constant for a given implementation).

The typical performance is better because frequently used symbols are more likely to be at the front and will produce earlier hits. This is also the idea behind a move-to-front self-organizing list. [4]

However, for decoding, specialized data structures can be used to greatly improve performance; a sketch of one possibility appears after the Python listings below.

Python

This is a possible implementation of the move-to-front algorithm in Python.

"""mtfwiki.py"""fromcollections.abcimportGenerator,IterablefromitertoolsimportchainclassMoveToFront:"""    >>> mtfwiki = MoveToFront()    >>> list(mtfwiki.encode("Wikipedia"))    [87, 105, 107, 1, 112, 104, 104, 3, 102]    >>> mtfwiki.decode([87, 105, 107, 1, 112, 104, 104, 3, 102])    'Wikipedia'    >>> list(mtfwiki.encode("wikipedia"))    [119, 106, 108, 1, 113, 105, 105, 3, 103]    >>> mtfwiki.decode([119, 106, 108, 1, 113, 105, 105, 3, 103])    'wikipedia'    """def__init__(self,common_dictionary:Iterable[int]=range(256)):"""        Instead of always transmitting an "original" dictionary,        it is simpler to just agree on an initial set.        Here we use the 256 possible values of a byte.        """self.common_dictionary=common_dictionarydefencode(self,plain_text:str)->Generator[int]:# Changing the common dictionary is a bad idea. Make a copy.dictionary=list(self.common_dictionary)# Read in each characterforcinplain_text.encode("latin-1"):# Change to bytes for 256.# Find the rank of the character in the dictionary [O(k)]rank=dictionary.index(c)# the encoded characteryieldrank# Update the dictionary [O(1)]dictionary.pop(rank)dictionary.insert(0,c)defdecode(self,compressed_data:Iterable[int])->str:"""        Inverse function that recover the original text        """dictionary=list(self.common_dictionary)plain_text=[]# Read in each rank in the encoded textforrankincompressed_data:# Read the character of that rank from the dictionaryplain_text.append(dictionary[rank])# Update the dictionarye=dictionary.pop(rank)dictionary.insert(0,e)returnbytes(plain_text).decode("latin-1")# Return original string

In this example we can see the MTF code taking advantage of the three repeated i's in the input word. The common dictionary here, however, is less than ideal: it is initialized in byte order, which places the commonly used printable ASCII characters after the little-used control codes, against the MTF design intent of keeping commonly used symbols near the front. If one rotates the dictionary to put the more-used characters in earlier places, a better encoding is obtained:

def block32(x):
    return (x + i for i in range(32))


class MoveToFrontMoreCommon(MoveToFront):
    """
    >>> mtfwiki = MoveToFrontMoreCommon()
    >>> list(mtfwiki.encode("Wikipedia"))
    [55, 10, 12, 1, 17, 9, 9, 3, 7]
    """

    def __init__(self):
        # Materialize the chain into a list: a bare iterator would be
        # exhausted after the first encode/decode call.
        super().__init__(
            list(
                chain(
                    # Sort the ASCII blocks:
                    block32(ord("a") - 1),  # first lowercase,
                    block32(ord("A") - 1),  # then uppercase,
                    block32(ord("!") - 1),  # punctuation/numbers,
                    block32(0),             # the control codes,
                    range(128, 256),        # and finally the non-ASCII stuff
                )
            )
        )


if __name__ == "__main__":
    import doctest

    doctest.testmod()
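
As for the specialized decoding structures mentioned earlier, here is one possible sketch (our own illustration, not from the original article; the class and method names are hypothetical): a square-root decomposition that keeps the symbol list as a sequence of small buckets, so that finding, removing, and re-inserting a symbol costs roughly O(√k) per step instead of O(k).

import math

class BucketedMTFDecoder:
    """Decode MTF ranks using a bucketed list (sqrt decomposition)."""

    def __init__(self, alphabet):
        symbols = list(alphabet)
        b = max(1, math.isqrt(len(symbols)))
        self.cap = 2 * b  # split a bucket when it grows past this
        self.buckets = [symbols[i:i + b] for i in range(0, len(symbols), b)]

    def _pop_rank(self, rank):
        # Walk the buckets to find the one holding the global rank.
        for bucket in self.buckets:
            if rank < len(bucket):
                return bucket.pop(rank)
            rank -= len(bucket)
        raise IndexError("rank out of range")

    def decode(self, ranks):
        out = []
        for rank in ranks:
            sym = self._pop_rank(rank)
            out.append(sym)
            front = self.buckets[0]
            front.insert(0, sym)
            if len(front) > self.cap:  # keep buckets small
                half = len(front) // 2
                self.buckets[0:1] = [front[:half], front[half:]]
        return "".join(out)

print(BucketedMTFDecoder("abcdefghijklmnopqrstuvwxyz")
      .decode([1, 1, 13, 1, 1, 1, 0, 0]))
# bananaaa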

Use in practical data compression algorithms

The MTF transform takes advantage of local correlation of frequencies to reduce the entropy of a message: recently used letters stay towards the front of the list, so if the use of letters exhibits local correlations, the output contains a large number of small integers such as 0s and 1s.

However, not all data exhibits this type of local correlation, and for some messages, the MTF transform may actually increase the entropy.

An important use of the MTF transform is in Burrows–Wheeler transform based compression. The Burrows–Wheeler transform is very good at producing a sequence that exhibits local frequency correlation from text and certain other special classes of data. Compression benefits greatly from following up the Burrows–Wheeler transform with an MTF transform before the final entropy-encoding step.

Example

As an example, imagine we wish to compress Hamlet's soliloquy (To be, or not to be...). We can calculate the size of this message to be 7033 bits. Naively, we might try to apply the MTF transform directly. The result is a message with 7807 bits (higher than the original). The reason is that English text does not in general exhibit a high level of local frequency correlation. However, if we first apply the Burrows–Wheeler transform, and then the MTF transform, we get a message with 6187 bits. Note that the Burrows–Wheeler transform does not decrease the entropy of the message; it only reorders the bytes in a way that makes the MTF transform more effective.
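
The experiment can be reproduced in outline with a short script (our own reconstruction, not the article's exact setup; "hamlet.txt" is a hypothetical input file, and the bit counts depend on the precise text and entropy model, so the figures above will not be matched exactly):

from collections import Counter
from math import log2

def bwt(text: str) -> str:
    # Naive O(n^2 log n) construction: last column of the sorted rotations.
    s = text + "\x03"  # sentinel assumed absent from the text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def mtf(text: str) -> list[int]:
    symbols = sorted(set(text))  # initial list: the text's own alphabet
    out = []
    for ch in text:
        rank = symbols.index(ch)
        out.append(rank)
        symbols.insert(0, symbols.pop(rank))
    return out

def entropy_bits(values) -> float:
    # Zeroth-order (symbol-frequency) entropy of the whole message, in bits.
    counts = Counter(values)
    n = len(values)
    return -n * sum(c / n * log2(c / n) for c in counts.values())

text = open("hamlet.txt").read()         # hypothetical input
print(entropy_bits(mtf(text)))           # MTF alone: can exceed the original
print(entropy_bits(mtf(bwt(text))))      # BWT then MTF: noticeably smaller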

One problem with the basic MTF transform is that it makes the same changes for any character, regardless of frequency, which can result in diminished compression, as characters that occur rarely may push frequent characters to higher values. Various alterations and alternatives have been developed for this reason. One common change is that characters found beyond a certain position in the list may only be moved up to a fixed threshold position, rather than all the way to the front. Another is to keep a running count of each character's local frequency and use these counts to choose the characters' order at any point. Many of these transforms still reserve zero for repeated characters, since these are often the most common in data after the Burrows–Wheeler transform; a threshold variant is sketched below.
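
As an illustration of the threshold idea (our own sketch, not a published variant; the threshold parameter is hypothetical): a symbol found at or beyond the threshold advances only to the threshold position, so rare symbols cannot immediately displace frequent ones at the front. The update rule is deterministic, so a decoder applying the mirror-image rule can invert it.

def mtf_threshold_encode(data: str, alphabet: str, threshold: int = 16) -> list[int]:
    symbols = list(alphabet)
    out = []
    for ch in data:
        rank = symbols.index(ch)
        out.append(rank)
        symbols.pop(rank)
        # Symbols found at rank >= threshold only advance to `threshold`;
        # symbols already near the front still move all the way to position 0.
        symbols.insert(0 if rank < threshold else threshold, ch)
    return out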

References

  1. Ryabko, Boris Yakovlevich (1980). "Data compression by means of a 'book stack'". Problems of Information Transmission. 16 (4): 265–269. Zbl 0466.94007.
  2. Bentley, Jon Louis; Sleator, Daniel Dominic Kaplan; Tarjan, Robert Endre; Wei, V. K. (1986). "A Locally Adaptive Data Compression Scheme". Communications of the ACM. 29 (4): 320–330. doi:10.1145/5684.5688.
  3. Ryabko, Boris Yakovlevich; Horspool, R. Nigel; Cormack, Gordon Villy (1987). "Comments to: 'A locally adaptive data compression scheme' by J. L. Bentley, D. D. Sleator, R. E. Tarjan and V. K. Wei". Communications of the ACM. 30 (9): 792–794. doi:10.1145/30401.315747.
  4. Rivest, Ronald Linn (1976). "On self-organizing sequential search heuristics". Communications of the ACM. 19 (2): 63–67. doi:10.1145/359997.360000.