LZ77 and LZ78

LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 [1] and 1978. [2] They are also known as LZ1 and LZ2 respectively. [3] These two algorithms form the basis for many variations including LZW, LZSS, LZMA and others. Besides their academic influence, these algorithms formed the basis of several ubiquitous compression schemes, including GIF and the DEFLATE algorithm used in PNG and ZIP.

They are both theoretically dictionary coders. LZ77 maintains a sliding window during compression. This was later shown to be equivalent to the explicit dictionary constructed by LZ78—however, they are only equivalent when the entire data is intended to be decompressed.

Since LZ77 encodes and decodes from a sliding window over previously seen characters, decompression must always start at the beginning of the input. Conceptually, LZ78 decompression could allow random access to the input if the entire dictionary were known in advance. However, in practice the dictionary is created during encoding and decoding by creating a new phrase whenever a token is output. [4]

The algorithms were named an IEEE Milestone in 2004. [5] In 2021 Jacob Ziv was awarded the IEEE Medal of Honor for his involvement in their development. [6]

Theoretical efficiency

In the second of the two papers that introduced these algorithms they are analyzed as encoders defined by finite-state machines. A measure analogous to information entropy is developed for individual sequences (as opposed to probabilistic ensembles). This measure gives a bound on the data compression ratio that can be achieved. It is then shown that there exist finite lossless encoders for every sequence that achieve this bound as the length of the sequence grows to infinity. In this sense an algorithm based on this scheme produces asymptotically optimal encodings. This result can be proven more directly, as for example in notes by Peter Shor. [7]

Formally (Theorem 13.5.3 [8]):

LZ78 is universal and entropic. If $X$ is a binary source that is stationary and ergodic, then

$$\limsup_{n \to \infty} \frac{1}{n} \, \ell_{LZ78}(X_1 X_2 \cdots X_n) \le h(X)$$

with probability 1. Here $h(X)$ is the entropy rate of the source.

Similar theorems apply to other versions of the LZ algorithm.

LZ77

LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream". (The distance is sometimes called the offset instead.)

To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 KB, 4 KB, or 32 KB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding-window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. The larger the sliding window is, the farther back the encoder may search for matches.

It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "Go back four characters and copy ten characters from that position into the current position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over, it may be fed again as input to the copy command. When the copy-from position makes it to the initial destination position, it is consequently fed data that was pasted from the beginning of the copy-from position. The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of run-length encoding.
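As a concrete sketch (in Python; ours rather than any particular codec's), the following byte-at-a-time copy resolves such a pair. Because each appended byte immediately becomes readable input for the rest of the copy, a length greater than the distance simply repeats the run:

def copy_match(out: bytearray, distance: int, length: int) -> None:
    # Resolve a length-distance pair one byte at a time. Bytes written
    # earlier in this same copy are re-read as the copy proceeds, so
    # length may exceed distance.
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])

# "Go back four characters and copy ten characters": repeats "abcd".
buf = bytearray(b"abcd")
copy_match(buf, distance=4, length=10)
print(buf.decode())  # abcdabcdabcdab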

Another way to see things is as follows: While encoding, for the search pointer to continue finding matched pairs past the end of the search window, all characters from the first match at offset D and forward to the end of the search window must have matched input, and these are the (previously seen) characters that comprise a single run unit of length LR, which must equal D. Then as the search pointer proceeds past the search window and forward, as far as the run pattern repeats in the input, the search and input pointers will be in sync and match characters until the run pattern is interrupted. Then L characters have been matched in total, L > D, and the code is [D, L, c].

Upon decoding [D, L, c], again, D = LR. When the first LR characters are read to the output, this corresponds to a single run unit appended to the output buffer. At this point, the read pointer could be thought of as only needing to return int(L/LR) + (1 if L mod LR ≠ 0) times to the start of that single buffered run unit, read LR characters (or maybe fewer on the last return), and repeat until a total of L characters are read. But mirroring the encoding process, since the pattern is repetitive, the read pointer need only trail in sync with the write pointer by a fixed distance equal to the run length LR until L characters have been copied to output in total.

Considering the above, especially if the compression of data runs is expected to predominate, the window search should begin at the end of the window and proceed backwards, since run patterns, if they exist, will be found first. This allows the search to terminate, absolutely if the current maximal matching sequence length is met, or judiciously, if a sufficient length is met; finally, there is the simple possibility that more recent data may correlate better with the next input.

Pseudocode

The following pseudocode reproduces the sliding-window mechanism of the LZ77 compression algorithm.

while input is not empty do
    match := longest repeated occurrence of input that begins in window

    if match exists then
        d := distance to start of match
        l := length of match
        c := char following match in input
    else
        d := 0
        l := 0
        c := first char of input
    end if

    output (d, l, c)

    discard l + 1 chars from front of window
    s := pop l + 1 chars from front of input
    append s to back of window
repeat
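For concreteness, here is a minimal Python rendering of the pseudocode above; a hedged sketch under our own names (compress, decompress), with a naive linear window search chosen for clarity rather than speed:

def compress(data: bytes, window_size: int = 4096) -> list[tuple[int, int, int]]:
    # Emit (distance, length, next-char) triples, as in the pseudocode.
    out, pos = [], 0
    while pos < len(data):
        start = max(0, pos - window_size)
        best_d = best_l = 0
        for cand in range(start, pos):          # search the window
            l = 0
            # The match may run past pos (overlap), mirroring the
            # byte-at-a-time copy rule; stop one byte short of the end
            # so a following char always exists.
            while pos + l < len(data) - 1 and data[cand + l] == data[pos + l]:
                l += 1
            if l > best_l:
                best_d, best_l = pos - cand, l
        out.append((best_d, best_l, data[pos + best_l]))
        pos += best_l + 1                       # consume match plus char
    return out

def decompress(triples) -> bytes:
    out = bytearray()
    for d, l, c in triples:
        start = len(out) - d
        for i in range(l):                      # handles l > d naturally
            out.append(out[start + i])
        out.append(c)
    return bytes(out)

msg = b"abracadabra abracadabra"
assert decompress(compress(msg)) == msg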

Implementations

Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they encode their compressed data to vary the numerical ranges of a length–distance pair, alter the number of bits consumed for a length–distance pair, and distinguish their length–distance pairs from literals (raw data encoded as itself, rather than as part of a length–distance pair). A few examples include LZSS, DEFLATE, and LZ4, described under the related articles below.

LZ78

The LZ78 algorithms compress sequential data by building a dictionary of token sequences from the input, and then replacing the second and subsequent occurrences of a sequence in the data stream with a reference to the dictionary entry. The observation is that the number of repeated sequences is a good measure of the non-random nature of a sequence. The algorithms represent the dictionary as an n-ary tree, where n is the number of tokens used to form token sequences. Each dictionary entry is of the form dictionary[...] = {index, token}, where index is the index of a dictionary entry representing a previously seen sequence, and token is the next token from the input that makes this entry unique in the dictionary. The algorithm is greedy, so nothing is added to the table until a token that makes the entry unique is found.

The algorithm initializes last matching index = 0 and next available index = 1. Then, for each token of the input stream, the dictionary is searched for a match: {last matching index, token}. If a match is found, then last matching index is set to the index of the matching entry, nothing is output, and last matching index is left representing the input so far. Input is processed until a match is not found. Then a new dictionary entry is created, dictionary[next available index] = {last matching index, token}, the algorithm outputs last matching index followed by token, and then resets last matching index = 0 and increments next available index. As an example, consider the sequence of tokens AABBA, which would assemble the dictionary:

0 {0,_}
1 {0,A}
2 {1,B}
3 {0,B}

and the output sequence of the compressed data would be 0A1B0B. Note that the last A is not represented yet, as the algorithm cannot know what comes next. In practice an EOF marker is added to the input, AABBA$ for example. Note also that in this case the output 0A1B0B1$ is longer than the original input, but the compression ratio improves considerably as the dictionary grows, and in binary the indexes need not be represented by more than the minimum number of bits. [11]
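As a hedged sketch of this greedy procedure (ours, with phrases stored as strings rather than {index, token} pairs for brevity), the following Python encoder reproduces the example output, using '$' as the EOF marker described above:

def lz78_compress(data: str) -> list[tuple[int, str]]:
    dictionary = {"": 0}            # entry 0 is the empty phrase
    next_index = 1
    phrase, out = "", []
    for token in data:
        if phrase + token in dictionary:
            phrase += token         # keep extending the current match
        else:
            out.append((dictionary[phrase], token))
            dictionary[phrase + token] = next_index
            next_index += 1
            phrase = ""
    if phrase:                      # flush the trailing match with EOF
        out.append((dictionary[phrase], "$"))
    return out

print(lz78_compress("AABBA"))       # [(0, 'A'), (1, 'B'), (0, 'B'), (1, '$')]
# i.e. the output sequence 0A1B0B1$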

Decompression consists of rebuilding the dictionary from the compressed sequence. From the sequence 0A1B0B1$, the first entry is always the terminator 0 {...}, and the first pair from the sequence produces entry 1 {0,A}. The A is added to the output. The second pair from the input is 1B and results in entry number 2 in the dictionary, {1,B}. The token "B" is output, preceded by the sequence represented by dictionary entry 1. Entry 1 is an 'A' (followed by "entry 0", nothing), so AB is added to the output. Next, 0B is added to the dictionary as the next entry, 3 {0,B}, and B (preceded by nothing) is added to the output. Finally, a dictionary entry for 1$ is created and A$ is output, resulting in A AB B A$, or AABBA once the spaces and EOF marker are removed.
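A matching decoder sketch (again ours) rebuilds the phrase list as it reads each pair, reproducing the walk-through above:

def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    phrases = [""]                   # entry 0: the terminator
    out = []
    for index, token in pairs:
        phrase = phrases[index] + token
        phrases.append(phrase)       # becomes the next dictionary entry
        out.append(phrase)
    return "".join(out)

# The pairs behind 0A1B0B1$ decode to A + AB + B + A$:
print(lz78_decompress([(0, "A"), (1, "B"), (0, "B"), (1, "$")]))  # AABBA$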

LZW

LZW is an LZ78-based algorithm that uses a dictionary pre-initialized with all possible characters (symbols) or emulation of a pre-initialized dictionary. The main improvement of LZW is that when a match is not found, the current input stream character is assumed to be the first character of an existing string in the dictionary (since the dictionary is initialized with all possible characters), so only the last matching index is output (which may be the pre-initialized dictionary index corresponding to the previous (or the initial) input character). Refer to the LZW article for implementation details.
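As a rough sketch of that difference (ours; it omits LZW details such as variable-width codes, for which see the LZW article), the dictionary below is pre-seeded with all 256 single-byte strings, so the output is a stream of indices with no literal tokens:

def lzw_compress(data: bytes) -> list[int]:
    dictionary = {bytes([i]): i for i in range(256)}   # pre-initialized
    next_index = 256
    phrase, out = b"", []
    for b in data:
        candidate = phrase + bytes([b])
        if candidate in dictionary:
            phrase = candidate      # keep extending the current match
        else:
            out.append(dictionary[phrase])
            dictionary[candidate] = next_index
            next_index += 1
            phrase = bytes([b])     # unmatched byte starts the next phrase,
                                    # so no literal needs to be emitted
    return out + ([dictionary[phrase]] if phrase else [])

print(lzw_compress(b"TOBEORNOTTOBE"))   # index 256 ("TO") is reused near the end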

BTLZ is an LZ78-based algorithm that was developed for use in real-time communications systems (originally modems) and standardized by CCITT/ITU as V.42bis. When the trie-structured dictionary is full, a simple re-use/recovery algorithm is used to ensure that the dictionary can keep adapting to changing data. A counter cycles through the dictionary. When a new entry is needed, the counter steps through the dictionary until a leaf node is found (a node with no dependents). This is deleted and the space re-used for the new entry. This is simpler to implement than LRU or LFU and achieves equivalent performance.
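A hypothetical sketch of that recovery scan (the Node layout and names here are ours for illustration; V.42bis defines its own trie representation):

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}                     # token -> child Node

def reclaim_slot(table, counter):
    # Step a cycling counter through the dictionary table until a leaf
    # (a node with no dependents) is found; unlink it so its slot can
    # be reused for a new entry.
    n = len(table)
    for step in range(n):
        i = (counter + step) % n
        node = table[i]
        if not node.children:                  # leaf found
            parent = node.parent
            for tok, child in list(parent.children.items()):
                if child is node:
                    del parent.children[tok]   # detach from the trie
            return i
    raise RuntimeError("dictionary has no leaves")

# Tiny demo: "b" hangs off "a"; only "b" is a leaf, so slot 1 is reused.
root = Node()
a = Node(root); root.children["a"] = a
b = Node(a); a.children["b"] = b
print(reclaim_slot([a, b], counter=0))         # 1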

Related Research Articles

GIF

The Graphics Interchange Format is a bitmap image format that was developed by a team at the online services provider CompuServe led by American computer scientist Steve Wilhite and released on June 15, 1987.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

Run-length encoding (RLE) is a form of lossless data compression in which runs of data are stored as a single data value and count, rather than as the original run. This is most efficient on data that contains many such runs, for example, simple graphic images such as icons, line drawings, Conway's Game of Life, and animations. For files that do not have many runs, RLE could increase the file size.

bzip2

bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure.

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

Abraham Lempel

Abraham Lempel was an Israeli computer scientist and one of the fathers of the LZ family of lossless data compression algorithms.

Lempel–Ziv–Storer–Szymanski (LZSS) is a lossless data compression algorithm, a derivative of LZ77, that was created in 1982 by James A. Storer and Thomas Szymanski. LZSS was described in the article "Data compression via textual substitution", published in the Journal of the ACM.

Grammar-based codes or grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms. To compress a data sequence x, a grammar-based code transforms x into a context-free grammar G. The problem of finding a smallest grammar for an input sequence is known to be NP-hard, so many grammar-transform algorithms have been proposed from theoretical and practical viewpoints. Generally, the produced grammar is further compressed by statistical encoders like arithmetic coding.

LZWL is a syllable-based variant of the LZW (Lempel-Ziv-Welch) compression algorithm, designed to work with syllables derived from any syllable decomposition algorithm. This approach allows LZWL to efficiently process both syllables and words, offering a nuanced method for data compression.

Lempel–Ziv–Stac is a lossless data compression algorithm that uses a combination of the LZ77 sliding-window compression algorithm and fixed Huffman coding. It was originally developed by Stac Electronics for tape compression, and subsequently adapted for hard disk compression and sold as the Stacker disk compression software. It was later specified as a compression algorithm for various network protocols. LZS is specified in the Cisco IOS stack.

liblzg is a compression library for performing lossless data compression. It implements an algorithm that is a variation of the LZ77 algorithm, called the LZG algorithm, with the primary focus of providing a very simple and fast decoding method. One of the key features of the algorithm is that it requires no memory during decompression. The software library is free software, distributed under the zlib license.

In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression.

LZ4 is a lossless data compression algorithm that is focused on compression and decompression speed. It belongs to the LZ77 family of byte-oriented compression schemes.

LZFSE is an open source lossless data compression algorithm created by Apple Inc. It was released with a simpler algorithm called LZVN.

The Lempel–Ziv complexity is a measure that was first presented in the article On the Complexity of Finite Sequences, by two Israeli computer scientists, Abraham Lempel and Jacob Ziv. This complexity measure is related to Kolmogorov complexity, but the only function it uses is the recursive copy.

842, 8-4-2, or EFT is a data compression algorithm. It is a variation on Lempel–Ziv compression with a limited dictionary length. With typical data, 842 gives 80 to 90 percent of the compression of LZ77 with much faster throughput and less memory use. Hardware implementations also provide minimal use of energy and minimal chip area.

References

  1. Ziv, Jacob; Lempel, Abraham (May 1977). "A Universal Algorithm for Sequential Data Compression". IEEE Transactions on Information Theory. 23 (3): 337–343. CiteSeerX 10.1.1.118.8921. doi:10.1109/TIT.1977.1055714. S2CID 9267632.
  2. Ziv, Jacob; Lempel, Abraham (September 1978). "Compression of Individual Sequences via Variable-Rate Coding". IEEE Transactions on Information Theory. 24 (5): 530–536. CiteSeerX 10.1.1.14.2892. doi:10.1109/TIT.1978.1055934.
  3. US Patent No. 5,532,693: Adaptive data compression system with systolic string matching logic.
  4. "Lossless Data Compression: LZ78". cs.stanford.edu.
  5. "Milestones: Lempel-Ziv Data Compression Algorithm, 1977". IEEE Global History Network. Institute of Electrical and Electronics Engineers. 22 July 2014. Retrieved 9 November 2014.
  6. Goodrich, Joanna. "IEEE Medal of Honor Goes to Data Compression Pioneer Jacob Ziv". IEEE Spectrum: Technology, Engineering, and Science News. Retrieved 18 January 2021.
  7. Shor, Peter (14 October 2005). "Lempel-Ziv notes" (PDF). Archived from the original (PDF) on 28 May 2021. Retrieved 9 November 2014.
  8. Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). Hoboken, NJ: Wiley-Interscience. ISBN 978-0-471-24195-9.
  9. "QFS Compression (RefPack)". Niotso Wiki. Retrieved 9 November 2014.
  10. Feldspar, Antaeus (23 August 1997). "An Explanation of the Deflate Algorithm". comp.compression newsgroup. zlib.net. Retrieved 9 November 2014.
  11. "Lempel-Ziv notes" (PDF). https://math.mit.edu/~goemans/18310S15/lempel-ziv-notes.pdf