Re-Pair


Re-Pair (short for recursive pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar generating a single string: the input text. In order to perform the compression in linear time, it consumes an amount of memory approximately five times the size of its input.


The grammar is built by recursively replacing the most frequent pair of characters occurring in the text. Once there is no pair of characters occurring twice, the resulting string is used as the axiom of the grammar. Therefore, the output grammar is such that all rules but the axiom have two symbols on the right-hand side.
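This replacement loop can be illustrated with the following Python sketch. It is a direct, quadratic-time rendering of the idea, not the linear-time algorithm discussed below; the function name re_pair and the representation of new symbols as integers from 256 upwards are illustrative.

from collections import Counter

def re_pair(text):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair.
    Quadratic-time sketch for illustration, not the linear-time algorithm."""
    seq = list(text)
    rules = {}                    # new symbol -> the pair it replaces
    next_symbol = 256             # terminals occupy 0..255 (extended ASCII)
    while True:
        counts = Counter(zip(seq, seq[1:]))   # frequencies of adjacent pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:              # no pair occurs twice: seq becomes the axiom
            break
        rules[next_symbol] = pair
        new_seq, i = [], 0
        while i < len(seq):       # replace occurrences from left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                new_seq.append(next_symbol)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
        next_symbol += 1
    return seq, rules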

How it works

Construction of a Straight Line Program that generates the string w="xabcabcy123123zabc" by using Re-Pair (Re pair example.gif)

Re-Pair was first introduced by N. J. Larsson and A. Moffat [1] in 1999.

In their paper, the algorithm is presented together with a detailed description of the data structures required to implement it with linear time and space complexity. Their experiments showed that Re-Pair achieves high compression ratios and offers good decompression performance. However, the major drawback of the algorithm is its memory consumption, approximately five times the size of the input, which is required to perform the compression in linear time but makes the algorithm impractical for compressing large files.

The image above shows how the algorithm compresses the string w="xabcabcy123123zabc".

During the first iteration, the pair ab, which occurs three times in w, is replaced by a new symbol R1. On the second iteration, the most frequent pair in the string xR1cR1cy123123zR1c, which is R1c, is replaced by a new symbol R2. Thus, at the end of the second iteration, the remaining string is xR2R2y123123zR2. In the next two iterations, the pairs 12 and R3 3 are replaced by the symbols R3 and R4 respectively. Finally, the string xR2R2yR4R4zR2 contains no repeated pair and is therefore used as the axiom of the output grammar.
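Running the sketch above on w reproduces this derivation (writing the new integer symbols 256–259 as R1–R4):

axiom, rules = re_pair("xabcabcy123123zabc")
# rules: R1 -> ab,  R2 -> R1 c,  R3 -> 12,  R4 -> R3 3
# axiom: x R2 R2 y R4 R4 z R2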

Data structures

In order to achieve linear time complexity, Re-Pair requires the following data structures:

- the sequence of symbols itself, stored as an array in which the occurrences of each pair are linked together;
- a hash table keeping track of all pairs of adjacent symbols occurring in the sequence;
- a priority queue in which these pairs are ordered by their frequency.

Since the hash table and the priority queue refer to the same elements (pairs), they can be implemented by a common data structure called PAIR with pointers for the hash table (h_next) and the priority queue (p_next and p_prev). Furthermore, each PAIR points to the beginning of the first (f_pos) and the last (b_pos) occurrences of the string represented by the PAIR in the sequence. The following picture shows an overview of this data structure.

Structure repair.png
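As a rough illustration, a PAIR record could be declared as follows. Only the field names h_next, p_next, p_prev, f_pos and b_pos come from the description above; the rest of this Python sketch is an assumption about how the record might be fleshed out.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Pair:
    left: int                         # first symbol of the pair
    right: int                        # second symbol of the pair
    freq: int                         # number of occurrences in the sequence
    h_next: Optional["Pair"] = None   # next PAIR in the same hash-table bucket
    p_next: Optional["Pair"] = None   # next PAIR in the same priority-queue list
    p_prev: Optional["Pair"] = None   # previous PAIR in that list
    f_pos: int = -1                   # position of the pair's first occurrence
    b_pos: int = -1                   # position of the pair's last occurrence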

The following two pictures show an example of how these data structures look after the initialization and after applying one step of the pairing process (pointers to NULL are not displayed):

Slide00.png Slide01.png

Encoding the grammar

Once the grammar has been built for a given input string, it has to be encoded efficiently in order to achieve effective compression. One of the simplest methods is implicit encoding, which consists of invoking the function encodeCFG(X), described below, sequentially on all the axiom's symbols. Intuitively, rules are encoded as they are visited in a depth-first traversal of the grammar. The first time a rule is visited, its right-hand side is encoded recursively and a new code is assigned to the rule; from that point on, whenever the rule is reached, the assigned value is written.

num_rules_encoded = 256   // By default, the extended ASCII charset are the terminals of the grammar.

writeSymbol(symbol s) {
    bitslen = log(num_rules_encoded);   // Initially 8, to describe any extended ASCII character
    write s in binary using bitslen bits
}

void encodeCFG_rec(symbol s) {
    if (s is non-terminal and this is the first time symbol s appears) {
        take rule s → XY;
        write bit 1;
        encodeCFG_rec(X);
        encodeCFG_rec(Y);
        assign to symbol s the value ++num_rules_encoded;
    } else {
        write bit 0;
        writeSymbol(terminal / value assigned)
    }
}

void encodeCFG(symbol s) {
    encodeCFG_rec(s);
    write bit 1;
}
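The pseudocode can be made concrete as in the following Python sketch of the encoder side. The BitWriter helper and the handling of character terminals are assumptions for illustration; a decoder would have to mirror the same traversal and code assignment.

import math

class BitWriter:
    """Collects bits in a list; a real encoder would pack them into bytes."""
    def __init__(self):
        self.bits = []
    def write_bit(self, b):
        self.bits.append(b)
    def write_value(self, value, width):     # value in binary, fixed width
        self.bits.extend((value >> i) & 1 for i in reversed(range(width)))

def encode_cfg(axiom, rules):
    """Implicit (depth-first) encoding of a grammar; rules maps each
    non-terminal to its pair (X, Y), terminals are characters or bytes."""
    out = BitWriter()
    num_rules_encoded = 256           # extended ASCII terminals by default
    codes = {}                        # non-terminal -> code, set on first visit

    def write_symbol(s):
        value = codes.get(s, ord(s) if isinstance(s, str) else s)
        width = math.ceil(math.log2(num_rules_encoded))   # initially 8 bits
        out.write_value(value, width)

    def rec(s):
        nonlocal num_rules_encoded
        if s in rules and s not in codes:   # first visit to a non-terminal
            x, y = rules[s]
            out.write_bit(1)                # 1 announces a new rule
            rec(x)
            rec(y)
            codes[s] = num_rules_encoded    # next free code, starting at 256
            num_rules_encoded += 1
        else:
            out.write_bit(0)                # 0 announces an already-known symbol
            write_symbol(s)

    for s in axiom:                         # invoked on every axiom symbol
        rec(s)
        out.write_bit(1)
    return out.bits

# For example: axiom, rules = re_pair("xabcabcy123123zabc")
#              bits = encode_cfg(axiom, rules)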

Another possibility is to separate the rules of the grammar into generations, such that a rule X → YZ belongs to generation i if and only if at least one of Y and Z belongs to generation i − 1 and the other belongs to some generation j with j ≤ i − 1. These generations are then encoded successively, starting from the first one (see the sketch below). This was the method proposed originally when Re-Pair was first introduced. However, most implementations of Re-Pair use the implicit encoding method due to its simplicity and good performance. Furthermore, it allows on-the-fly decompression.
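Under this definition, the generation of a rule is one more than the largest generation of its two right-hand-side symbols; the following minimal sketch reuses the rules mapping of the first sketch and assumes terminals count as generation 0.

def generations(rules):
    """Map each non-terminal to its generation; terminals count as generation 0."""
    gen = {}
    def g(s):
        if s not in rules:                   # terminal
            return 0
        if s not in gen:
            x, y = rules[s]
            gen[s] = max(g(x), g(y)) + 1     # one more than the deepest child
        return gen[s]
    for s in rules:
        g(s)
    return gen

# For the running example, R1 -> ab and R3 -> 12 fall in generation 1,
# while R2 -> R1 c and R4 -> R3 3 fall in generation 2.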

Versions

A number of different implementations of Re-Pair exist. Each of these versions aims at improving one specific aspect of the algorithm, such as reducing the runtime, reducing the space consumption, or increasing the compression ratio.

Phrase browsing [2] (2003): Instead of manipulating the input string as a sequence of characters, this tool first groups the characters into phrases (for instance, words). The compression algorithm works as Re-Pair but considers the identified phrases as the terminals of the grammar. The tool accepts different options to decide which sorts of phrases are considered and encodes the resulting grammar into separate files: one containing the axiom and one with the rest of the rules.

Original (2011): This is one of the most popular implementations of Re-Pair. It uses the data structures described here (the ones proposed when the algorithm was originally published [1]) and encodes the resulting grammar using the implicit encoding method. Most later versions of Re-Pair are implemented starting from this one.

Encoding [3] (2013): Instead of the implicit encoding method, this implementation uses a variable-length-to-fixed-length method, in which each rule (represented by a string of variable length) is encoded using a code of fixed length.

Memory usage [4] (2017): The algorithm performs in two phases. The first phase considers only the high-frequency pairs, while the low-frequency pairs are considered in the second. The main difference between the two phases is the implementation of the corresponding priority queues.

Compression [5] (2017): This version modifies the way the next pair to be replaced is chosen. Instead of simply taking the most frequent pair, it employs a heuristic that penalizes pairs which are not coherent with a Lempel–Ziv factorisation of the input string.

Compression [6] (2018): MR-RePair reduces the size of the grammar generated by Re-Pair by replacing maximal repeats first. When a pair is identified as the most frequent one, i.e. the one to be replaced in the current step of the algorithm, MR-RePair extends the pair to the longest string that occurs the same number of times. The provided implementation encodes the grammar by simply listing the rules as text, hence the tool is intended purely for research and cannot be used for compression as it is.

See also

- Context-free grammar
- Huffman coding
- Pushdown automaton
- Burrows–Wheeler transform
- L-system
- LL parser
- Lempel–Ziv–Welch
- LZ77 and LZ78
- Golomb coding
- Probabilistic context-free grammar
- Categorial grammar
- Parsing expression grammar
- Packrat parser
- Sequitur algorithm
- String rewriting system
- ID/LP grammar
- Grammar-based code
- Alphabet (formal languages)
- Bentley–Ottmann algorithm
- Controlled grammar

References

  1. Larsson, N. J., & Moffat, A. (2000). Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 1722–1732.
  2. Wan, R. (2003). Browsing and Searching Compressed Documents. PhD thesis, University of Melbourne, Australia.
  3. Yoshida, S., & Kida, T. (2013). Effective variable-length-to-fixed-length coding via a Re-Pair algorithm. In Proceedings of the Data Compression Conference 2013 (DCC 2013), p. 532. IEEE.
  4. Bille, P., Gørtz, I. L., & Prezza, N. (2017). Space-efficient Re-Pair compression. In 2017 Data Compression Conference (DCC), pp. 171–180. IEEE.
  5. Gańczorz, M., & Jeż, A. (2017). Improvements on Re-Pair grammar compressor. In 2017 Data Compression Conference (DCC), pp. 181–190. IEEE.
  6. Furuya, I., Takagi, T., Nakashima, Y., Inenaga, S., Bannai, H., & Kida, T. (2018). MR-RePair: Grammar compression based on maximal repeats. arXiv preprint arXiv:1811.04596.