Re-Pair
   HOME

TheInfoList



OR:

Re-Pair (short for Recursive Pairing) is a grammar-based compression algorithm that, given an input text, builds a
straight-line program In mathematics, more specifically in computational algebra, a straight-line program (SLP) for a finite group ''G'' = ⟨''S''⟩ is a finite sequence ''L'' of elements of ''G'' such that every element of ''L'' either belongs to ''S'', i ...
, i.e. a
context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
generating a single string: the input text. The grammar is built by recursively replacing the most frequent pair of characters occurring in the text. Once there is no pair of characters occurring twice, the resulting string is used as the axiom of the grammar. Therefore, the output grammar is such that all rules but the axiom have two symbols on the right-hand side.


How it works

Re-Pair was first introduced by NJ. Larsson and A. MoffatLarsson, N. J., & Moffat, A. (2000). Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 1722-1732. in 1999. In their paper the algorithm is presented together with a detailed description of the data structures required to implement it with linear time and space complexity. The experiments showed that Re-Pair achieves high compression ratios and offers good performance for decompression. However, the major drawback of the algorithm is its memory consumption, which is approximately 5 times the size of the input. Such memory usage is required in order to perform the compression in linear time but makes the algorithm impractical for compressing large files. The image on the right shows how the algorithm works compresses the string w=xabcabcy123123zabc. During the first iteration, the pair ab, which occurs three times in w, is replaced by a new symbol R_1. On the second iteration, the most frequent pair in the string w=xR_1cR_1cy123123zR_1c, which is R_1c, is replaced by a new symbol R_2. Thus, at the end of the second iteration, the remaining string is w=xR_2R_2y123123zR_2. In the next two iterations, the pairs 12 and R_3 are replaced by symbols R_3 and R_4 respectively. Finally, the string w=xR_2R_2yR_4R_4zR_2 contains no repeated pair and therefore it is used as the axiom of the output grammar.


Data structures

In order to achieve linear time complexity, Re-Pair requires the following data structures * A sequence representing the input string. Position i of the sequence contains the i-th symbol of the input string plus two references to other positions in the sequence. These references point to the next/previous positions, say k and m, such that the same substring begins at w /math>, w /math> and w /math> and all three occurrences are captured by the same reference (i.e. there is a variable in the grammar generating the string). * A priority queue. Each element of the queue is a pair of symbols (terminals or previously defined pairs) that occur consecutively in the sequence. The priority of a pair is given by the number of occurrences of the pair in the remaining sequence. Each time a new pair is created, the priority queue is updated. * A hash table to keep track of already defined pairs. This table is updated each time a new pair is created or removed. Since the hash table and the priority queue refer to the same elements (pairs), they can be implemented by a common data structure called PAIR with pointers for the hash table (h_next) and the priority queue (p_next and p_prev). Furthermore, each PAIR points to the beginning of the first (f_pos) and the last (b_pos) occurrences of the string represented by the PAIR in the sequence. The following picture shows an overview of this data structure. The following two pictures show an example of how these data structures look after the initialization and after applying one step of the pairing process (pointers to NULL are not displayed):


Encoding the grammar

Once the grammar has been built for a given input string, in order to achieve effective compression, this grammar has to be encoded efficiently. One of the simplest methods for encoding the grammar is the implicit encoding, which consists on invoking function encodeCFG(X), described below, sequentially on all the axiom's symbols. Intuitively, rules are encoded as they are visited in a depth-first traversal of the grammar. The first time a rule is visited, its right hand side is encoded recursively and a new code is assigned to the rule. From that point, whenever the rule is reached, the assigned value is written. num_rules_encoded = 256 // By default, the extended ASCII charset are the terminals of the grammar. writeSymbol(symbol s) void encodeCFG_rec(symbol s) void encodeCFG(symbol s) Another possibility is to separate the rules of the grammar into generations such that a rule X \to YZ belongs to generation i iff at least one of Y or Z belongs to generation i1 and the other belongs to generation j with j \leq i1. Then these generations are encoded subsequently starting from generation 0. This was the method proposed originally when Re-Pair was first introduced. However, most implementations of Re-Pair use the implicit encoding method due to its simplicity and good performance. Furthermore, it allows on-the-fly decompression.


Versions

There exists a number of different implementations of Re-Pair. Each of these versions aims at improving one specific aspect of the algorithm, such as reducing the runtime, reducing the space consumption or increasing the compression ratio.


See also

*
Byte pair encoding Byte pair encoding or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the ...
*
Sequitur algorithm Sequitur (or Nevill-Manning algorithm) is a recursive algorithm developed by Craig Nevill-Manning and Ian H. Witten in 1997 that infers a hierarchical structure (context-free grammar) from a sequence of discrete symbols. The algorithm operates in ...


References

{{Compression Methods Compression algorithms Lossless compression algorithms