GLZA
   HOME

TheInfoList



OR:

Grammar-based codes or Grammar-based compression are
compression Compression may refer to: Physical science *Compression (physics), size reduction due to forces *Compression member, a structural element such as a column *Compressibility, susceptibility to compression * Gas compression *Compression ratio, of a ...
algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal
lossless data compression Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistic ...
algorithms. To compress a data sequence x = x_1 \cdots x_n, a grammar-based code transforms x into a context-free grammar G. The problem of finding a smallest grammar for an input sequence (
smallest grammar problem In data compression and the theory of formal languages, the smallest grammar problem is the problem of finding the smallest context-free grammar that generates a given string of characters (but no other string). The size of a grammar is defined by ...
) is known to be NP-hard, so many grammar-transform algorithms are proposed from theoretical and practical viewpoints. Generally, the produced grammar G is further compressed by statistical encoders like
arithmetic coding Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic ...
.


Examples and characteristics

The class of grammar-based codes is very broad. It includes
block code In coding theory, block codes are a large and important family of error-correcting codes that encode data in blocks. There is a vast number of examples for block codes, many of which have a wide range of practical applications. The abstract defini ...
s, the multilevel pattern matching (MPM) algorithm, variations of the incremental parsing Lempel-Ziv code, and many other new universal lossless compression algorithms. Grammar-based codes are universal in the sense that they can achieve asymptotically the
entropy rate In the mathematical theory of probability, the entropy rate or source information rate of a stochastic process is, informally, the time density of the average information in a stochastic process. For stochastic processes with a countable index, the ...
of any stationary, ergodic source with a finite alphabet.


Practical algorithms

The compression programs of the following are available from external links. * Sequitur is a classical grammar compression algorithm that sequentially translates an input text into a CFG, and then the produced CFG is encoded by an arithmetic coder. *
Re-Pair Re-Pair (short for Recursive Pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar ...
is a greedy algorithm using the strategy of most-frequent-first substitution. The compressive performance is powerful, although the main memory space requirement is very large. * GLZA, which constructs a grammar that may be reducible, i.e., contain repeats, where the entropy-coding cost of "spelling out" the repeats is less than the cost creating and entropy-coding a rule to capture them. (In general, the compression-optimal SLG is not irreducible, and the Smallest Grammar Problem is different from the actual SLG compression problem.)


See also

*
Dictionary coder A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (call ...
*
Grammar induction Grammar induction (or grammatical inference) is the process in machine learning of learning a formal grammar (usually as a collection of ''re-write rules'' or '' productions'' or alternatively as a finite state machine or automaton of some kind) fr ...
*
Straight-line grammar A straight-line grammar (sometimes abbreviated as SLG) is a formal grammar that generates exactly one string.Florian Benz and Timo Kötzing, “An effective heuristic for the smallest grammar problem,” Proceedings of the fifteenth annual conferen ...


References


External links


GLZA discussion and paperDescription of grammar-based codes with exampleSequitur codes


a version of Gonzalo Navarro.
GrammarViz 2.0
- implementation of Sequitur, Re-Pair, and parallel Re-Pair in Java. {{Compression methods Data compression Coding theory Information theory