PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory usage). Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge. PAQ is free software distributed under the GNU General Public License.

Algorithm

PAQ uses a context mixing algorithm. Context mixing is related to prediction by partial matching (PPM) in that the compressor is divided into a predictor and an arithmetic coder, but differs in that the next-symbol prediction is computed using a weighted combination of probability estimates from a large number of models conditioned on different contexts. Unlike PPM, a context doesn't need to be contiguous. Most PAQ versions collect next-symbol statistics for the following contexts:

* ''n''-grams; the context is the last ''n'' bytes before the predicted symbol (as in PPM);
* whole-word ''n''-grams, ignoring case and nonalphabetic characters (useful in text files);
* "sparse" contexts, for example, the second and fourth bytes preceding the predicted symbol (useful in some binary formats);
* "analog" contexts, consisting of the high-order bits of previous 8- or 16-bit words (useful for multimedia files);
* two-dimensional contexts (useful for images, tables, and spreadsheets); the row length is determined by finding the stride length of repeating byte patterns;
* specialized models for formats such as x86 executables and BMP, TIFF, or JPEG images; these models are active only when the particular file type is detected.

All PAQ versions predict and compress one bit at a time, but differ in the details of the models and how the predictions are combined and postprocessed. Once the next-bit probability is determined, it is encoded by arithmetic coding.

There are three methods for combining predictions, depending on the version:

* In PAQ1 through PAQ3, each prediction is represented as a pair of bit counts (''n''₀, ''n''₁). These counts are combined by weighted summation, with greater weights given to longer contexts.
* In PAQ4 through PAQ6, the predictions are combined as before, but the weights assigned to each model are adjusted to favor the more accurate models.
* In PAQ7 and later, each model outputs a probability rather than a pair of counts. The probabilities are combined using an artificial neural network.

PAQ1SSE and later versions postprocess the prediction using secondary symbol estimation (SSE). The combined prediction and a small context are used to look up a new prediction in a table. After the bit is encoded, the table entry is adjusted to reduce the prediction error. SSE stages can be pipelined with different contexts or computed in parallel with the outputs averaged.
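The SSE lookup and update described above can be sketched in Python as follows. This is a minimal illustration, not code from any PAQ release: the class name `SSEStage`, the 33-bucket quantization of the stretched probability, and the adaptation rate of 0.02 are all assumptions, and the interpolation between neighboring table entries used by real implementations is omitted.

```python
import math

def stretch(p):
    """ln(p / (1 - p)), with p clamped away from 0 and 1 (the clamp is an assumption)."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))

def squash(x):
    """Inverse of stretch."""
    return 1 / (1 + math.exp(-x))

class SSEStage:
    """Toy secondary symbol estimation stage (hypothetical layout)."""
    BUCKETS = 33  # stretch(p) quantized to [-8, 8] in steps of 0.5

    def __init__(self, n_contexts, rate=0.02):
        # Start as the identity mapping: each entry holds its bucket's own probability.
        row = [squash((j - 16) / 2) for j in range(self.BUCKETS)]
        self.table = [list(row) for _ in range(n_contexts)]
        self.rate = rate
        self.slot = None

    def refine(self, p, ctx):
        """Look up a new prediction from the input prediction and a small context."""
        s = min(max(stretch(p), -8.0), 8.0)
        j = round((s + 8) * 2)
        self.slot = (ctx, j)
        return self.table[ctx][j]

    def update(self, bit):
        """Move the entry used by the last refine() toward the coded bit."""
        ctx, j = self.slot
        self.table[ctx][j] += (bit - self.table[ctx][j]) * self.rate
```

If the input prediction in some context is consistently too cautious, the corresponding table entry drifts toward the observed bit frequency while entries for other contexts are left unchanged.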

Arithmetic coding

A string ''s'' is compressed to the shortest byte string representing a base-256 big-endian number ''x'' in the range [0, 1] such that P(''r'' < ''s'') ≤ ''x'' < P(''r'' ≤ ''s''), where P(''r'' < ''s'') is the probability that a random string ''r'' with the same length as ''s'' will be lexicographically less than ''s''. It is always possible to find an ''x'' such that the length of ''x'' is at most one byte longer than the Shannon limit, −log₂P(''r'' = ''s'') bits. The length of ''s'' is stored in the archive header.

The arithmetic coder in PAQ is implemented by maintaining for each prediction a lower and upper bound on ''x'', initially [0, 1]. After each prediction, the current range is split into two parts in proportion to P(0) and P(1), the probability that the next bit of ''s'' will be a 0 or 1 respectively, given the previous bits of ''s''. The next bit is then encoded by selecting the corresponding subrange to be the new range.

The number ''x'' is decompressed back to string ''s'' by making an identical series of bit predictions (since the previous bits of ''s'' are known). The range is split as with compression. The portion containing ''x'' becomes the new range, and the corresponding bit is appended to ''s''.

In PAQ, the lower and upper bounds of the range are represented in 3 parts. The most significant base-256 digits are identical, so they can be written as the leading bytes of ''x''. The next 4 bytes are kept in memory, such that the leading byte is different. The trailing bits are assumed to be all zeros for the lower bound and all ones for the upper bound. Compression is terminated by writing one more byte from the lower bound.
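As a rough illustration of the range-splitting procedure described above, the following Python sketch implements a carry-less binary arithmetic coder with 32-bit bounds. The names `BitPredictor`, `split`, `compress` and `decompress` are illustrative, the order-0 counting predictor is a toy stand-in for PAQ's models, and the 12-bit probability resolution is an assumption. For simplicity the final flush writes all four bytes of the lower bound rather than the single byte mentioned in the text.

```python
MASK = 0xFFFFFFFF

class BitPredictor:
    """Toy adaptive order-0 model (an assumption, far simpler than PAQ's models)."""
    def __init__(self):
        self.n0, self.n1 = 1, 1

    def p1(self):
        return self.n1 / (self.n0 + self.n1)

    def update(self, bit):
        if bit:
            self.n1 += 1
        else:
            self.n0 += 1

def split(x1, x2, p1):
    """Split [x1, x2] in proportion to P(1); the lower part [x1, xmid] codes a 1."""
    p12 = min(4095, max(1, int(p1 * 4096)))  # 12-bit probability, never 0 or 1
    return x1 + ((x2 - x1) >> 12) * p12

def compress(bits):
    model, out = BitPredictor(), bytearray()
    x1, x2 = 0, MASK
    for bit in bits:
        xmid = split(x1, x2, model.p1())
        if bit:
            x2 = xmid
        else:
            x1 = xmid + 1
        model.update(bit)
        # Leading bytes that agree are final: write them and shift them out.
        while ((x1 ^ x2) & 0xFF000000) == 0:
            out.append(x2 >> 24)
            x1 = (x1 << 8) & MASK
            x2 = ((x2 << 8) & MASK) | 0xFF
    out += x1.to_bytes(4, "big")  # simplified flush: the whole lower bound
    return bytes(out)

def decompress(data, nbits):
    model, bits = BitPredictor(), []
    x1, x2 = 0, MASK
    x = int.from_bytes(data[:4], "big")
    pos = 4
    for _ in range(nbits):
        xmid = split(x1, x2, model.p1())
        bit = 1 if x <= xmid else 0  # which part of the range contains x?
        if bit:
            x2 = xmid
        else:
            x1 = xmid + 1
        model.update(bit)
        bits.append(bit)
        while ((x1 ^ x2) & 0xFF000000) == 0:
            x1 = (x1 << 8) & MASK
            x2 = ((x2 << 8) & MASK) | 0xFF
            nxt = data[pos] if pos < len(data) else 0
            x = ((x << 8) & MASK) | nxt
            pos += 1
    return bits
```

The decoder needs the number of bits to recover, which mirrors the length of ''s'' being stored in the archive header. Because both sides build the same predictor from the same bit history, they split the range identically at every step.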

Adaptive model weighting

In PAQ versions through PAQ6, each model maps a set of distinct contexts to a pair of counts, ''n''₀, a count of zero bits, and ''n''₁, a count of 1 bits. In order to favor recent history, half of the count over 2 is discarded when the opposite bit is observed. For example, if the current state associated with a context is (''n''₀, ''n''₁) = (12, 3) and a 1 is observed, then the counts are updated to (7, 4).

A bit is arithmetically coded with space proportional to its probability, either P(1) or P(0) = 1 − P(1). The probabilities are computed by weighted addition of the 0 and 1 counts:

* ''S''₀ = Σ_''i'' ''w_i'' ''n''₀''_i'',
* ''S''₁ = Σ_''i'' ''w_i'' ''n''₁''_i'',
* ''S'' = ''S''₀ + ''S''₁,
* P(0) = ''S''₀ / ''S'',
* P(1) = ''S''₁ / ''S'',

where ''w_i'' is the weight of the ''i''-th model and (''n''₀''_i'', ''n''₁''_i'') are its counts. Through PAQ3, the weights were fixed and set in an ad-hoc manner. (Order-''n'' contexts had a weight of ''n''².) Beginning with PAQ4, the weights were adjusted adaptively in the direction that would reduce future errors in the same context set. If the bit to be coded is ''y'', then the weight adjustment is:

* ''n_i'' = ''n''₀''_i'' + ''n''₁''_i'',
* error = ''y'' − P(1),
* ''w_i'' ← ''w_i'' + [(''S'' ''n''₁''_i'' − ''S''₁ ''n_i'') / (''S''₀ ''S''₁)] error.
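The count discounting, weighted summation and weight adjustment above can be written out directly. The following Python sketch follows the formulas as stated. The function names are illustrative, and it assumes that ''S''₀ and ''S''₁ are both nonzero (practical implementations guard against division by zero, which is not shown here).

```python
def update_counts(n0, n1, bit):
    """Increment the observed bit's count and discard half of the
    opposite count's excess over 2."""
    if bit:
        n1 += 1
        if n0 > 2:
            n0 = 2 + (n0 - 2) // 2
    else:
        n0 += 1
        if n1 > 2:
            n1 = 2 + (n1 - 2) // 2
    return n0, n1

def weighted_sums(counts, weights):
    """Return (S0, S1) for a list of (n0, n1) pairs and their weights."""
    s0 = sum(w * n0 for w, (n0, _) in zip(weights, counts))
    s1 = sum(w * n1 for w, (_, n1) in zip(weights, counts))
    return s0, s1

def predict_one(counts, weights):
    """P(1) = S1 / S."""
    s0, s1 = weighted_sums(counts, weights)
    return s1 / (s0 + s1)

def update_weights(counts, weights, bit):
    """Apply w_i <- w_i + [(S n1_i - S1 n_i) / (S0 S1)] * error."""
    s0, s1 = weighted_sums(counts, weights)
    s = s0 + s1
    error = bit - s1 / s
    return [
        w + (s * n1 - s1 * (n0 + n1)) / (s0 * s1) * error
        for w, (n0, n1) in zip(weights, counts)
    ]
```

For two models with counts (12, 3) and (1, 5) and equal weights, P(1) = 8/21. If a 1 is then coded, the second model, which favored a 1, gains weight and the first loses weight.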

Neural-network mixing

Beginning with PAQ7, each model outputs a prediction (instead of a pair of counts). These predictions are averaged in the logistic domain:

* ''x_i'' = stretch(P_''i''(1)),
* P(1) = squash(Σ_''i'' ''w_i'' ''x_i''),

where P(1) is the probability that the next bit will be a 1, P_''i''(1) is the probability estimated by the ''i''-th model, and

* stretch(''x'') = ln(''x'' / (1 − ''x'')),
* squash(''x'') = 1 / (1 + ''e''^−''x'') (inverse of stretch).

After each prediction, the model is updated by adjusting the weights to minimize coding cost:

* ''w_i'' ← ''w_i'' + η ''x_i'' (''y'' − P(1)),

where η is the learning rate (typically 0.002 to 0.01), ''y'' is the actual value of the bit that was predicted, and (''y'' − P(1)) is the prediction error. The weight update algorithm differs from backpropagation in that the terms P(1)P(0) are dropped. This is because the goal of the neural network is to minimize coding cost, not root mean square error.

Most versions of PAQ use a small context to select among sets of weights for the neural network. Some versions use multiple networks whose outputs are combined with one more network prior to the SSE stages. Furthermore, for each input prediction there may be several inputs which are nonlinear functions of P_''i''(1) in addition to stretch(P_''i''(1)).
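The logistic mixing and weight update above amount to a single-layer network trained online. The following Python sketch illustrates this with one weight set. The class name `Mixer`, the initial weight of 0.3 and the clamping inside `stretch` are assumptions made for the example, and the selection of weight sets by context is not shown.

```python
import math

def stretch(p):
    """ln(p / (1 - p)), with p clamped away from 0 and 1 (the clamp is an assumption)."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))

def squash(x):
    """1 / (1 + e^-x), the inverse of stretch."""
    return 1 / (1 + math.exp(-x))

class Mixer:
    """Combines model probabilities in the logistic domain."""

    def __init__(self, n_inputs, learning_rate=0.005, initial_weight=0.3):
        self.w = [initial_weight] * n_inputs
        self.lr = learning_rate
        self.x = [0.0] * n_inputs
        self.p = 0.5

    def predict(self, probs):
        """P(1) = squash(sum_i w_i * stretch(P_i(1)))."""
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, y):
        """w_i <- w_i + eta * x_i * (y - P(1)), using the last prediction."""
        err = y - self.p
        self.w = [w + self.lr * x * err for w, x in zip(self.w, self.x)]
```

With two models that persistently predict 0.9 and 0.1 while every coded bit is 1, the mixer starts at P(1) = 0.5 and shifts weight toward the first model until the mixed prediction exceeds both inputs' average by a wide margin.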

Context modeling

Each model partitions the known bits of ''s'' into a set of contexts and maps each context to a bit history represented by an 8-bit state. In versions through PAQ6, the state represents a pair of counters (''n''₀, ''n''₁). In PAQ7 and later versions under certain conditions, the state also represents the value of the last bit or the entire sequence. The states are mapped to probabilities using a 256-entry table for each model. After a prediction by the model, the table entry is adjusted slightly (typically by 0.4%) to reduce the prediction error.

In all PAQ8 versions, the representable states are as follows:

* The exact bit sequence for up to 4 bits.
* A pair of counts and an indicator of the most recent bit for sequences of 5 to 15 bits.
* A pair of counts for sequences of 16 to 41 bits.

To keep the number of states to 256, the following limits are placed on the representable counts: (41, 0), (40, 1), (12, 2), (5, 3), (4, 4), (3, 5), (2, 12), (1, 40), (0, 41). If a count exceeds this limit, then the next state is one chosen to have a similar ratio of ''n''₀ to ''n''₁. Thus, if the current state is (''n''₀ = 4, ''n''₁ = 4, last bit = 0) and a 1 is observed, then the new state is not (''n''₀ = 4, ''n''₁ = 5, last bit = 1). Rather, it is (''n''₀ = 3, ''n''₁ = 4, last bit = 1).

Most context models are implemented as hash tables. Some small contexts are implemented as direct lookup tables.
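The two-step mapping described above (context to 8-bit state, then state to probability through an adaptive 256-entry table) can be sketched as follows. This Python example is illustrative only: `next_state` is a hypothetical stand-in that records the last seven bits behind a sentinel bit rather than PAQ8's actual state table, a `dict` stands in for the hash table, and the 0.004 rate is taken from the "0.4%" figure above.

```python
class StateMap:
    """256-entry table mapping a bit-history state to P(1)."""

    def __init__(self, rate=0.004):
        self.p = [0.5] * 256
        self.rate = rate

    def predict(self, state):
        return self.p[state]

    def update(self, state, bit):
        # Nudge the entry a small fraction of the way toward the observed bit.
        self.p[state] += (bit - self.p[state]) * self.rate

def next_state(state, bit):
    """Hypothetical 8-bit state: a leading 1 sentinel followed by up to 7 recent bits.
    This is NOT the PAQ8 state machine, only a placeholder with the same width."""
    s = (state << 1) | bit
    if s > 255:
        s = (s & 0x7F) | 0x80  # keep the sentinel and the 7 most recent bits
    return s

class ContextModel:
    """Maps each context to a state, and each state to a probability."""

    def __init__(self):
        self.states = {}  # dict standing in for a fixed-size hash table
        self.state_map = StateMap()
        self.ctx = None

    def predict(self, ctx):
        self.ctx = ctx
        return self.state_map.predict(self.states.get(ctx, 1))

    def update(self, bit):
        state = self.states.get(self.ctx, 1)
        self.state_map.update(state, bit)
        self.states[self.ctx] = next_state(state, bit)
```

Because the probability table is indexed by state rather than by context, what the model learns about a given bit history in one context carries over to every other context that reaches the same state.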

Text preprocessing

Some versions of PAQ, in particular PAsQDa, PAQAR (both PAQ6 derivatives), and PAQ8HP1 through PAQ8HP8 (PAQ8 derivatives and Hutter Prize recipients) preprocess text files by looking up words in an external dictionary and replacing them with 1- to 3-byte codes. In addition, uppercase letters are encoded with a special character followed by the lowercase letter. In the PAQ8HP series, the dictionary is organized by grouping syntactically and semantically related words together. This allows models to use just the most significant bits of the dictionary codes as context.
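A reversible transform of this kind can be sketched in a few lines of Python. The example below is a simplified illustration and not the format used by any PAQ version: it assigns each dictionary word a single code point from 0x80 upward instead of 1- to 3-byte codes, uses the hypothetical escape character `\x01` to flag uppercase letters, and requires ASCII input that does not already contain the escape character.

```python
import re

ESC = "\x01"  # hypothetical marker meaning "the next letter was uppercase"

def encode(text, words):
    """Flag capitals with ESC, then replace dictionary words by one-character codes."""
    if any(c == ESC or ord(c) >= 0x80 for c in text):
        raise ValueError("input must be ASCII without the escape character")
    codes = {w: chr(0x80 + i) for i, w in enumerate(words)}
    lowered = re.sub(r"[A-Z]", lambda m: ESC + m.group().lower(), text)
    return re.sub(r"[a-z]+", lambda m: codes.get(m.group(), m.group()), lowered)

def decode(data, words):
    """Expand codes back to words, then restore the flagged capitals."""
    expanded = "".join(
        words[ord(c) - 0x80] if ord(c) >= 0x80 else c for c in data
    )
    return re.sub(ESC + r"([a-z])", lambda m: m.group(1).upper(), expanded)
```

With the dictionary `["the", "and", "cat"]`, the text `The cat and the Dog` shrinks from 19 characters to 13 and decodes back unchanged. Note that "The" and "the" share one dictionary entry because the capital is carried by the escape character.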

Comparison

The following table is a sample from the Large Text Compression Benchmark by Matt Mahoney, which uses a file of 10⁹ bytes (1 GB, or 0.931 GiB) of English Wikipedia text. See Lossless compression benchmarks for a list of file compression benchmarks.

History

The following lists the major enhancements to the PAQ algorithm. In addition, there have been a large number of incremental improvements, which are omitted.

* PAQ1 was released on January 6, 2002 by Matt Mahoney. It used fixed weights and did not include an analog or sparse model.
* PAQ1SSE/PAQ2 was released on May 11, 2003 by Serge Osnach. It significantly improved compression by adding a Secondary Symbol Estimation (SSE) stage between the predictor and encoder. SSE inputs a short context and the current prediction and outputs a new prediction from a table. The table entry is then adjusted to reflect the actual bit value.
* PAQ3N, released October 9, 2003, added a sparse model.
* PAQ4, released November 15, 2003 by Matt Mahoney, used adaptive weighting. PAQ5 (December 18, 2003) and PAQ6 (December 30, 2003) were minor improvements, including a new analog model. At this point, PAQ was competitive with the best PPM compressors and attracted the attention of the data compression community, which resulted in a large number of incremental improvements through April 2004. Berto Destasio tuned the models and adjusted the bit count discounting schedule. Johan de Bock made improvements to the user interface. David A. Scott made improvements to the arithmetic coder. Fabio Buffoni made speed improvements.
* During the period May 20, 2004 through July 27, 2004, Alexander Ratushnyak released seven versions of PAQAR, which made significant compression improvements by adding many new models, multiple mixers with weights selected by context, an SSE stage on each mixer output, and a preprocessor to improve the compression of Intel executable files. PAQAR stood as the top-ranked compressor through the end of 2004 but was significantly slower than prior PAQ versions.
* During the period January 18, 2005 through February 7, 2005, Przemyslaw Skibinski released four versions of PAsQDa, based on PAQ6 and PAQAR with the addition of an English dictionary preprocessor. It achieved the top ranking on the Calgary corpus but not on most other benchmarks.
* A modified version of PAQ6 by Matt Mahoney won the Calgary Challenge on January 10, 2004. This was bettered by ten subsequent versions of PAQAR by Alexander Ratushnyak. The most recent was submitted on June 5, 2006, consisting of compressed data and program source code totaling 589,862 bytes.
* PAQ7 was released in December 2005 by Matt Mahoney. PAQ7 is a complete rewrite of PAQ6 and its variants (PAQAR, PAsQDa). Its compression ratio was similar to PAQAR's, but it was 3 times faster. However, it lacked an x86 model and a dictionary, so it did not compress Windows executables and English text files as well as PAsQDa. It does include models for color BMP, TIFF and JPEG files, so it compresses these files better. The primary difference from PAQ6 is that it uses a neural network to combine models rather than a gradient descent mixer. Another feature is PAQ7's ability to compress embedded JPEG and bitmap images in Excel, Word and PDF files.
* PAQ8A was released on January 27, 2006, and PAQ8C on February 13, 2006. These were experimental pre-releases of the anticipated PAQ8. They fixed several issues in PAQ7 (poor compression in some cases). PAQ8A also included a model for compressing x86 executables.
* PAQ8F was released on February 28, 2006. PAQ8F had 3 improvements over PAQ8A: a more memory-efficient context model, a new indirect context model to improve compression, and a new user interface to support drag and drop in Windows. It does not use an English dictionary like the PAQ8B/C/D/E variants.
* PAQ8G was released on March 3, 2006 by Przemyslaw Skibinski. PAQ8G is PAQ8F with dictionaries added and some other improvements, such as a redesigned TextFilter (which does not decrease compression performance on non-textual files).
* PAQ8H was released on March 22, 2006 by Alexander Ratushnyak and updated on March 24, 2006. PAQ8H is based on PAQ8G with some improvements to the model.
* PAQ8I was released on August 18, 2006 by Pavel L. Holoborodko, with bug fixes on August 24, September 4, and September 13. It added a grayscale image model for PGM files.
* PAQ8J was released on November 13, 2006 by Bill Pettis. It was based on PAQ8F with some text model improvements taken from PAQ8HP5. Thus, it did not include the text dictionaries from PAQ8G or the PGM model from PAQ8I.
* Serge Osnach released a series of modeling improvements: PAQ8JA on November 16, 2006, PAQ8JB on November 21, and PAQ8JC on November 28.
* PAQ8JD was released on December 30, 2006 by Bill Pettis. This version has since been ported to 32-bit Windows for several processors, and to 32- and 64-bit Linux.
* PAQ8K was released on February 13, 2007 by Bill Pettis. It includes additional models for binary files.
* PAQ8L was released on March 8, 2007 by Matt Mahoney. It is based on PAQ8JD and adds a DMC model.
* PAQ8O was released on August 24, 2007 by Andreas Morphis. It contains improved BMP and JPEG models over PAQ8L. It can optionally be compiled with SSE2 support and for 64-bit Linux. The algorithm has notable performance benefits under a 64-bit OS.
* PAQ8P was released on August 25, 2008 by Andreas Morphis. It contains an improved BMP model and adds a WAV model.
* PAQ8PX was released on April 25, 2009 by Jan Ondrus. It contains various improvements, such as better WAV compression and EXE compression.
* PAQ8KX was released on July 15, 2009 by Jan Ondrus. It is a combination of PAQ8K with PAQ8PX.
* PAQ8PF was released on September 9, 2009 by LovePimple without source code (which the GPL requires). It compresses 7% worse, but is 7 times faster compared to PAQ8PX v66 (measured with 1 MB of English text).
* PAQ9A was released on December 31, 2007 by Matt Mahoney. It is a new experimental version that does not include models for specific file types, has an LZP preprocessor and supports files over 2 GB.
* ZPAQ was released on March 12, 2009 by Matt Mahoney. It uses a new archive format designed so that the current ZPAQ program will be able to decompress archives created by future ZPAQ versions (the various PAQ variants listed above are not forward compatible in this fashion). It achieves this by specifying the decompression algorithm in a bytecode program that is stored in each created archive file.

Hutter Prizes

The series PAQ8HP1 through PAQ8HP8 were released by Alexander Ratushnyak from August 21, 2006 through January 18, 2007 as Hutter Prize submissions. The Hutter Prize is a text compression contest using a 100 MB English and XML data set derived from Wikipedia's source. The PAQ8HP series was forked from PAQ8H. The programs include text preprocessing dictionaries and models tuned specifically to the benchmark. All non-text models were removed. The dictionaries were organized to group syntactically and semantically related words and to group words by common suffix. The former strategy improves compression because related words (which are likely to appear in similar contexts) can be modeled on the high-order bits of their dictionary codes. The latter strategy makes the dictionary easier to compress. The size of the decompression program and compressed dictionary is included in the contest ranking.

On October 27, 2006, it was announced that PAQ8HP5 won a Hutter Prize for Lossless Compression of Human Knowledge of €3,416. On June 30, 2007, Ratushnyak's PAQ8HP12 was awarded a second Hutter Prize of €1,732, improving upon his previous record by 3.46%.

PAQ derivations

Being free software, PAQ can be modified and redistributed by anyone who has a copy. This has allowed other authors to fork the PAQ compression engine and add new features such as a graphical user interface or better speed (at the expense of compression ratio). Notable PAQ derivatives include:

* WinUDA 0.291, based on PAQ6 but faster
* UDA 0.301, based on the PAQ8I algorithm
* KGB, based on PAQ6 (beta version is based on PAQ7)
* Emilcont, based on PAQ6
* PeaZip, a GUI frontend (for Windows and Linux) for LPAQ, ZPAQ and various PAQ8* algorithms
* PWCM (PAQ weighted context mixing), an independently developed closed source implementation of the PAQ algorithm used in WinRK
* PAQCompress, a graphical user interface for several newer PAQ versions, including the latest releases of PAQ8PX, PAQ8PXD and PAQ8PXV. It is updated whenever a new version is released. The software appends an extension to the filename, which it uses to select the correct PAQ version when decompressing the file. The software is open source.
* PerfectCompress, compression software which features UCA (ULTRA Compressed Archive), a compression format that featured PAQ8PX v42 to v65 and that can now use PAQ8PF, PAQ8KX, or PAQ8PXPRE as the default UCA compressor. In addition, PerfectCompress can compress files to PAQ8PX v42 to v67 and ZPAQ and, as of version 6.0, to LPAQ and PAQ8PF beta 1 to beta 3. PerfectCompress v6.10 introduced compression support for the recently released PAQ8PXPRE. PerfectCompress 6.12 introduced support for the PAQ8KX series.
* FrontPAQ, a small GUI for PAQ. The latest version is FrontPAQ v8, supporting PAQ8PX, PAQ8PF, and FP8. The software is no longer updated and users are encouraged to use PAQCompress, which implements the latest PAQ releases.

External links

* Compiled Linux binaries - Linux command-line executables download.