PAQ

[Figure: A sample session of PAQ8O]

PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory usage). Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge.[1] PAQ is free software distributed under the GNU General Public License.[2]


Algorithm

PAQ uses a context mixing algorithm. Context mixing is related to prediction by partial matching (PPM) in that the compressor is divided into a predictor and an arithmetic coder, but differs in that the next-symbol prediction is computed using a weighted combination of probability estimates from a large number of models conditioned on different contexts. Unlike PPM, a context doesn't need to be contiguous. Most PAQ versions collect next-symbol statistics for the following contexts:

- n-grams: contexts consisting of the last n bytes before the predicted symbol (as in PPM);
- whole-word n-grams, ignoring case and nonalphabetic characters (useful in text files);
- "sparse" contexts, for example the second and fourth bytes preceding the predicted symbol (useful in some binary formats);
- "analog" contexts, consisting of the high-order bits of previous 8- or 16-bit words (useful for multimedia files);
- two-dimensional contexts (useful for images, tables, and spreadsheets), where the row length is determined by finding the stride of repeating byte patterns;
- specialized models, such as for x86 executables or BMP, TIFF, and JPEG images, which are active only when the corresponding file type is detected.

All PAQ versions predict and compress one bit at a time, but differ in the details of the models and how the predictions are combined and postprocessed. Once the next-bit probability is determined, it is encoded by arithmetic coding. There are three methods for combining predictions, depending on the version:

- In PAQ1 through PAQ3, each prediction is represented as a pair of bit counts. These counts are combined by weighted summation, with greater weights given to longer contexts.
- In PAQ4 through PAQ6, the predictions are combined as before, but the weight assigned to each model is adjusted adaptively to favor the more accurate models.
- In PAQ7 and later, each model outputs a probability rather than a pair of counts; the probabilities are combined using an artificial neural network.

PAQ1SSE and later versions postprocess the prediction using secondary symbol estimation (SSE). The combined prediction and a small context are used to look up a new prediction in a table. After the bit is encoded, the table entry is adjusted to reduce the prediction error. SSE stages can be pipelined with different contexts or computed in parallel with the outputs averaged.
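
As a minimal C++ sketch of one SSE stage (an illustration, not PAQ's actual code; the table geometry, identity initialization, and 1/32 adaptation rate are assumptions), the combined probability is quantized, the quantized value plus a small context selects a table entry, and the entry is nudged toward the coded bit afterward:

    #include <vector>

    // One simplified SSE stage: refine a prediction p (probability of a 1)
    // using a small context. Entries start as the identity mapping.
    class SSE {
        std::vector<float> table;              // table[ctx * 33 + bucket]
        int lastIndex = 0;
    public:
        explicit SSE(int contexts) : table(contexts * 33) {
            for (int c = 0; c < contexts; ++c)
                for (int b = 0; b <= 32; ++b)
                    table[c * 33 + b] = b / 32.0f;        // identity at start
        }
        // Look up a refined prediction for (input probability, context).
        float predict(float p, int ctx) {
            int bucket = static_cast<int>(p * 32.0f + 0.5f);  // 33 quantization points
            lastIndex = ctx * 33 + bucket;
            return table[lastIndex];
        }
        // After the bit is coded, move the used entry toward the outcome.
        void update(int bit) {
            table[lastIndex] += (bit - table[lastIndex]) / 32.0f;
        }
    };

PAQ's actual SSE stages additionally interpolate between adjacent table entries; the uninterpolated lookup here keeps the sketch short.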

Arithmetic coding

A string s is compressed to the shortest byte string representing a base-256 big-endian number x in the range [0, 1] such that P(r < s) ≤ x < P(r ≤ s), where P(r < s) is the probability that a random string r with the same length as s will be lexicographically less than s. It is always possible to find an x such that the length of x is at most one byte longer than the Shannon limit, −log2 P(r = s) bits. The length of s is stored in the archive header.

The arithmetic coder in PAQ is implemented by maintaining for each prediction a lower and upper bound on x, initially [0, 1]. After each prediction, the current range is split into two parts in proportion to P(0) and P(1), the probability that the next bit of s will be a 0 or 1 respectively, given the previous bits of s. The next bit is then encoded by selecting the corresponding subrange to be the new range.

The number x is decompressed back to string s by making an identical series of bit predictions (since the previous bits of s are known). The range is split as with compression. The portion containing x becomes the new range, and the corresponding bit is appended to s.

In PAQ, the lower and upper bounds of the range are represented in 3 parts. The most significant base-256 digits are identical, so they can be written as the leading bytes of x. The next 4 bytes are kept in memory, such that the leading byte is different. The trailing bits are assumed to be all zeros for the lower bound and all ones for the upper bound. Compression is terminated by writing one more byte from the lower bound.
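
The following C++ sketch shows a binary arithmetic encoder of this shape, assuming 32-bit in-memory bounds and a 12-bit integer probability as the interface; it illustrates the scheme described above rather than reproducing PAQ's exact coder:

    #include <cstdint>
    #include <cstdio>

    // Binary arithmetic encoder: keep 32-bit lower/upper bounds on x,
    // split the range in proportion to P(1), and write out leading bytes
    // once the bounds agree on them. p1 is P(1) scaled to [0, 4096).
    struct Encoder {
        uint32_t low = 0, high = 0xFFFFFFFF;
        FILE* out;
        explicit Encoder(FILE* f) : out(f) {}
        void encode(int bit, uint32_t p1) {
            // Split [low, high] in proportion to P(1).
            uint32_t mid = low + (uint32_t)(((uint64_t)(high - low) * p1) >> 12);
            if (bit) high = mid; else low = mid + 1;
            // Emit bytes the bounds agree on; they are final digits of x.
            while ((low ^ high) < (1u << 24)) {
                putc(low >> 24, out);
                low <<= 8;
                high = (high << 8) | 0xFF;    // trailing 1s for the upper bound
            }
        }
        void flush() { putc(low >> 24, out); } // one more byte of the lower bound
    };

The implicit trailing zeros of the lower bound and trailing ones of the upper bound appear in the shift step, and flush() matches the termination rule of writing one more byte from the lower bound.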

Adaptive model weighting

In PAQ versions through PAQ6, each model maps a set of distinct contexts to a pair of counts, n0, a count of zero bits, and n1, a count of 1 bits. In order to favor recent history, half of the count over 2 is discarded when the opposite bit is observed. For example, if the current state associated with a context is (n0, n1) = (12, 3) and a 1 is observed, then the counts are updated to (7, 4).
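
As a minimal illustration of this update rule (a direct transcription, not PAQ's table-driven implementation):

    #include <utility>

    // Update a (n0, n1) counter pair for an observed bit: increment the
    // matching count and halve the opposing count's excess over 2,
    // e.g. (12, 3) with a 1 observed becomes (7, 4).
    std::pair<int, int> updateCounts(std::pair<int, int> c, int bit) {
        int n0 = c.first, n1 = c.second;
        if (bit) {
            ++n1;
            if (n0 > 2) n0 = 2 + (n0 - 2) / 2;
        } else {
            ++n0;
            if (n1 > 2) n1 = 2 + (n1 - 2) / 2;
        }
        return {n0, n1};
    }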

A bit is arithmetically coded with space proportional to its probability, either P(1) or P(0) = 1 − P(1). The probabilities are computed by weighted addition of the 0 and 1 counts:

    S0 = ε + Σi wi n0i
    S1 = ε + Σi wi n1i
    S = S0 + S1
    P(0) = S0 / S
    P(1) = S1 / S

where wi is the weight of the i-th model, n0i and n1i are that model's bit counts, and ε is a small positive constant that keeps the probabilities away from exactly 0 or 1. Through PAQ3, the weights were fixed and set in an ad hoc manner. (Order-n contexts had a weight of n^2.) Beginning with PAQ4, the weights were adjusted adaptively in the direction that would reduce future errors in the same context set. If the bit to be coded is y, then the weight adjustment is:

    ni = n0i + n1i
    error = y · S − S1
    wi ← max(0, wi + (S · n1i − S1 · ni) · error / (S0 · S1))
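
A C++ sketch transcribing the formulas above; the ε value and the non-negativity clamp are illustrative assumptions rather than PAQ's exact constants:

    #include <vector>

    // PAQ4-PAQ6-style mixing of (n0, n1) counts with adaptive weights.
    struct CountMixer {
        std::vector<double> w;                 // one weight per model
        std::vector<int> n0, n1;               // counts reported per model
        double S0 = 0, S1 = 0, S = 0;
        explicit CountMixer(int m) : w(m, 1.0), n0(m), n1(m) {}

        double predictP1() {
            const double EPS = 0.5;            // assumed small offset
            S0 = EPS; S1 = EPS;
            for (size_t i = 0; i < w.size(); ++i) {
                S0 += w[i] * n0[i];
                S1 += w[i] * n1[i];
            }
            S = S0 + S1;
            return S1 / S;                     // P(1)
        }
        void update(int y) {                   // y = the coded bit
            double error = y * S - S1;
            for (size_t i = 0; i < w.size(); ++i) {
                int ni = n0[i] + n1[i];
                w[i] += (S * n1[i] - S1 * ni) * error / (S0 * S1);
                if (w[i] < 0) w[i] = 0;        // keep weights non-negative
            }
        }
    };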

Neural-network mixing

Beginning with PAQ7, each model outputs a prediction (instead of a pair of counts). These predictions are averaged in the logistic domain:

    xi = stretch(Pi(1))
    P(1) = squash(Σi wi xi)

where P(1) is the probability that the next bit will be a 1, Pi(1) is the probability estimated by the i-th model, and

    stretch(x) = ln(x / (1 − x))
    squash(x) = 1 / (1 + e^−x)   (the inverse of stretch).

After each prediction, the model is updated by adjusting the weights to minimize coding cost:

    wi ← wi + η xi (y − P(1))

where η is the learning rate (typically 0.002 to 0.01), y is the coded bit, and (y − P(1)) is the prediction error. The weight update algorithm differs from backpropagation in that the terms P(1)P(0) are dropped. This is because the goal of the neural network is to minimize coding cost, not root mean square error.
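
A minimal C++ sketch of this logistic mixing; the fixed learning rate and the plain std::vector interface are illustrative choices, not PAQ's tuned implementation:

    #include <cmath>
    #include <vector>

    // PAQ7-style logistic mixing: stretch each model's probability,
    // combine by weighted sum, squash back, then update the weights by
    // eta * x_i * (y - P(1)). Inputs must lie strictly in (0, 1).
    struct LogisticMixer {
        std::vector<double> w;                 // one weight per model
        std::vector<double> x;                 // stretched inputs of last prediction
        double p1 = 0.5;
        explicit LogisticMixer(int n) : w(n, 0.0), x(n) {}

        static double stretch(double p) { return std::log(p / (1 - p)); }
        static double squash(double s)  { return 1 / (1 + std::exp(-s)); }

        double predictP1(const std::vector<double>& probs) {  // probs[i] = Pi(1)
            double dot = 0;
            for (size_t i = 0; i < w.size(); ++i) {
                x[i] = stretch(probs[i]);
                dot += w[i] * x[i];
            }
            return p1 = squash(dot);
        }
        void update(int y) {                   // y = the coded bit
            const double eta = 0.002;          // typical learning rate
            for (size_t i = 0; i < w.size(); ++i)
                w[i] += eta * x[i] * (y - p1);
        }
    };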

Most versions of PAQ use a small context to select among sets of weights for the neural network. Some versions use multiple networks whose outputs are combined with one more network prior to the SSE stages. Furthermore, for each input prediction there may be several inputs which are nonlinear functions of Pi(1) in addition to stretch(Pi(1)).

Context modeling

Each model partitions the known bits of s into a set of contexts and maps each context to a bit history represented by an 8-bit state. In versions through PAQ6, the state represents a pair of counters (n0, n1). In PAQ7 and later versions, under certain conditions, the state also represents the value of the last bit or the entire sequence. The states are mapped to probabilities using a 256-entry table for each model. After a prediction by the model, the table entry is adjusted slightly (typically by 0.4%) to reduce the prediction error.

In all PAQ8 versions, the representable states are as follows:

- the exact bit sequence, for histories of up to 4 bits;
- a pair of counts plus an indication of the last bit, for sequences of 5 to 15 bits;
- a pair of counts only, for sequences of 16 to 41 bits.

To keep the number of states to 256, the following limits are placed on the representable counts: (41, 0), (40, 1), (12, 2), (5, 3), (4, 4), (3, 5), (2, 12), (1, 40), (0, 41). If a count exceeds this limit, then the next state is one chosen to have a similar ratio of n0 to n1. Thus, if the current state is (n0 = 4, n1 = 4, last bit = 0) and a 1 is observed, then the new state is not (n0 = 4, n1 = 5, last bit = 1). Rather, it is (n0 = 3, n1 = 4, last bit = 1).

Most context models are implemented as hash tables. Some small contexts are implemented as direct lookup tables.
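
A rough C++ sketch of one hashed context model of this kind; the hash function, the toy state transition, and the exact adaptation rate are placeholders for PAQ's tuned versions:

    #include <cstdint>
    #include <vector>

    // A hashed context model: a context hashes to a slot holding an 8-bit
    // bit-history state; a 256-entry table maps each state to a probability
    // that adapts slightly after every coded bit.
    struct ContextModel {
        std::vector<uint8_t> slots;            // slot -> 8-bit state
        float prob[256];                       // state -> P(1)
        uint32_t cur = 0;                      // slot used by the last prediction
        explicit ContextModel(size_t n) : slots(n) {
            for (int s = 0; s < 256; ++s) prob[s] = 0.5f;
        }
        float predict(uint64_t context) {
            cur = (uint32_t)((context * 0x9E3779B97F4A7C15ull) % slots.size());
            return prob[slots[cur]];
        }
        void update(int bit) {
            uint8_t s = slots[cur];
            prob[s] += (bit - prob[s]) * 0.004f;   // ~0.4% move toward the outcome
            // Toy transition: shift the bit into the state. PAQ instead steps
            // through the bounded (n0, n1, last-bit) state machine above.
            slots[cur] = (uint8_t)((s << 1) | bit);
        }
    };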

Text preprocessing

Some versions of PAQ, in particular PAsQDa, PAQAR (both PAQ6 derivatives), and PAQ8HP1 through PAQ8HP8 (PAQ8 derivatives and Hutter prize recipients) preprocess text files by looking up words in an external dictionary and replacing them with 1- to 3-byte codes. In addition, uppercase letters are encoded with a special character followed by the lowercase letter. In the PAQ8HP series, the dictionary is organized by grouping syntactically and semantically related words together. This allows models to use just the most significant bits of the dictionary codes as context.
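
As a simplified illustration of the dictionary transform (handling only a leading capital, with a hypothetical escape byte and code values; the real preprocessors use tuned multi-byte codes and fuller case handling):

    #include <cctype>
    #include <string>
    #include <unordered_map>

    // Replace a known word with its short dictionary code; a leading
    // capital becomes an escape byte followed by the lowercased word.
    std::string encodeWord(std::string word,
                           const std::unordered_map<std::string, std::string>& dict) {
        std::string out;
        if (!word.empty() && std::isupper((unsigned char)word[0])) {
            out += '\x01';                     // escape: first letter was uppercase
            word[0] = (char)std::tolower((unsigned char)word[0]);
        }
        auto it = dict.find(word);
        out += (it != dict.end() ? it->second : word);  // 1- to 3-byte code if known
        return out;
    }

For instance, with a hypothetical dictionary mapping "the" to a one-byte code, "The" would be emitted as the escape byte followed by that code.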

Comparison

The following table is a sample from the Large Text Compression Benchmark by Matt Mahoney. The benchmark file contains 10^9 bytes (1 GB, or 0.931 GiB) of English Wikipedia text.

Program   Compressed size (bytes)   % of original size   Compression time (ns/B)   Memory (MiB)
PAQ8HP8   133,423,109               13.34                64,639                    1,849
PPMd      183,976,014               18.4                 880                       256
bzip2     254,007,875               25.4                 379                       8
InfoZIP   322,649,703               32.26                104                       0.1

See Lossless compression benchmarks for a list of file compression benchmarks.

History

The following lists the major enhancements to the PAQ algorithm. In addition, there have been a large number of incremental improvements, which are omitted.

Hutter Prizes

The series PAQ8HP1 through PAQ8HP8 were released by Alexander Ratushnyak from August 21, 2006 through January 18, 2007 as Hutter Prize submissions. The Hutter Prize is a text compression contest using a 100 MB English and XML data set derived from Wikipedia's source. The PAQ8HP series was forked from PAQ8H. The programs include text preprocessing dictionaries and models tuned specifically to the benchmark. All non-text models were removed. The dictionaries were organized to group syntactically and semantically related words and to group words by common suffix. The former strategy improves compression because related words (which are likely to appear in similar context) can be modeled on the high order bits of their dictionary codes. The latter strategy makes the dictionary easier to compress. The size of the decompression program and compressed dictionary is included in the contest ranking.

On October 27, 2006, it was announced[5] that PAQ8HP5 won a Hutter Prize for Lossless Compression of Human Knowledge of €3,416.

On June 30, 2007, Ratushnyak's PAQ8HP12 was awarded a second Hutter prize of €1732,[6] improving upon his previous record by 3.46%.

PAQ derivations

Being free software, PAQ can be modified and redistributed by anyone who has a copy. This has allowed other authors to fork the PAQ compression engine and add new features such as a graphical user interface or better speed (at the expense of compression ratio). Notable PAQ derivatives include:

- WinUDA 0.291 and UDA 0.301, based on PAQ6 and PAQ8I respectively[7]
- KGB Archiver, based on PAQ6[8]
- EmilCont, based on PAQ6[9]
- LPAQ, a faster, lighter-weight PAQ variant by Matt Mahoney[10]
- PeaZip, a GUI front-end supporting LPAQ and several PAQ8 versions[11]
- PWCM (PAQ weighted context mixing), an independently developed closed-source implementation used in WinRK[12]
- PAQCompress, a GUI front-end for recent PAQ8 versions[13]
- PerfectCompress, which uses PAQ8 as one of its compression formats[14][15]
- FrontPAQ, a small GUI front-end for PAQ8PF and PAQ8PX[16]

See also

Related Research Articles


In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.


Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters will be stored with fewer bits and not-so-frequently occurring characters will be stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q, where 0.0 ≤ q < 1.0. It represents the current information as a range, defined by two numbers. A recent family of entropy coders called asymmetric numeral systems allows for faster implementations thanks to directly operating on a single natural number representing the current information.


bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

In computing, Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding. It was designed by Phil Katz, for version 2 of his PKZIP archiving tool. Deflate was later specified in RFC 1951 (1996).

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License. The LZMA SDK 4.62 was placed in the public domain in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 22.01.

Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream. PPM algorithms can also be used to cluster data into predicted groupings in cluster analysis.

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

WavPack is a free and open-source lossless audio compression format and an application implementing it. It is unusual in supporting a hybrid compression mode alongside normal lossless compression, but is otherwise similar to FLAC. It can compress a wide variety of lossless formats, including various variants of PCM and the DSD format used on SACDs, and it supports surround audio.

Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield a prediction that is often more accurate than any of the individual predictions. For example, one simple method is to average the probabilities assigned by each model. The random forest is another method: it outputs the prediction that is the mode of the predictions output by individual models. Combining models is an active area of research in machine learning.

The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression improvements on a specific 1 GB English text file, with the goal of encouraging research in artificial intelligence (AI).


PeaZip is a free and open-source file manager and file archiver for Microsoft Windows, ReactOS, Linux, MacOS and BSD by Giorgio Tani. It supports its native PEA archive format and other mainstream formats, with special focus on handling open formats. Version 9.4.0 supported 234 file extensions.

Dynamic Markov compression (DMC) is a lossless data compression algorithm developed by Gordon Cormack and Nigel Horspool. It uses predictive arithmetic coding similar to prediction by partial matching (PPM), except that the input is predicted one bit at a time. DMC has a good compression ratio and moderate speed, similar to PPM, but requires somewhat more memory and is not widely implemented. Some recent implementations include the experimental compression programs hook by Nania Francesco Antonio, ocamyd by Frank Schwellinger, and as a submodel in paq8l by Matt Mahoney. These are based on the 1993 implementation in C by Gordon Cormack.

Context-adaptive binary arithmetic coding (CABAC) is a form of entropy encoding used in the H.264/MPEG-4 AVC and High Efficiency Video Coding (HEVC) standards. It is a lossless compression technique, although the video coding standards in which it is used are typically for lossy compression applications. CABAC is notable for providing much better compression than most other entropy encoding algorithms used in video encoding, and it is one of the key elements that provides the H.264/AVC encoding scheme with better compression capability than its predecessors.

ZPAQ is an open source command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents.

Brotli is a lossless data compression algorithm developed by Google. It uses a combination of the general-purpose LZ77 lossless compression algorithm, Huffman coding and 2nd-order context modelling. Brotli is primarily used by web servers and content delivery networks to compress HTTP content, making internet websites load faster. A successor to gzip, it is supported by all major web browsers and has become increasingly popular, as it provides better compression than gzip.

JPEG XS is an interoperable, visually lossless, low-latency and lightweight image and video coding system used in professional applications. Applications of the standard include streaming high quality content for virtual reality, drones, autonomous vehicles using cameras, gaming, and broadcasting. In this respect, JPEG XS is unique, being the first ISO codec ever designed for this specific purpose. JPEG XS, built on core technology from both intoPIX and Fraunhofer IIS, is formally standardized as ISO/IEC 21122 by the Joint Photographic Experts Group with the first edition published in 2019. Although not official, the XS acronym was chosen to highlight the eXtra Small and eXtra Speed characteristics of the codec. Today, the JPEG committee is still actively working on further improvements to XS, with the second edition scheduled for publication and initial efforts being launched towards a third edition.

References

  1. "The Compression/SHA-1 Challenge". Mailcom.com. Retrieved 2010-05-19.
  2. "Homepage of the PAQ compressors". Retrieved 2007-07-10. "You may download, use, copy, modify, and distribute these programs under the terms of the GNU general public license."
  3. "Ubuntu Manpage: zpaq - PAQ open standard maximum compressor". manpages.ubuntu.com.
  4. "ZPAQ Level 1 Specification" (PDF). Retrieved 2010-09-03.
  5. James Bowery. "Alexander Ratushnyak Wins First Hutter Prize Payout". Published October 27, 2006. Retrieved October 30, 2006. [dead link]
  6. http://prize.hutter1.net/award2.gif [bare URL image file]
  7. dwing's homepage. Archived February 24, 2007, at the Wayback Machine.
  8. "KGB Archiver homepage". Kgbarchiver.net. Archived from the original on 2009-01-05. Retrieved 2010-05-19.
  9. "EmilCont Ultracompression". Freewebs.com. Archived from the original on 2010-09-10. Retrieved 2010-05-19.
  10. Matt Mahoney (2007). "LPAQ". Retrieved 2013-12-29.
  11. "PeaZip". PeaZip. Retrieved 2013-10-06.
  12. "Single file data compression benchmark, sorted on compression ratio". Maximumcompression.com. 2007-04-14. Archived from the original on 2009-04-17. Retrieved 2010-05-19.
  13. "PAQCompress". Moisés Cardona. 2019-01-10. Retrieved 2019-03-05.
  14. "PerfectCompress Official Website". Moises-studios.110mb.com. 2010-04-03. Retrieved 2010-05-19.
  15. "PerfectCompress Official Facebook Page". Facebook.com. Retrieved 2010-05-19.
  16. "FrontPAQ - GUI frontend for PAQ8PF and PAQ8PX". encode.su. Retrieved 2019-07-26.

Further reading