The Calgary corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms. It was created by Ian Witten, Tim Bell and John Cleary at the University of Calgary in 1987 and was widely used in the 1990s. In 1997 it was superseded by the Canterbury corpus,[1] created in response to concerns about how representative the Calgary corpus was,[2] but the Calgary corpus remains available and is still useful for its originally intended purpose.
In its most commonly used form, the corpus consists of 14 files totaling 3,141,622 bytes as follows.
Size (bytes) | File name | Description |
---|---|---|
111,261 | BIB | ASCII text in UNIX "refer" format – 725 bibliographic references. |
768,771 | BOOK1 | unformatted ASCII text – Thomas Hardy: Far from the Madding Crowd. |
610,856 | BOOK2 | ASCII text in UNIX "troff" format – Witten: Principles of Computer Speech. |
102,400 | GEO | 32 bit numbers in IBM floating point format – seismic data. |
377,109 | NEWS | ASCII text – USENET batch file on a variety of topics. |
21,504 | OBJ1 | VAX executable program – compilation of PROGP. |
246,814 | OBJ2 | Macintosh executable program – "Knowledge Support System" of B.R. Gaines. |
53,161 | PAPER1 | UNIX "troff" format – Witten, Neal, Cleary: Arithmetic Coding for Data Compression. |
82,199 | PAPER2 | UNIX "troff" format – Witten: Computer (in)security. |
513,216 | PIC | 1728 x 2376 bitmap image (MSB first): text in French and line diagrams. |
39,611 | PROGC | Source code in C – UNIX compress v4.0. |
71,646 | PROGL | Source code in Lisp – system software. |
49,379 | PROGP | Source code in Pascal – program to evaluate PPM compression. |
93,695 | TRANS | ASCII and control characters – transcript of a terminal session. |
There is also a less commonly used 18-file version, which adds four more text files in UNIX "troff" format, PAPER3 through PAPER6. The maintainers of the Canterbury corpus website note that "they don't add to the evaluation".[3]
The Calgary corpus was a commonly used benchmark for data compression in the 1990s. Results were most commonly listed in bits per byte (bpb) for each file and then summarized by averaging. More recently, it has been common to just add the compressed sizes of all of the files. This is called a weighted average because it is equivalent to weighting the compression ratios by the original file sizes. The UCLC benchmark [4] by Johan de Bock uses this method.
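As a minimal illustration of the two summaries, the sketch below computes an unweighted per-file bits-per-byte average and the total-size ("weighted") figure. The file sizes used here are hypothetical placeholders, not real benchmark results.

```python
# Hypothetical per-file results: name -> (original size, compressed size).
# The compressed sizes are placeholders for illustration only.
results = {
    "BIB":   (111_261, 25_000),
    "BOOK1": (768_771, 210_000),
    "GEO":   (102_400, 55_000),
}

# Unweighted summary: compute bits per byte (bpb) per file, then average.
bpb = {name: 8 * comp / orig for name, (orig, comp) in results.items()}
unweighted = sum(bpb.values()) / len(bpb)

# "Weighted" summary: add the compressed sizes, which is equivalent to
# weighting each file's compression ratio by its original size.
total_orig = sum(orig for orig, _ in results.values())
total_comp = sum(comp for _, comp in results.values())
weighted = 8 * total_comp / total_orig

print(f"unweighted average: {unweighted:.3f} bpb")
print(f"total-size (weighted) average: {weighted:.3f} bpb")
```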
For some data compressors it is possible to achieve a smaller compressed size by combining the inputs into an uncompressed archive (such as a tar file) before compression, because of mutual information between the text files. For other compressors the combined result is worse, because they handle nonuniform statistics poorly. This method was used in a benchmark in the online book Data Compression Explained by Matt Mahoney.[5]
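The comparison can be reproduced with any stream compressor. The sketch below, using Python's bz2 and tarfile modules, compresses each file separately and then the same files packed into an uncompressed tar archive; it assumes the 14 corpus files sit in a local directory named calgary (a hypothetical path). Depending on the compressor, either figure may come out smaller.

```python
import bz2
import io
import tarfile
from pathlib import Path

# Hypothetical location of the 14 corpus files; adjust to your copy.
corpus_dir = Path("calgary")
files = sorted(p for p in corpus_dir.iterdir() if p.is_file())

# Compress every file on its own and add up the compressed sizes.
separate_total = sum(len(bz2.compress(p.read_bytes(), 9)) for p in files)

# Pack the same files into an uncompressed tar archive in memory, then
# compress the archive as a single stream, so the compressor can exploit
# similarities (mutual information) between files.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for p in files:
        tar.add(str(p), arcname=p.name)
tar_total = len(bz2.compress(buffer.getvalue(), 9))

print(f"as separate files: {separate_total:,} bytes")
print(f"as one tar file:   {tar_total:,} bytes")
```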
The table below shows the compressed sizes of the 14 file Calgary corpus using both methods for some popular compression programs. Options, when used, select best compression. For a more complete list, see the above benchmarks.
Compressor | Options | As 14 separate files | As a tar file |
---|---|---|---|
Uncompressed | | 3,141,622 | 3,152,896 |
compress | | 1,272,772 | 1,319,521 |
Info-ZIP 2.32 | -9 | 1,020,781 | 1,023,042 |
gzip 1.3.5 | -9 | 1,017,624 | 1,022,810 |
bzip2 1.0.3 | -9 | 828,347 | 860,097 |
7-zip 9.12b | | 848,687 | 824,573 |
bzip3 1.1.8 | | 765,939 | 779,795 |
ppmd Jr1 | -m256 -o16 | 740,737 | 754,243 |
ppmonstr J | | 675,485 | 669,497 |
ZPAQ v7.15 | -method 5 | 659,709 | 659,853 |
The "Calgary corpus Compression and SHA-1 crack Challenge" [6] is a contest started by Leonid A. Broukhis on May 21, 1996 to compress the 14 file version of the Calgary corpus. The contest offers a small cash prize which has varied over time. Currently the prize is US $1 per 111 byte improvement over the previous result.
According to the rules of the contest, an entry must consist of both the compressed data and the decompression program packed into one of several standard archive formats. Time and memory limits, permitted archive formats, and decompression languages have been relaxed over time. Currently the program must run within 24 hours on a 2000 MIPS machine under Windows or Linux and use less than 800 MB of memory. An SHA-1 challenge was later added: it allows the decompression program to output files that differ from the Calgary corpus, as long as they hash to the same values as the original files. So far, that part of the challenge has not been met.
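The SHA-1 part of the challenge only requires that the reconstructed files hash to the same values as the originals. A minimal verification sketch, assuming the original corpus and a contest entry's output sit in two hypothetical local directories, might look like this:

```python
import hashlib
from pathlib import Path

def sha1_hex(path: Path) -> str:
    """Return the SHA-1 hex digest of a file's contents."""
    return hashlib.sha1(path.read_bytes()).hexdigest()

# Hypothetical directory names: the original corpus files and the files
# produced by an entry's decompression program.
original_dir = Path("calgary")
restored_dir = Path("restored")

for original in sorted(p for p in original_dir.iterdir() if p.is_file()):
    candidate = restored_dir / original.name
    ok = candidate.exists() and sha1_hex(candidate) == sha1_hex(original)
    print(f"{original.name}: {'OK' if ok else 'MISMATCH'}")
```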
The first entry received was 759,881 bytes in September 1997 by Malcolm Taylor, author of RK and WinRK. The most recent entry was 580,170 bytes by Alexander Ratushnyak on July 2, 2010. That entry consists of a compressed file of 572,465 bytes and a decompression program, written in C++ and compressed to 7,700 bytes as a PPMd var. I archive, plus 5 bytes for the compressed file name and size. The history is as follows.
Size (bytes) | Month/year | Author |
---|---|---|
759,881 | 09/1997 | Malcolm Taylor |
692,154 | 08/2001 | Maxim Smirnov |
680,558 | 09/2001 | Maxim Smirnov |
653,720 | 11/2002 | Serge Voskoboynikov |
645,667 | 01/2004 | Matt Mahoney |
637,116 | 04/2004 | Alexander Ratushnyak |
608,980 | 12/2004 | Alexander Ratushnyak |
603,416 | 04/2005 | Przemysław Skibiński |
596,314 | 10/2005 | Alexander Ratushnyak |
593,620 | 12/2005 | Alexander Ratushnyak |
589,863 | 05/2006 | Alexander Ratushnyak |
580,170 | 07/2010 | Alexander Ratushnyak |
A file archiver is a computer program that combines a number of files together into one archive file, or a series of archive files, for easier transportation or storage. File archivers may employ lossless data compression in their archive formats to reduce the size of the archive.
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.
gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, intended for use in the GNU Project. Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993.
Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.
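A quick way to see the "perfect reconstruction" property is a round trip through any lossless codec. The sketch below uses Python's zlib on an arbitrary repetitive sample string.

```python
import zlib

# Round trip through a lossless codec: the output is byte-for-byte
# identical to the input. The sample data is arbitrary repetitive text.
original = b"the quick brown fox jumps over the lazy dog " * 100
compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)

assert restored == original
print(f"{len(original)} bytes -> {len(compressed)} bytes, restored exactly")
```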
Portable Network Graphics is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF)—unofficially, the initials PNG stood for the recursive acronym "PNG's not GIF".
bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.
In computing, Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding. It was designed by Phil Katz, for version 2 of his PKZIP archiving tool. Deflate was later specified in RFC 1951 (1996).
ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common. This format was originally created in 1989 and was first implemented in PKWARE, Inc.'s PKZIP utility, as a replacement for the previous ARC compression format by Thom Henderson. The ZIP format was then quickly supported by many software utilities other than PKZIP. Microsoft has included built-in ZIP support in versions of Microsoft Windows since 1998 via the "Plus! 98" addon for Windows 98. Native support was added as of the year 2000 in Windows ME. Apple has included built-in ZIP support in Mac OS X 10.3 and later. Most free operating systems have built in support for ZIP in similar manners to Windows and Mac OS X.
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.
7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License. The LZMA SDK 4.62 was placed in the public domain in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 22.01.
Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream. PPM algorithms can also be used to cluster data into predicted groupings in cluster analysis.
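A heavily simplified sketch of the context-modeling idea behind PPM follows, ignoring the escape mechanism and the blending of different context orders: count which symbols follow each fixed-length context and predict the most frequent one. The class and method names are illustrative, not taken from any real PPM implementation.

```python
from collections import defaultdict

class ContextModel:
    """Toy fixed-order context model (illustrative, not full PPM)."""

    def __init__(self, order=2):
        self.order = order
        # context string -> {symbol -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        """Record that `symbol` followed the last `order` characters of `history`."""
        context = history[-self.order:]
        self.counts[context][symbol] += 1

    def predict(self, history):
        """Return the most frequently seen symbol for the current context, if any."""
        context = history[-self.order:]
        table = self.counts.get(context)
        if not table:
            return None
        return max(table, key=table.get)

model = ContextModel(order=2)
history = ""
for ch in "the theory of the thing":
    model.update(history, ch)
    history += ch

print(model.predict("th"))  # prints 'e', the symbol seen most often after "th"
```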
rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.
PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio. Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge. PAQ is free software distributed under the GNU General Public License.
Executable compression is any means of compressing an executable file and combining the compressed data with decompression code into a single executable. When this compressed executable is executed, the decompression code recreates the original code from the compressed code before executing it. In most cases this happens transparently so the compressed executable can be used in exactly the same way as the original. Executable compressors are often referred to as "runtime packers", "software packers", or "software protectors".
Snappy is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.
The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand and designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.
.CSO is a compression method for the ISO image format. It is used to compress dumped PlayStation Portable UMD games, and is an alternative to the .DAX compression method. It is also sometimes called "CISO".
The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression improvements on a specific 1 GB English text file, with the goal of encouraging research in artificial intelligence (AI).
ZPAQ is an open source command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents.
Zstandard, commonly known by the name of its reference implementation zstd, is a lossless data compression algorithm developed by Yann Collet at Facebook. The reference implementation is written in C; version 1 was released as open-source software on 31 August 2016.