ZPAQ

Last updated
ZPAQ
Developer(s) Matt Mahoney
Stable release
7.15 [1]   OOjs UI icon edit-ltr-progressive.svg / 22 September 2016
Repository
Written in C++
Operating system Microsoft Windows, Linux
Platform IA-32, x86-64
Type File archiver
License MIT, Public domain
Website mattmahoney.net/dc/zpaq.html   OOjs UI icon edit-ltr-progressive.svg

ZPAQ is an open source command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms (LZ77, BWT, and context mixing) depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents.

Contents

Archive format

Files are saved in the ZPAQ level 2 journaling format. [2] The standard defines two formats - streaming and journaling. Only the journaling format supports deduplication, directory attributes, and multiple dated file versions.

The streaming archive format is designed to be extracted in a single pass. An archive is divided into a sequence of blocks that can be decompressed independently in parallel. Blocks are divided into segments that must be decompressed sequentially in order. Each block header contains a description of the decompression algorithm. Each segment has a header containing an optional file name and an optional comment for meta-data such as size, date, and attributes, and an optional trailing SHA-1 checksum of the original data for integrity checking. If the file name is omitted, it is assumed to be a continuation of the last named file, which may be in the previous block. Thus, inserting, removing, or reordering the blocks in a streaming archive has the effect of performing the same operations on the data that the blocks represent.

The journaling format consists of a sequence of transactions, or updates. An update contains 4 types of blocks: a transaction header block, a sequence of data blocks, a corresponding sequence of fragment tables, and a sequence of index blocks. A transaction header block contains the transaction date and a pointer skipping over the data blocks to allow the archive index to be read quickly. The data blocks contain a sequence of file fragments compressed together. The fragment tables give the size and SHA-1 hash of each fragment. The index blocks contain a list of edits to the global archive index. An edit is either a file update or a file deletion. An update includes a file name, last modified date, attributes, and a list of fragment pointers into the current and previous transactions. Fragments may be shared by more than one file. A deletion does not remove any data from the archive, but rather indicates that the file is not to be extracted unless the archive is rolled back to an earlier date.

The ZPAQ standard does not specify a compression algorithm. Rather, it specifies a format for representing the decompression algorithm in the block headers. Decompression algorithms are written in a language called ZPAQL and stored as a byte code which can either be interpreted or converted directly to 32 or 64 bit x86 code and executed. A ZPAQL program has 3 parts.

The COMP models are based on PAQ, which compresses one bit at a time using arithmetic coding. There are 9 types of components. Each component takes a context and possibly the predictions of earlier components, and outputs a prediction or probability that the next bit will be a 1. The output of the last component is arithmetic coded. The component types are:

The HCOMP section computes the contexts for the components in the COMP section. It is a virtual machine whose state is 4 32-bit registers (A, B, C, D), a 16 bit program counter, a condition flag bit, and two memory arrays, one of bytes (M) and one of 32 bit words (H). The beginning of H forms the array of contexts. An assembly language-like program is called once for each coded or decoded byte with that byte as input in A. The final context seen by the COMP section is the computed context combined with the previously seen bits of the current byte.

The optional PCOMP section is used for post-processing the decoded data. It runs in a separate virtual machine like that of HCOMP. However, unlike the COMP and HCOMP sections which are used for both compression and decompression, the PCOMP section is run only during decompression. The compressor is responsible for performing the inverse operation on the input data prior to coding.

ZPAQL Example

ZPAQL source code uses a textual syntax, with each space-delimited word assembling to one byte in most cases, and comments in parentheses. The following example is the mid configuration, similar to level 5 compression. It describes an ICM-ISSE chain of components taking hashed contexts of orders 0 through 5, a MATCH taking an order 7 context, and as a final step, averaging these bit predictions using a MIX. There is no post-processing.

comp 3 3 0 0 8 (hh hm ph pm n)    0 icm 5      (order 0...5 chain)    1 isse 13 0    2 isse 17 1    3 isse 18 2    4 isse 18 3    5 isse 19 4    6 match 22 24  (order 7)    7 mix 16 0 7 24 255  (order 1)  hcomp    c++ *c=a b=c a=0 (save in rotating buffer M)    d= 1 hash *d=a   (orders 1...5 for isse)    b-- d++ hash *d=a    b-- d++ hash *d=a    b-- d++ hash *d=a    b-- d++ hash *d=a    b-- d++ hash b-- hash *d=a (order 7 for match)    d++ a=*c a<<= 8 *d=a       (order 1 for mix)    halt  end

The COMP parameters describe the log base 2 sizes of the word and byte arrays (hh, hm), 8 bytes each in the HCOMP section and not used in the PCOMP section. There are n = 8 numbered components. The components take parameters describing their table sizes and inputs. In particular, each ISSE takes its input from the previous component, and the MIX takes input from the 7 components starting at 0. The line "5 isse 19 4" says that the ISSE has a table size of 219+6 bit histories and takes its input from component 4.

In the HCOMP section, registers B and C point into the 8 byte rotating array M, and D points to the 8 word array H. M is used to store the last 8 bytes of input from the A register. C points to the head of this buffer. The HASH instruction computes:

 a = (a + *b + 512) * 773;

Thus, the code stores context hashes of various orders in H[0]...H[7].

Deduplication

On update, ZPAQ divides the input files into fragments, computes their SHA-1 hashes, and compares them to the hashes stored in the archive. If there is a match, then the fragments are assumed to be identical, and only a pointer to the previously compressed fragment is stored. Otherwise the fragment is packed into a block to be compressed. Block sizes can be up to 16 MiB to 64 MiB depending on the compression level.

Files are divided into fragments on content-dependent boundaries. Rather than a Rabin fingerprint, ZPAQ uses a rolling hash that depends on the last 32 bytes that are not predicted by an order 1 context, plus any predicted bytes in between. If the leading 16 bits of the 32 bit hash are all 0, then a fragment boundary is marked. This gives an average fragment size of 64 KiB.

The rolling hash uses a 256 byte table containing the byte last seen in each possible order-1 context. The hash is updated by adding the next byte and then multiplying either by an odd constant if the byte was predicted or by an even number that is not a multiple of 4 if the byte was not predicted.

Compression

ZPAQ has 5 compression levels, from fastest to best. At all but the best level, it uses the statistics of the order-1 prediction table used for deduplication to test whether the input appears random. If so, it is stored without compression as a speed optimization.

ZPAQ will use an E8E9 transform (see: BCJ) to improve the compression of x86 code typically found in .exe and .dll files. An E8E9 transform scans for CALL and JMP instructions (opcodes E8 and E9 hex) and replaces their relative addresses with absolute addresses. Then it inserts code into the PCOMP section to perform the inverse transform.

Error recovery

ZPAQ lacks error correction but has several features that limit damage if the archive is corrupted. On decompression, all SHA-1 hashes are checked. If the hash does not match or if some other error occurs, then a warning is printed and the block is ignored. Blocks begin with a 13 byte "locator tag" containing a randomly chosen but fixed string to allow start of the next block to be found by scanning. If a data fragment is lost, then all of the files referencing that fragment and the remaining fragments in the block are also lost. If a fragment table is lost, then it can be recovered from a redundant list of fragment sizes stored in the corresponding data block and by recomputing the hashes. In this case, a second hash of the whole data block is checked. If an index block is lost, then the corresponding files are lost. Index blocks are small (16 KiB) in order to limit damage.

Updates are transacted by appending a temporary transaction header and then updating the header as the last step. If an update is interrupted, then the temporary header signals ZPAQ that no useful data is found after it. The next update will overwrite this excess data.

Basic usage

Creating an archive, and updating an archive

zpaq add directory/archive.zpaq directory/source_directory -mX -key password

The options -mX (with X being the compression level from 0 to 5) and -key (which performs AES-256 encryption) can be omitted. The 0 compression level does not compress data, but still carries out data deduplication. The compression levels 4 and 5 can be very time-consuming. The default (1) uses simple LZ77 compression.

Listing archive contents

zpaq list archive.zpaq lists the files and directories of the most recent version. Adding -all will list all versions of all files and directories, in the format version_number/directory/file_name. The output can be further processed with grep and other tools.

Extracting files

zpaq extract archive.zpaq will un-pack the last version of the entire archive in the active directory. zpaq extract backup.zpaq path will only extract the specified directory (or file). Appending the -until N option selects the version, where negative numbers are allowed. -2 would extract the third most recent version of the archive. The optional -to tells ZPAQ where to save the extracted files.

zpaq extract backup.zpaq -all -only "*muppet*" will extract all versions of all files and directories whose name contains "muppet". Different file versions will be placed in different directories (0001/ 0002/ 0003/ et cetera). -only is optional.

History

Related Research Articles

gzip GNU file compression/decompression tool

gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and intended for use by GNU. Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

zlib DEFLATE codec library

zlib is a software library used for data compression as well as a data format. zlib was written by Jean-loup Gailly and Mark Adler and is an abstraction of the DEFLATE compression algorithm used in their gzip file compression program. zlib is also a crucial component of many software platforms, including Linux, macOS, and iOS. It has also been used in gaming consoles such as the PlayStation 4, PlayStation 3, Wii U, Wii, Xbox One and Xbox 360.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

In computing, Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding. It was designed by Phil Katz, for version 2 of his PKZIP archiving tool. Deflate was later specified in RFC 1951 (1996).

<span class="mw-page-title-main">FLAC</span> Lossless digital audio coding format

FLAC is an audio coding format for lossless compression of digital audio, developed by the Xiph.Org Foundation, and is also the name of the free software project producing the FLAC tools, the reference software package that includes a codec implementation. Digital audio compressed by FLAC's algorithm can typically be reduced to between 50 and 70 percent of its original size and decompresses to an identical copy of the original audio data.

ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common. This format was originally created in 1989 and was first implemented in PKWARE, Inc.'s PKZIP utility, as a replacement for the previous ARC compression format by Thom Henderson. The ZIP format was then quickly supported by many software utilities other than PKZIP. Microsoft has included built-in ZIP support in versions of Microsoft Windows since 1998 via the "Plus! 98" addon for Windows 98. Native support was added as of the year 2000 in Windows ME. Apple has included built-in ZIP support in Mac OS X 10.3 and later. Most free operating systems have built in support for ZIP in similar manners to Windows and macOS.

The BMP file format or bitmap, is a raster graphics image file format used to store bitmap digital images, independently of the display device, especially on Microsoft Windows and OS/2 operating systems.

RAR is a proprietary archive file format that supports data compression, error correction and file spanning. It was developed in 1993 by Russian software engineer Eugene Roshal and the software is licensed by win.rar GmbH. The name RAR stands for Roshal Archive.

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License. The LZMA SDK 4.62 was placed in the public domain in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 23.01.

<span class="mw-page-title-main">Smacker video</span> Digital video file format

Smacker video is a video file format developed by Epic Games Tools, and primarily used for full-motion video in video games. Smacker uses an adaptive 8-bit RGB palette. RAD's format for video at higher color depths is Bink Video. The Smacker format specifies a container format, a video compression format, and an audio compression format. Since its release in 1994, Smacker has been used in over 2300 games. Blizzard used this format for the cinematic videos seen in its games Warcraft II, StarCraft and Diablo I.

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

<span class="mw-page-title-main">LHA (file format)</span>

LHA or LZH is a freeware compression utility and associated file format. It was created in 1988 by Haruyasu Yoshizaki, a doctor and originally named LHarc. A complete rewrite of LHarc, tentatively named LHx, was eventually released as LH. It was then renamed to LHA to avoid conflicting with the then-new MS-DOS 5.0 LH command. The original LHA and its Windows port, LHA32, are no longer in development because Yoshizaki is busy at work.

PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio. Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge. PAQ is free software distributed under the GNU General Public License.

The Apple Icon Image format (.icns) is an icon format used in Apple Inc.'s macOS. It supports icons of 16 × 16, 32 × 32, 48 × 48, 128 × 128, 256 × 256, 512 × 512 points at 1x and 2x scale, with both 1- and 8-bit alpha channels and multiple image states. The fixed-size icons can be scaled by the operating system and displayed at any intermediate size.

<span class="mw-page-title-main">PeaZip</span> File archive computer program

PeaZip is a free and open-source file manager and file archiver for Microsoft Windows, ReactOS, Linux, MacOS and BSD by Giorgio Tani. It supports its native PEA archive format and other mainstream formats, with special focus on handling open formats. Version 9.4.0 supported 234 file extensions.

Dynamic Markov compression (DMC) is a lossless data compression algorithm developed by Gordon Cormack and Nigel Horspool. It uses predictive arithmetic coding similar to prediction by partial matching (PPM), except that the input is predicted one bit at a time. DMC has a good compression ratio and moderate speed, similar to PPM, but requires somewhat more memory and is not widely implemented. Some recent implementations include the experimental compression programs hook by Nania Francesco Antonio, ocamyd by Frank Schwellinger, and as a submodel in paq8l by Matt Mahoney. These are based on the 1993 implementation in C by Gordon Cormack.

XZ Utils is a set of free software command-line lossless data compressors, including the programs lzma and xz, for Unix-like operating systems and, from version 5.0 onwards, Microsoft Windows. For compression/decompression the Lempel–Ziv–Markov chain algorithm (LZMA) is used. XZ Utils started as a Unix port of Igor Pavlov's LZMA-SDK that has been adapted to fit seamlessly into Unix environments and their usual structure and behavior.

Compact File Set (CFS) is an open archive file format and software distribution container file format.

References

  1. "Release 7.15". 22 September 2016. Retrieved 15 March 2018.
  2. Mahoney, Matt (3 June 2013). "The ZPAQ Open Standard for Highly Compressed Data - Level 2" (PDF). Retrieved 28 May 2023.
  3. Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE 8(3): e59190. doi:10.1371/journal.pone.0059190
  4. "[WCX] ZPAQ". Total Commander Forums. Retrieved 10 July 2021.