In computing, Deflate (stylized as DEFLATE, and also called Flate [1] [2] ) is a lossless data compression file format that uses a combination of LZ77 and Huffman coding. It was designed by Phil Katz, for version 2 of his PKZIP archiving tool. Deflate was later specified in RFC 1951 (1996). [3]
Katz also designed the original algorithm used to construct Deflate streams. This algorithm was patented as U.S. patent 5,051,745 , and assigned to PKWARE, Inc. [4] [5] As stated in the RFC document, an algorithm producing Deflate files was widely thought to be implementable in a manner not covered by patents. [3] This led to its widespread use – for example, in gzip compressed files and PNG image files, in addition to the ZIP file format for which Katz originally designed it. The patent has since expired.
A Deflate stream consists of a series of blocks. Each block is preceded by a 3-bit header:
1
: This is the last block in the stream.0
: There are more blocks to process after this one.00
: A stored (a.k.a. raw or literal) section, between 0 and 65,535 bytes in length01
: A static Huffman compressed block, using a pre-agreed Huffman tree defined in the RFC10
: A dynamic Huffman compressed block, complete with the Huffman table supplied11
: Reserved—don't use.The stored block option adds minimal overhead and is used for data that is incompressible.
Most compressible data will end up being encoded using method 10
, the dynamic Huffman encoding, which produces an optimized Huffman tree customized for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code.
Compression is achieved through two steps:
Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the beginning of the duplicate. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 KiB of uncompressed data decoded (termed the sliding window).
If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of 10 identical bytes can be encoded as one byte, followed by a duplicate of length 9, beginning with the previous byte.
Searching the preceding text for duplicate substrings is the most computationally expensive part of the DEFLATE algorithm, and the operation which compression level settings affect.
The second compression stage consists of replacing commonly used symbols with shorter representations and less commonly used symbols with longer representations. The method used is Huffman coding which creates an unprefixed tree of non-overlapping intervals, where the length of each sequence is inversely proportional to the logarithm of the probability of that symbol needing to be encoded. The more likely it is that a symbol has to be encoded, the shorter its bit-sequence will be.
A tree is created, containing space for 288 symbols:
A match length code will always be followed by a distance code. Based on the distance code read, further "extra" bits may be read in order to produce the final distance. The distance tree contains space for 32 symbols:
Note that for the match distance symbols 2–29, the number of extra bits can be calculated as .
The two codes (the 288-symbol length/literal tree and the 32-symbol distance tree) are themselves encoded as canonical Huffman codes by giving the bit length of the code for each symbol. The bit lengths are themselves run-length encoded to produce as compact a representation as possible. As an alternative to including the tree representation, the "static tree" option provides standard fixed Huffman trees. The compressed size using the static trees can be computed using the same statistics (the number of times each symbol appears) as are used to generate the dynamic trees, so it is easy for a compressor to choose whichever is smaller.
During the compression stage, it is the encoder that chooses the amount of time spent looking for matching strings. The zlib/gzip reference implementation allows the user to select from a sliding scale of likely resulting compression-level vs. speed of encoding. Options range from 0
(do not attempt compression, just store uncompressed) to 9
representing the maximum capability of the reference implementation in zlib/gzip.
Other Deflate encoders have been produced, all of which will also produce a compatible bitstream capable of being decompressed by any existing Deflate decoder. Differing implementations will likely produce variations on the final encoded bit-stream produced. The focus with non-zlib versions of an encoder has normally been to produce a more efficiently compressed and smaller encoded stream.
Deflate64, specified by PKWARE, is a proprietary variant of Deflate. It's fundamentally the same algorithm. What has changed is the increase in dictionary size from 32 KB to 64 KB, an extension of the distance codes to 16 bits so that they may address a range of 64 KB, and the length code, which is extended to 16 bits so that it may define lengths of three to 65,538 bytes. [6] This leads to Deflate64 having a longer compression time, and potentially a slightly higher compression ratio, than Deflate. [7] Several free and/or open source projects support Deflate64, such as 7-Zip, [8] while others, such as zlib, do not, as a result of the proprietary nature of the procedure [9] and the very modest performance increase over Deflate. [10]
Implementations of Deflate are freely available in many languages. Apps written in C typically use the zlib library (under the permissive zlib License). Apps in Borland Pascal (and compatible languages) can use paszlib. Apps in C++ can take advantage of the improved Deflate library in 7-Zip. Both Java and .NET Framework offer out-of-the-box support for Deflate in their libraries (respectively, java.util.zip
and System.IO.Compression). Apps in Ada can use Zip-Ada (pure) or ZLib-Ada.
AdvanceCOMP uses the higher compression ratio versions of Deflate in 7-Zip, libdeflate, and Zopfli to enable recompression of gzip, PNG, MNG and ZIP files with the possibility of smaller file sizes than zlib is able to achieve at maximum settings. [14]
193f:0001
) capable of compressing streams using Deflate at a rate of up to 3.0 Gbit/s (375 MB/s) for incoming uncompressed data. Accompanying the Linux kernel driver for the AHA361-PCIX is an "ahagzip
" utility and customised "mod_deflate_aha
" capable of using the hardware compression from Apache. The hardware is based on a Xilinx Virtex FPGA and four custom AHA3601 ASICs. The AHA361/AHA362 boards are limited to only handling static Huffman blocks and require software to be modified to add support — the cards were not able to support the full Deflate specification, meaning they could only reliably decode their own output (a stream that did not contain any dynamic Huffman type 2 blocks).17b4:0011
) or PCI-X cards featuring between one and six compression engines with claimed processing speeds of up to 3.6 Gbit/s (450 MB/s). A version of the cards are available with the separate brand WebEnhance specifically designed for web-serving use rather than SAN or backup use; a PCIe revision, the MX4E is also produced.PCI-ID: 193f:0363
/193f:0364
) with a new hardware AHA3610 encoder chip. The new chip was designed to be capable of a sustained 2.5 Gbit/s. Using two of these chips, the AHA363-PCIe board can process Deflate at a rate of up to 5.0 Gbit/s (625 MB/s) using the two channels (two compression and two decompression). The AHA364-PCIe variant is an encode-only version of the card designed for out-going load balancers and instead has multiple register sets to allow 32 independent virtual compression channels feeding two physical compression engines. Linux, Microsoft Windows, and OpenSolaris kernel device drivers are available for both of the new cards, along with a modified zlib system library so that dynamically linked applications can automatically use the hardware support without internal modification. The AHA367-PCIe board (PCI-ID: 193f:0367
) is similar to the AHA363-PCIe but uses four AHA3610 chips for a sustained compression rate of 10 Gbit/s (1250 MB/s). Unlike the AHA362-PCIX, the decompression engines on the AHA363-PCIe and AHA367-PCIe boards are fully deflate compliant.Inflate is the decoding process that takes a Deflate bitstream for decompression and correctly produces the original full-size data or file.
The normal intent with an alternative Inflate implementation is highly optimized decoding speed, or extremely predictable RAM usage for micro-controller embedded systems.
PCDEZIP
, Bob Flanders and Michael Holmes, published in PC Magazine 1994-01-11.gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and intended for use by GNU. Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993.
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.
Run-length encoding (RLE) is a form of lossless data compression in which runs of data are stored as a single occurrence of that data value and a count of its consecutive occurrences, rather than as the original run. As an imaginary example of the concept, when encoding an image built up from colored dots, the sequence "green green green green green green green green green" is shortened to "green x 9". This is most efficient on data that contains many such runs, for example, simple graphic images such as icons, line drawings, games, and animations. For files that do not have many runs, encoding them with RLE could increase the file size.
zlib is a software library used for data compression as well as a data format. zlib was written by Jean-loup Gailly and Mark Adler and is an abstraction of the DEFLATE compression algorithm used in their gzip file compression program. zlib is also a crucial component of many software platforms, including Linux, macOS, and iOS. It has also been used in gaming consoles such as the PlayStation 4, PlayStation 3, Wii U, Wii, Xbox One and Xbox 360.
bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.
LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for many variations including LZW, LZSS, LZMA and others. Besides their academic influence, these algorithms formed the basis of several ubiquitous compression schemes, including GIF and the DEFLATE algorithm used in PNG and ZIP.
ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common. This format was originally created in 1989 and was first implemented in PKWARE, Inc.'s PKZIP utility, as a replacement for the previous ARC compression format by Thom Henderson. The ZIP format was then quickly supported by many software utilities other than PKZIP. Microsoft has included built-in ZIP support in versions of Microsoft Windows since 1998 via the "Plus! 98" addon for Windows 98. Native support was added as of the year 2000 in Windows ME. Apple has included built-in ZIP support in Mac OS X 10.3 and later. Most free operating systems have built in support for ZIP in similar manners to Windows and macOS.
compress is a Unix shell compression program based on the LZW compression algorithm. Compared to gzip's fastest setting, compress is slightly slower at compression, slightly faster at decompression, and has a significantly lower compression ratio. 1.8 MiB of memory is used to compress the Hutter Prize data, slightly more than gzip's slowest setting.
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.
7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License. The LZMA SDK 4.62 was placed in the public domain in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 24.08.
rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.
Snappy is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.
HTTP compression is a capability that can be built into web servers and web clients to improve transfer speed and bandwidth utilization.
mod_deflate is an optional module for the Apache HTTP Server, Apache v2.0 and later. It is based on Deflate lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. This module provides the DEFLATE output filter that allows output from Apache HTTP server to be compressed before being sent to the client over the network. It also provides a filter for decompressing a gzip compressed response body.
liblzg is a compression library for performing lossless data compression. It implements an algorithm that is a variation of the LZ77 algorithm, called the LZG algorithm, with the primary focus of providing a very simple and fast decoding method. One of the key features of the algorithm is that it requires no memory during decompression. The software library is free software, distributed under the zlib license.
LZ4 is a lossless data compression algorithm that is focused on compression and decompression speed. It belongs to the LZ77 family of byte-oriented compression schemes.
Brotli is a lossless data compression algorithm developed by Google. It uses a combination of the general-purpose LZ77 lossless compression algorithm, Huffman coding and 2nd-order context modelling. Brotli is primarily used by web servers and content delivery networks to compress HTTP content, making internet websites load faster. A successor to gzip, it is supported by all major web browsers and has become increasingly popular, as it provides better compression than gzip.
Zstandard is a lossless data compression algorithm developed by Yann Collet at Facebook. Zstd is the corresponding reference implementation in C, released as open-source software on 31 August 2016.
842, 8-4-2, or EFT is a data compression algorithm. It is a variation on Lempel–Ziv compression with a limited dictionary length. With typical data, 842 gives 80 to 90 percent of the compression of LZ77 with much faster throughput and less memory use. Hardware implementations also provide minimal use of energy and minimal chip area.
Package flate implements the DEFLATE compressed data format, described in RFC issue 1951.
FlateDecode [...] Decompresses data encoded using the zlib/deflate compression method
{{cite web}}
: CS1 maint: bot: original URL status unknown (link){{cite web}}
: CS1 maint: bot: original URL status unknown (link)appnote.txt
, .ZIP File Format Specification Archived 2014-12-05 at the Wayback Machine ; Section 10, X. Deflating – Method 8.