Gzip

gzip (software)
Original author(s) Jean-loup Gailly, Mark Adler
Developer(s) GNU Project
Initial release 31 October 1992
Stable release 1.13 [1] / 19 August 2023
Repository git.savannah.gnu.org/cgit/gzip.git
Written in C
Operating system Unix-like, Plan 9, Inferno
Type Data compression
License GPL-3.0-or-later
Website www.gnu.org/software/gzip/

gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and was intended for use by GNU (the source of the "g" in gzip). Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993.

Decompression of the gzip format can be implemented as a streaming algorithm, an important feature for Web protocols, data interchange, and ETL applications (for example, in standard pipes).
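As a minimal sketch of what streaming decompression means in practice, the following Python example feeds compressed bytes to a decoder in small chunks and emits output as soon as it is available, so the whole file never has to sit in memory. The helper name iter_gunzip and the 64-byte chunk size are illustrative assumptions, and the sketch handles a single gzip member only.

```python
import gzip
import zlib

def iter_gunzip(chunks):
    """Yield decompressed pieces as compressed chunks arrive (hypothetical helper)."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        piece = decoder.decompress(chunk)
        if piece:
            yield piece
    tail = decoder.flush()
    if tail:
        yield tail

original = b"streaming keeps memory use flat\n" * 1000
compressed = gzip.compress(original)

# Simulate a pipe or socket that delivers 64 bytes at a time.
chunks = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
restored = b"".join(iter_gunzip(chunks))
print(len(compressed), "compressed bytes ->", len(restored), "bytes")
```

Because the decoder only keeps its sliding window and current state, the same loop works whether the input is a few kilobytes or far larger than available memory.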

File format

gzip (file format)
Filename extension .gz
Internet media type application/gzip [2]
Uniform Type Identifier (UTI) org.gnu.gnu-zip-archive
Magic number 1f 8b
Developed by Jean-loup Gailly and Mark Adler
Type of format Data compression
Open format? Yes
Website gzip.org (obsolete)

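gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. DEFLATE was intended as a replacement for LZW and other patent-encumbered data compression algorithms which, at the time, limited the usability of the compress utility and other popular archivers.

"gzip" is often also used to refer to the gzip file format, which consists of a 10-byte header containing the magic number (1f 8b), the compression method (08 for DEFLATE), one byte of header flags, a 4-byte timestamp, compression flags, and the operating system ID; optional extra headers, such as the original filename; a body containing a DEFLATE-compressed payload; and an 8-byte trailer containing a CRC-32 checksum and the length of the original uncompressed data, modulo 2^32. [3]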

gzip can be combined with the tar program to compress multiple files.

Although the file format also allows multiple gzip streams to be concatenated (decompressing the result yields the members' contents joined together, as if they had originally been one file), [5] gzip is normally used to compress single files. [6] Compressed archives are typically created by assembling collections of files into a single tar archive (also called a tarball), [7] and then compressing that archive with gzip. The final compressed file usually has the extension .tar.gz or .tgz.
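The container layout and the concatenation behaviour can be checked with a short Python sketch. It assumes the fixed field layout given in RFC 1952 (a 10-byte header starting with the magic number 1f 8b, and an 8-byte trailer holding a CRC-32 and the uncompressed length); the helper name describe_member is hypothetical, and optional header fields are ignored.

```python
import gzip
import struct
import zlib

def describe_member(blob):
    """Decode the fixed header and trailer of one gzip member (hypothetical helper)."""
    # Header: magic (2 bytes), method, flags, mtime (4 bytes), extra flags, OS id.
    magic, method, flags, mtime, extra_flags, os_id = struct.unpack("<2sBBIBB", blob[:10])
    # Trailer: CRC-32 of the uncompressed data, then its length modulo 2**32.
    crc32, size = struct.unpack("<II", blob[-8:])
    return {"magic": magic.hex(), "method": method, "crc32": crc32, "size": size}

first = b"first member\n"
second = b"second member\n"
member_one = gzip.compress(first)
member_two = gzip.compress(second)

info = describe_member(member_one)
print(info["magic"], info["method"], info["size"])

# Two complete gzip members back to back decompress as one continuous stream.
combined = member_one + member_two
restored = gzip.decompress(combined)
print(restored)
```

The same effect is what makes it possible to append one .gz file to another on the command line and still decompress the result in a single pass.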

gzip is not to be confused with the ZIP archive format, which also uses DEFLATE. The ZIP format can hold collections of files without an external archiver, but is less compact than compressed tarballs holding the same data, because it compresses files individually and cannot take advantage of redundancy between files, as solid compression does. The gzip file format is also not to be confused with that of the compress utility, based on LZW, with extension .Z; however, the gunzip utility is able to decompress .Z files. [8]
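The effect of redundancy between files can be illustrated with a deliberately extreme Python sketch: twenty files that all hold the same block of hard-to-compress bytes. The file names, sizes, and helper functions are assumptions chosen for the demonstration; real archives will show a smaller gap.

```python
import io
import random
import tarfile
import zipfile

rng = random.Random(0)
# One block of pseudo-random bytes, reused as the content of every file.
shared = bytes(rng.getrandbits(8) for _ in range(1000))
files = {f"copy_{i:02d}.bin": shared for i in range(20)}

def zip_size(members):
    """Size of a ZIP archive that deflates each member on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return len(buf.getvalue())

def tar_gz_size(members):
    """Size of a tar archive compressed as a single gzip stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, data in members.items():
            entry = tarfile.TarInfo(name)
            entry.size = len(data)
            archive.addfile(entry, io.BytesIO(data))
    return len(buf.getvalue())

zip_bytes = zip_size(files)
tar_gz_bytes = tar_gz_size(files)
print("zip:", zip_bytes, "bytes; tar.gz:", tar_gz_bytes, "bytes")
```

In the ZIP case each copy is compressed without knowledge of the others, while in the tarball every copy after the first falls inside DEFLATE's window and is encoded as a back-reference.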

Implementations

NetBSD Gzip / FreeBSD Gzip
Developer(s) The NetBSD Foundation
Repository cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/gzip/
Written in C
Type Data compression
License Simplified BSD License

Various implementations of the program have been written. The best known is the GNU Project's implementation, which uses Lempel-Ziv coding (LZ77). OpenBSD's version of gzip is actually the compress program, to which support for the gzip format was added in OpenBSD 3.4; the "g" in this version stands for gratis. [9] FreeBSD, DragonFly BSD and NetBSD use a BSD-licensed implementation instead of the GNU version; it is a command-line interface for zlib intended to be compatible with the GNU implementation's options. [10] These implementations originally come from NetBSD, and support decompression of bzip2 and the Unix pack format.

Zopfli is an alternative compression program that achieves 3-8% better compression. It produces gzip-compatible output using more exhaustive algorithms, at the expense of compression time; decompression time is not affected.

pigz, written by Mark Adler, is compatible with gzip and speeds up compression by using all available CPU cores and threads. [11]

Damage recovery

Data in blocks before the first damaged part of the archive is usually fully readable. Data in later blocks that are not themselves damaged may be recoverable, but only through difficult workarounds. [12]
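The first point can be sketched in Python for the simplest kind of damage, a file that was cut short. The helper name salvage_prefix and the 256-byte read size are illustrative assumptions. Corruption in the middle of a stream is harder: the decoder may raise an error or emit wrong bytes before it notices, and the CRC-32 in the trailer can no longer be used to verify the result.

```python
import gzip
import zlib

def salvage_prefix(damaged):
    """Return whatever decompresses before the data runs out or an error occurs."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
    recovered = bytearray()
    try:
        for start in range(0, len(damaged), 256):
            recovered += decoder.decompress(damaged[start:start + 256])
    except zlib.error:
        # Stop at the first point the decoder cannot make sense of.
        pass
    return bytes(recovered)

original = b"".join(b"record %06d\n" % i for i in range(20000))
compressed = gzip.compress(original)

# Simulate damage by dropping the second half of the compressed file.
truncated = compressed[: len(compressed) // 2]
recovered = salvage_prefix(truncated)
print("recovered", len(recovered), "of", len(original), "bytes")
```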

Derivatives and other uses

The tar utility included in most Linux distributions can extract .tar.gz files when passed the z option, e.g., tar -zxf file.tar.gz, where -z selects gzip decompression, -x means extraction, and -f specifies the name of the compressed archive file to extract from. Optionally, -v (verbose) lists files as they are being extracted. [13]
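For comparison, a rough Python counterpart of tar -zxf can be written with the standard tarfile module. This is a sketch under stated assumptions: the helper name extract_tar_gz and the sample member docs/hello.txt are made up for the demonstration, and the archive is built in a temporary directory so the example is self-contained.

```python
import io
import os
import tarfile
import tempfile

def extract_tar_gz(archive_path, dest_dir):
    """Extract a gzip-compressed tarball, roughly like `tar -zxf` (hypothetical helper)."""
    # Use the safer "data" extraction filter where this Python version provides it.
    options = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir, **options)
    return names

payload = b"hello from inside the tarball\n"
with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "file.tar.gz")
    # Build a small archive to extract.
    with tarfile.open(archive_path, "w:gz") as archive:
        entry = tarfile.TarInfo("docs/hello.txt")
        entry.size = len(payload)
        archive.addfile(entry, io.BytesIO(payload))

    out_dir = os.path.join(workdir, "out")
    os.makedirs(out_dir)
    extracted_names = extract_tar_gz(archive_path, out_dir)
    with open(os.path.join(out_dir, "docs", "hello.txt"), "rb") as handle:
        extracted_payload = handle.read()

print(extracted_names)
```

As with the command-line tool, archives from untrusted sources should be treated with care, since member paths are chosen by whoever created the archive.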

zlib is an abstraction of the DEFLATE algorithm in library form; its API supports both the gzip file format and a lightweight data stream format. The zlib stream format, DEFLATE, and the gzip file format were standardized as RFC 1950, RFC 1951, and RFC 1952, respectively.
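The relationship between the three formats can be seen through Python's binding to zlib, where the wbits argument selects the wrapper placed around the DEFLATE data. The following is a small sketch; the helper name deflate_with_wrapper and the sample text are assumptions, and the compression level is fixed so that only the wrapper differs.

```python
import zlib

def deflate_with_wrapper(data, wbits):
    """Compress data with the wrapper selected by wbits (hypothetical helper)."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()

sample = b"the same DEFLATE stream, three different wrappers\n" * 50

raw_deflate = deflate_with_wrapper(sample, -15)  # RFC 1951: no header or trailer
zlib_stream = deflate_with_wrapper(sample, 15)   # RFC 1950: 2-byte header, Adler-32 trailer
gzip_stream = deflate_with_wrapper(sample, 31)   # RFC 1952: 10-byte header, CRC-32 and length trailer

for label, blob in (("raw", raw_deflate), ("zlib", zlib_stream), ("gzip", gzip_stream)):
    print(label, len(blob), "bytes")
```

Stripping the wrapper bytes from either the zlib or the gzip output leaves the same raw DEFLATE data, which is why one library can serve all three formats.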

The gzip format is used in HTTP compression, a technique used to speed up the sending of HTML and other content on the World Wide Web. It is one of the three standard formats for HTTP compression as specified in RFC 2616. This RFC also specifies a zlib format (called "DEFLATE"), which wraps the same DEFLATE data as the gzip format but with less overhead: a minimal gzip header and trailer total 18 bytes, against 6 bytes for zlib. Still, the gzip format is sometimes recommended over zlib because Internet Explorer does not implement the standard correctly and cannot handle the zlib format as specified in RFC 1950. [14]
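A server-side view of this can be sketched in a few lines of Python. The functions accepts_gzip and encode_body are hypothetical helpers, not part of any web framework, and the Accept-Encoding handling is deliberately simplified: it only checks whether gzip is listed and not refused with a quality value of zero, and does not rank the encodings a client offers.

```python
import gzip

def accepts_gzip(accept_encoding):
    """Simplified check: is 'gzip' listed and not refused with q=0?"""
    for item in accept_encoding.split(","):
        token, _, params = item.partition(";")
        if token.strip().lower() != "gzip":
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False

def encode_body(body, accept_encoding):
    """Return (payload, headers) for a response body, gzip-compressed if allowed."""
    if accepts_gzip(accept_encoding):
        payload = gzip.compress(body)
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
    else:
        payload, headers = body, {}
    headers["Content-Length"] = str(len(payload))
    return payload, headers

page = b"<html><body>" + b"<p>hello</p>" * 200 + b"</body></html>"
payload, headers = encode_body(page, "gzip, deflate")
print(headers, len(page), "->", len(payload))
```

The client reverses the step by inspecting Content-Encoding and decompressing the body before handing it to the rest of the application.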

zlib DEFLATE is used internally by the Portable Network Graphics (PNG) format.

Since the late 1990s, bzip2, a file compression utility based on a block-sorting algorithm, has gained some popularity as a gzip replacement. It produces considerably smaller files (especially for source code and other structured text), but at the cost of memory and processing time (up to a factor of 4). [15]

AdvanceCOMP, Zopfli, libdeflate and 7-Zip can produce gzip-compatible files, using an internal DEFLATE implementation with better compression ratios than gzip itself, at the cost of more processor time compared to the reference implementation. [citation needed]

Research published in 2023 showed that simple lossless compression techniques such as gzip could be combined with a k-nearest-neighbor classifier to create an attractive alternative to deep neural networks for text classification in natural language processing. This approach has been shown to equal and in some cases outperform conventional approaches such as BERT, while having low resource requirements, e.g. no requirement for GPU hardware. [16]
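The general idea can be sketched in Python using the normalized compression distance (NCD), a measure of how much better two texts compress together than apart. This is a toy illustration rather than the published method's code: the function names, the four training sentences, and the labels are invented for the example, and joining texts with a space is an assumption of this sketch.

```python
import gzip
from collections import Counter

def compressed_length(text):
    """Length in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a, b):
    """Normalized compression distance: small when a and b share content."""
    ca, cb = compressed_length(a), compressed_length(b)
    cab = compressed_length(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query, training, k=1):
    """Label the query by majority vote among its k nearest training texts."""
    ranked = sorted(training, key=lambda pair: ncd(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented toy data; a real experiment would use a labeled corpus.
training = [
    ("the striker scored twice and the goalkeeper saved a late penalty", "sport"),
    ("the midfielder was injured before the championship final match", "sport"),
    ("the compiler reported a type error in the generic function", "code"),
    ("the debugger stepped through the recursive parser routine", "code"),
]

print(classify("the goalkeeper saved a penalty in the final match", training))
```

No model is trained and no parameters are learned; all the work is done by the compressor at query time, which is why the approach needs no GPU.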

Notes

  1. Jim Meyering (19 August 2023). "gzip-1.13 released [stable]". Retrieved 20 August 2023.
  2. The 'application/zlib' and 'application/gzip' Media Types. Internet Engineering Task Force. doi:10.17487/RFC6713. RFC 6713. Retrieved 1 March 2014.
  3. Deutsch, L. Peter (May 1996). "GZIP file format specification version 4.3". Internet Engineering Task Force. doi:10.17487/RFC1952. Retrieved 23 July 2019.
  4. Jean-loup Gailly. "GNU Gzip". Gnu.org. Archived from the original on 15 October 2015. Retrieved 11 October 2015.
  5. "GNU Gzip: Advanced usage". Gnu.org. Archived from the original on 24 December 2012. Retrieved 28 November 2012.
  6. "Can gzip compress several files into a single archive?". Gnu.org. Archived from the original on 22 July 2010. Retrieved 27 January 2010.
  7. "tarball, The Jargon File, version 4.4.7". Catb.org. Archived from the original on 20 March 2017. Retrieved 27 January 2010.
  8. "GNU Gzip". The GNU Operating System and the Free Software Movement. 5 February 2023. Retrieved 3 April 2024. gunzip can currently decompress files created by gzip, zip, compress or pack. The detection of the input format is automatic.
  9. "OpenBSD gzip(1) manual page". Openbsd.org. OpenBSD. Retrieved 4 February 2018.
  10. "gzip". Man.freebsd.org. 9 October 2011. Archived from the original on 17 December 2019. Retrieved 1 March 2014.
  11. Mark Adler (2017). "pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines". zlib.net. Archived from the original on 18 December 2018. Retrieved 23 December 2018.
  12. Jean-loup Gailly. "Recovering a damaged .gz file". GZip.org.
  13. "How To Extract / Unzip tar.gz Files From Linux Command Line". Knowledge Base by phoenixNAP. 14 November 2019. Retrieved 12 January 2022.
  14. Lawrence, Eric (21 November 2014). "Compressing the Web". MSDN Blogs > IEInternals. Microsoft. Archived from the original on 28 October 2015. Retrieved 2 November 2015.
  15. "Comparison Tool: 7-zip vs bzip2 vs gzip". compressionratings.com. Archived from the original on 1 November 2014. Retrieved 1 November 2014.
  16. Jiang, Zhiying; Yang, Matthew; Tsirlin, Mikhail; Tang, Raphael; Dai, Yiqin; Lin, Jimmy (July 2023). "'Low-Resource' Text Classification: A Parameter-Free Classification Method with Compressors". Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics: 6810–6828. doi:10.18653/v1/2023.findings-acl.426. S2CID 260668487.
