Compress

compress / uncompress
Original author(s): Spencer Thomas
Initial release: February 1985
Operating system: Unix, Unix-like, IBM i
Type: Command

compress .Z
Filename extension: .Z
Internet media type: application/x-compress
Developed by: Spencer Thomas
Type of format: Data compression

compress is a Unix shell compression program based on the LZW compression algorithm. [1] Compared to gzip's fastest setting, compress is slightly slower at compression, slightly faster at decompression, and has a significantly lower compression ratio. [2] compress uses 1.8 MiB of memory to compress the Hutter Prize data, slightly more than gzip's slowest setting. [3]

The uncompress utility will restore files to their original state after they have been compressed using the compress utility. If no files are specified, the standard input will be uncompressed to the standard output.

In the upcoming revision of POSIX and the Single Unix Specification, these utilities are planned to support the DEFLATE algorithm used in the gzip format. [4]

Description of program

Files compressed by compress are typically given the extension ".Z" (modeled after the earlier pack program, which used the extension ".z"). Most tar programs will pipe their data through compress when given the command line option "-Z". (The tar program on its own does not compress; it just stores multiple files within one tape archive.)

Files can be returned to their original state using uncompress. The usual action of uncompress is not merely to create an uncompressed copy of the file, but also to restore the timestamp and other attributes of the compressed file.

For files produced by compress on other systems, uncompress supports 9- to 16-bit compression.

History

A patent application covering the LZW algorithm used in compress was filed by Sperry Research Center in 1983. Terry Welch published an IEEE article on the algorithm in 1984, [5] but did not note that he had applied for a patent on it. Spencer Thomas of the University of Utah implemented compress from this article in 1984, without realizing that a patent was pending on the LZW algorithm. The GIF image format incorporated LZW compression in the same way, and Unisys later claimed royalties on implementations of GIF. Joseph M. Orost led the team that worked with Thomas and others to create the 'final' (4.0) version of compress, which was published as free software to the 'net.sources' USENET group in 1985. U.S. Patent 4,558,302 was granted in 1985, after which compress could not be used without paying royalties to Sperry Research, which was eventually merged into Unisys.

compress has fallen out of favor in particular user groups because it uses the LZW algorithm, which was covered by a Unisys patent. Because of this, gzip and bzip2 increased in popularity on Linux-based operating systems, helped by their unencumbered algorithms and better compression ratios. compress has, however, maintained a presence on Unix and BSD systems, and the compress and uncompress commands have also been ported to the IBM i operating system. [6]

The US LZW patent expired in 2003, so the algorithm is now in the public domain in the United States. All LZW patents worldwide have also expired (see Graphics Interchange Format#Unisys and LZW patent enforcement).

Special output format

The binary output consists of groups of bits. Each group consists of codes with the same fixed width (9 to 16 bits). Every group except the last is padded on the right with zeroes so that its length is a multiple of the code width times 8 bits. The last group is padded with zeroes to a multiple of 8 bits. More information can be found in an ncompress issue.

Example:

Suppose you want to output ten 9-bit codes, five 10-bit codes and thirteen 11-bit codes. This gives three groups of bits to output: 90 bits, 50 bits and 143 bits.
  • The first group is 90 bits of data + 54 zero bits of padding, so that it is aligned to a multiple of 72 bits (9 bits × 8).
  • The second group is 50 bits of data + 30 zero bits of padding, so that it is aligned to a multiple of 80 bits (10 bits × 8).
  • The third group is 143 bits of data + 1 zero bit of padding, so that it is aligned to a multiple of 8 bits (1 byte only, since this is the last group in the output).
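The padding arithmetic in this example can be checked with a short Python sketch. The function name padded_group_bits and the (count, width) tuple layout are illustrative choices, not part of any compress implementation; the sketch only encodes the alignment rule described above.

```python
def padded_group_bits(code_count, code_width, is_last):
    """Return (data_bits, padding_bits) for one group of equal-width codes."""
    data_bits = code_count * code_width
    # Non-final groups align to code_width * 8 bits; the final group to 8 bits.
    boundary = 8 if is_last else code_width * 8
    # Number of bits needed to reach the next multiple of `boundary`.
    padding_bits = -data_bits % boundary
    return data_bits, padding_bits


# (number of codes, bits per code) for the three groups in the example.
groups = [(10, 9), (5, 10), (13, 11)]
layout = [
    padded_group_bits(count, width, is_last=(i == len(groups) - 1))
    for i, (count, width) in enumerate(groups)
]
print(layout)  # [(90, 54), (50, 30), (143, 1)]
```

Each pair is the data size followed by the padding size in bits, matching the three bullet points.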

This alignment is actually a bug: LZW itself does not require any alignment. The bug is present in the original UNIX compress, ncompress, gzip and even the Windows port, and it has existed for more than 35 years. Because all application/x-compress files were created with this bug, it has to be included in the output specification.

Some compress implementations write random bits from an uninitialized buffer as alignment bits, so there is no guarantee that the alignment bits are zeroes. For full compatibility, a decompressor has to ignore the values of the alignment bits.
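A decoder-side view of this rule might look like the following Python sketch, which reads codes and jumps over the alignment bits without ever inspecting them. It makes two assumptions that the text above does not state: codes are packed least-significant bit first, and the group sizes are known in advance (a real decompressor derives the code width from the growth of the LZW dictionary). The name read_codes is hypothetical.

```python
def read_codes(data, groups):
    """Yield codes from `data` (bytes), skipping alignment bits unread.

    `groups` is a list of (code_count, code_width) pairs; passing them in
    is a simplification for this sketch. Assumes LSB-first bit packing.
    """
    pos = 0  # current bit offset into `data`
    for index, (count, width) in enumerate(groups):
        start = pos
        for _ in range(count):
            code = 0
            for bit in range(width):
                byte = data[(pos + bit) // 8]
                code |= ((byte >> ((pos + bit) % 8)) & 1) << bit
            pos += width
            yield code
        is_last = index == len(groups) - 1
        boundary = 8 if is_last else width * 8
        # Round the group length up to the boundary; padding is never examined.
        pos = start + -(-(pos - start) // boundary) * boundary


# One 9-bit code (257), then one 10-bit code (513). Every padding bit is
# deliberately filled with garbage instead of zeroes.
data = bytes([0x01, 0xFF] + [0xAA] * 7 + [0x01, 0xFE])
decoded = list(read_codes(data, [(1, 9), (1, 10)]))
print(decoded)  # [257, 513]
```

The garbage in the padding has no effect on the decoded codes, which is the behavior a fully compatible decompressor needs.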

Standardization and availability

compress was standardized in the X/Open CAE Specification in 1994, [7] and further in The Open Group Base Specifications, Issue 6 and 7. [8] Linux Standard Base does not require compress. [9]

compress is often not installed by default in Linux distributions, but can be installed from an additional package. [10] compress is available for FreeBSD, OpenBSD, MINIX, Solaris and AIX.

compress is allowed for Point-to-Point Protocol in RFC 1977 and for HTTP/1.1 in RFC 9110, though it is rarely used in modern deployments because the better deflate/gzip is available.

References

  1. Frysinger, Mike. "ncompress: a public domain project". Retrieved 2014-07-30. Compress is a fast, simple LZW file compressor. Compress does not have the highest compression rate, but it is one of the fastest programs to compress data. Compress is the de facto standard in the UNIX community for compressing files.
  2. Gommans, Luc. "compression - What's the difference between gzip and compress?". Unix & Linux Stack Exchange.
  3. "Large Text Compression Benchmark". mattmahoney.net. compress 4.3d....
  4. "0001041: Encourage implementations to include better integrity checksum, compression and decompression utilities if possible". Austin Group Bug Tracker. Retrieved 2017-11-23.
  5. Welch, Terry A. (1984). "A technique for high performance data compression" (PDF). IEEE Computer. 17 (6): 8–19. doi:10.1109/MC.1984.1659158. S2CID 2055321.
  6. IBM. "IBM System i Version 7.2 Programming Qshell" (PDF). IBM. Retrieved 2020-09-05.
  7. X/Open CAE Specification Commands and Utilities Issue 4, Version 2 (pdf), 1994, opengroup.org
  8. compress: Shell and Utilities Reference, The Single UNIX Specification, Version 3 from The Open Group
  9. Chapter 17. Commands and Utilities in Linux Standard Base Core Specification 5.0.0, linuxfoundation.org
  10. ncompress, pkgs.org