Snappy (compression)

Original author(s): Jeff Dean, Sanjay Ghemawat, Steinar H. Gunderson
Developer(s): Google
Initial release: March 18, 2011
Stable release: 1.1.10 / March 9, 2023 [1]
Repository: github.com/google/snappy
Written in: C++
Operating system: Cross-platform
Platform: Portable
Size: 1.1 MiB
Type: Data compression
License: Apache 2.0 (up to 1.0.1), New BSD thereafter
Website: google.github.io/snappy/
Snappy framed [2]
Filename extension: .sz
Internet media type: application/x-snappy-framed
Magic number: ff 06 00 00 73 4e 61 50 70 59 (FF 06 00 00 "sNaPpY")
Type of format: Data compression
Open format?: Yes
Free format?: Yes
Website: github.com/google/snappy/blob/main/framing_format.txt

Snappy (previously known as Zippy) is a fast data compression and decompression library written in C++ by Google, based on ideas from LZ77 and open-sourced in 2011. [3] [4] It does not aim for maximum compression or for compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression runs at about 250 MB/s and decompression at about 500 MB/s on a single core of a circa-2011 "Westmere" 2.26 GHz Core i7 processor in 64-bit mode; the resulting compression ratio is 20–100% lower than gzip's. [5]


Snappy is widely used in Google projects such as Bigtable and MapReduce, and in compressing data for Google's internal RPC systems. It is also used in open-source projects such as MariaDB ColumnStore, [6] Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, RocksDB, Lucene, Spark, InfluxDB, [7] and Ceph. [8] Firefox uses Snappy to compress data in localStorage. [9] During decompression the compressed stream is checked, so errors are detected rather than producing garbage output. Snappy does not use inline assembler (except for some optimizations [10]) and is portable.

Stream format

Snappy encoding is not bit-oriented but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no entropy encoder, such as Huffman coding or arithmetic coding.

The first bytes of the stream hold the length of the uncompressed data, stored as a little-endian varint, [11]:section 1 a variable-length code. The lower seven bits of each byte carry data, and the high bit flags whether another length byte follows: it is set on all but the last byte of the field.
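A minimal sketch of such a varint decoder (DecodeVarint32 is an illustrative helper, not part of the library's API) might look like this in C++:

#include <cstddef>
#include <cstdint>

// Decode the little-endian varint that opens a raw Snappy stream.
// Each byte contributes its low seven bits; a set high bit means
// another byte follows. Returns the number of bytes consumed, or 0
// if the field is truncated or overlong.
size_t DecodeVarint32(const uint8_t* p, size_t n, uint32_t* value) {
  uint32_t result = 0;
  for (size_t i = 0; i < n && i < 5; ++i) {  // a 32-bit value fits in 5 bytes
    result |= static_cast<uint32_t>(p[i] & 0x7f) << (7 * i);
    if ((p[i] & 0x80) == 0) {                // high bit clear: last byte
      *value = result;
      return i + 1;
    }
  }
  return 0;
}

In the example stream below, the single byte 51₁₆ has its high bit clear, so it decodes to 81 in one byte.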

The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the lower two bits of the first byte (tag byte) of the element: [12]

00₂: literal (uncompressed bytes follow the tag)
01₂: copy with a 1-byte offset
10₂: copy with a 2-byte offset
11₂: copy with a 4-byte offset

A copy refers back into the dictionary, i.e. the just-decompressed data. The offset is the distance back from the current position into the already decompressed stream, and the length is the number of bytes to copy from the dictionary. The size of the dictionary was limited to 32,768 bytes by the 1.0 Snappy compressor and raised to 65,536 bytes in version 1.1.
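Because the offset may be smaller than the length, a copy can overlap the bytes it is producing (for example, offset 1 with length 4 repeats the previous byte four times), so a conforming decoder copies byte by byte. A minimal sketch, again with an illustrative helper name:

#include <cstddef>
#include <cstdint>
#include <vector>

// Apply one copy element: append len bytes to out, each read offset
// bytes behind the current end of the output. The source window may
// overlap the bytes being written, so copy one byte at a time.
bool EmitCopy(std::vector<uint8_t>& out, size_t offset, size_t len) {
  if (offset == 0 || offset > out.size()) return false;  // invalid back-reference
  for (size_t i = 0; i < len; ++i)
    out.push_back(out[out.size() - offset]);
  return true;
}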

The complete official description of the Snappy format can be found in the Google GitHub repository. [11]

Example of a compressed stream

The text

Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project.

may be compressed to this, shown as hex data with explanations:

000000 51 f0 42 57 69 6b 69 70 65 64 69 61 20 69 73 20  >Q.BWikipedia is <
000010 61 20 66 72 65 65 2c 20 77 65 62 2d 62 61 73 65  >a free, web-base<
000020 64 2c 20 63 6f 6c 6c 61 62 6f 72 61 74 69 76 65  >d, collaborative<
000030 2c 20 6d 75 6c 74 69 6c 69 6e 67 75 61 6c 20 65  >, multilingual e<

The stream starts with the length of the uncompressed data as a varint [11]:section 1 – thus the first byte, with the high bit clear, corresponds to a length of 51₁₆ = 81 bytes.

The first block must be a literal, and the bytes f0 42 encode one: the tag byte breaks down as f0₁₆ ⇒ len−1 = 111100₂; type = 00₂. Type 0 signifies a literal, and a len−1 field of 111100₂ = 60 means that length−1 is instead read from the following byte, here 42₁₆ = 66, giving a length of 67. The first 67 bytes of the text ("Wikipedia is a free, web-based, collaborative, multilingual encyclo") follow. [11]:2.1

000040 6e 63 79 63 6c 6f 09 3f 1c 70 72 6f 6a 65 63 74  >ncyclo.?.project<
000050 2e                                               >.<

The next block's header consists of 09 3f, broken down as 09₁₆ ⇒ offh = 000₂; len−4 = 010₂; type = 01₂. Type 1 indicates a "copy with 1-byte offset": the length to copy works out to 010₂ + 4 = 6 bytes, and the offset is an 11-bit integer whose top three bits are offh and whose low eight bits are the next byte, 3f₁₆, so {offh}{3f₁₆} = 00000111111₂ = 63. [11]:2.2,2.2.1

This means: copy 6 bytes, starting 63 bytes back. Since 67 bytes have already been decompressed, this amounts to copying 6 bytes starting at position 4 (the fifth byte), which produces "pedia ".

This block has no other content, and thus the following block starts immediately after: 1c₁₆ ⇒ len−1 = 000111₂; type = 00₂, i.e. a literal of length 000111₂ + 1 = 8. [11]:2.1 The final part of the text ("project.") follows.
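Putting the pieces together, the following simplified decoder is enough to reproduce this example. It reuses the DecodeVarint32 and EmitCopy sketches from above, handles only the two element types that occur here (literal and copy with 1-byte offset), and omits the bounds checks a real decoder needs:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

bool DecodeSimple(const std::vector<uint8_t>& in, std::string* out) {
  uint32_t total = 0;
  size_t pos = DecodeVarint32(in.data(), in.size(), &total);
  if (pos == 0) return false;
  std::vector<uint8_t> buf;
  while (pos < in.size()) {
    uint8_t tag = in[pos++];
    switch (tag & 0x03) {                      // element type: low two bits
      case 0: {                                // literal
        uint32_t len = tag >> 2;               // upper six bits store length-1...
        if (len >= 60) {                       // ...unless 60..63: length-1 is in
          size_t extra = len - 59;             // the next 1..4 bytes
          len = 0;
          for (size_t i = 0; i < extra; ++i)
            len |= static_cast<uint32_t>(in[pos++]) << (8 * i);
        }
        len += 1;
        buf.insert(buf.end(), in.begin() + pos, in.begin() + pos + len);
        pos += len;
        break;
      }
      case 1: {                                // copy with 1-byte offset
        size_t len = ((tag >> 2) & 0x07) + 4;  // 3 bits: lengths 4..11
        size_t offset = (static_cast<size_t>(tag >> 5) << 8) | in[pos++];
        if (!EmitCopy(buf, offset, len)) return false;
        break;
      }
      default:
        return false;                          // types 2 and 3 omitted here
    }
  }
  out->assign(buf.begin(), buf.end());
  return buf.size() == total;                  // simple output size check
}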

In this example, all repeated substrings of four or more characters were eliminated by the compression process. More conventional compressors can compress this text better: unlike compression methods such as gzip and bzip2, Snappy uses no entropy encoding to pack the alphabet into the bit stream.

Framing format

A raw Snappy stream accepts inputs of at most 4 GiB − 1 bytes, [11]:section 1 can add significant overhead to sections that are incompressible or compress poorly, is not self-identifying, and has no data integrity mechanism beyond a simple output size check.

To address these issues, the Snappy framing format [2] ("Snappy framed") may be used. It breaks the input into chunks of up to 64 KiB, [2]:4.2,4.3 each preceded by a 4-byte chunk header: a one-byte chunk type and a three-byte little-endian length. [2]:section 1 The principal chunk types are:

ff₁₆: stream identifier, whose six payload bytes are "sNaPpY"
00₁₆: compressed data
01₁₆: uncompressed data
fe₁₆: padding

Both types of data chunk also contain a CRC-32C checksum of the uncompressed data.

Chunks of types 02₁₆–7F₁₆ are reserved and must result in errors, [2]:4.5 while chunks of types 80₁₆–FE₁₆ may be ignored by decompressors that do not understand them. [2]:4.4,4.6
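As a sketch of how a reader might process this framing (ChunkView, ParseChunk and MaskChecksum are illustrative names; the masking formula is the one given in the framing specification [2]):

#include <cstddef>
#include <cstdint>

// One chunk of a "Snappy framed" stream: a 1-byte type, a 3-byte
// little-endian length, then len payload bytes.
struct ChunkView {
  uint8_t type;
  uint32_t len;
  const uint8_t* payload;
};

// Parse the 4-byte chunk header at p; fails if fewer than 4 + len
// bytes are available.
bool ParseChunk(const uint8_t* p, size_t n, ChunkView* c) {
  if (n < 4) return false;
  c->type = p[0];
  c->len = static_cast<uint32_t>(p[1]) |
           static_cast<uint32_t>(p[2]) << 8 |
           static_cast<uint32_t>(p[3]) << 16;
  if (n - 4 < c->len) return false;
  c->payload = p + 4;
  return true;
}

// The stored CRC-32C is masked so that data which itself contains
// CRCs does not defeat the check.
uint32_t MaskChecksum(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + 0xa282ead8;
}

A reader then dispatches on the type byte: for data chunks (types 00₁₆ and 01₁₆) the first four payload bytes hold the masked checksum, a stream identifier chunk (ff₁₆) must contain the six bytes "sNaPpY", and the reserved ranges are rejected or skipped as described above.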

Interfaces

Snappy distributions include C++ and C bindings. Third-party bindings and ports are available for [13] C#, Common Lisp, Crystal, Erlang, Go, Haskell, Lua, Java, Nim, Node.js, Perl, PHP, Python, R, Ruby, Rust, Smalltalk, and OpenCL. [14] [15] A command-line interface program is also available. [16]
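A minimal round trip through the library's core C++ entry points, snappy::Compress and snappy::Uncompress, looks roughly like this:

#include <snappy.h>

#include <cassert>
#include <iostream>
#include <string>

int main() {
  const std::string input =
      "Wikipedia is a free, web-based, collaborative, multilingual "
      "encyclopedia project.";

  // Compress into a raw Snappy stream (no framing, no checksum).
  std::string compressed;
  snappy::Compress(input.data(), input.size(), &compressed);

  // Uncompress returns false on a corrupt stream instead of
  // producing garbage output.
  std::string restored;
  bool ok = snappy::Uncompress(compressed.data(), compressed.size(), &restored);
  assert(ok && restored == input);

  std::cout << input.size() << " -> " << compressed.size() << " bytes\n";
}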

See also

Lossless compression
zlib
bzip2
LZ77 and LZ78
Deflate
ZIP (file format)
compress (software)
Delta encoding
Lempel–Ziv–Markov chain algorithm (LZMA)
Standard Compression Scheme for Unicode
rzip
Executable compression
Lempel–Ziv–Storer–Szymanski (LZSS)
Robust Header Compression
HTTP compression
ZPAQ
Lempel–Ziv–Stac
liblzg
LZ4

References

  1. "Releases - google/snappy" . Retrieved 4 October 2023 via GitHub.
  2. 1 2 3 4 5 6 7 8 9 "Snappy framing format description". GitHub . 26 October 2021.
  3. "Google Snappy–A Fast Compressing Library". InfoQ. Retrieved August 1, 2011.
  4. "Google open sources MapReduce compression: In the name of speed". The Register. 2011-03-24.
  5. "Snappy: A fast compressor/decompressor: Readme (Snappy vs lzo vs zlib)". Google Code. Archived from the original on September 8, 2015. Retrieved August 1, 2011.
  6. "ColumnStore Storage Architecture". MariaDB KnowledgeBase.
  7. "snappy: A fast compressor/decompressor". Project page at Google Code.
  8. "Compression — Ceph Documentation" . Retrieved 2024-01-03.
  9. "SnappyUtils.cpp - mozsearch" . Retrieved 2024-01-03.
  10. "Add a loop alignment directive to work around a performance regression. · google/snappy@824e671". GitHub.
  11. "Snappy compressed format description". GitHub. 26 October 2021.
  12. "GitHub - google/snappy: A fast compressor/decompressor". November 11, 2019 via GitHub.
  13. "snappy". snappy.
  14. "Xilinx". Xilinx.
  15. "InAccel". InAccel.
  16. "snappy-tools: snappy(1): Snappy compression and decompression with and without framing". sourcehut. Retrieved 2024-02-15.