Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; more generally this is known as data differencing. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in revision control software).
The differences are recorded in discrete files called "deltas" or "diffs". In situations where differences are small – for example, the change of a few words in a large document or the change of a few records in a large table – delta encoding greatly reduces data redundancy. Collections of unique deltas are substantially more space-efficient than their non-encoded equivalents.
From a logical point of view, the difference between two data values is the information required to obtain one value from the other – see relative entropy. The difference between identical values (under some equivalence) is often called 0 or the neutral element.
Perhaps the simplest example is storing values of bytes as differences (deltas) between sequential values, rather than the values themselves. So, instead of 2, 4, 6, 9, 7, we would store 2, 2, 2, 3, −2. This reduces the variance (range) of the values when neighboring samples are correlated, enabling a lower bit usage for the same data. The IFF 8SVX sound format applies this encoding to raw sound data before applying compression to it. However, not all 8-bit sound samples compress better when delta encoded, and delta encoding is even less useful for 16-bit and higher-resolution samples. Compression algorithms therefore often choose to delta encode only when the compression is better than without. In video compression, on the other hand, delta frames can considerably reduce frame size and are used in virtually every video compression codec.
A delta can be defined in two ways: as a symmetric delta or as a directed delta. A symmetric delta can be expressed as
Δ(v1, v2) = (v1 \ v2) ∪ (v2 \ v1),
where v1 and v2 represent the two versions.
A directed delta, also called a change, is a sequence of (elementary) change operations which, when applied to one version v1, yields another version v2 (note the correspondence to transaction logs in databases). In computer implementations, they typically take the form of a language with two commands: copy data from v1 and write literal data.
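To make the "copy and literal" idea concrete, the following is a minimal sketch in C; the instruction encoding here is hypothetical and does not correspond to any particular delta format. The new version v2 is produced by executing a list of operations that either copy a byte range from v1 or insert literal data.

#include <stddef.h>
#include <string.h>

/* Hypothetical directed-delta instruction: either copy a byte range
   from the old version (v1) or insert literal ("add") data. */
enum op_kind { OP_COPY, OP_ADD };

struct op {
    enum op_kind kind;
    size_t offset;              /* OP_COPY: offset into v1 */
    size_t length;              /* number of bytes this operation produces */
    const unsigned char *data;  /* OP_ADD: pointer to the literal bytes */
};

/* Apply a sequence of operations to v1, writing the new version into v2.
   Returns the number of bytes written. */
size_t apply_delta(const unsigned char *v1, unsigned char *v2,
                   const struct op *ops, size_t nops)
{
    size_t out = 0;
    for (size_t i = 0; i < nops; i++) {
        if (ops[i].kind == OP_COPY)
            memcpy(v2 + out, v1 + ops[i].offset, ops[i].length);
        else
            memcpy(v2 + out, ops[i].data, ops[i].length);
        out += ops[i].length;
    }
    return out;
}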
A variation of delta encoding which encodes differences between the prefixes or suffixes of strings is called incremental encoding. It is particularly effective for sorted lists with small differences between strings, such as a list of words from a dictionary.
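A minimal sketch of the idea in C (the output format of shared-prefix length plus remaining suffix is invented here purely for illustration): each word of a sorted list is stored as the length of the prefix it shares with the previous word, followed by only the differing suffix.

#include <stdio.h>
#include <string.h>

/* Front coding: for a sorted word list, store the length of the prefix
   shared with the previous word and only the differing suffix. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] && a[n] == b[n])
        n++;
    return n;
}

int main(void)
{
    const char *words[] = { "delta", "deltas", "delude", "deluge" };
    const char *prev = "";
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++) {
        size_t shared = common_prefix(prev, words[i]);
        /* e.g. "deltas" after "delta" is encoded as (5, "s") */
        printf("(%zu, \"%s\")\n", shared, words[i] + shared);
        prev = words[i];
    }
    return 0;
}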
The nature of the data to be encoded influences the effectiveness of a particular compression algorithm.
Delta encoding performs best when data has small or constant variation; for an unsorted data set, there may be little to no compression possible with this method.
In a delta-encoded transmission over a network, where only a single copy of the file is available at each end of the communication channel, special error control codes are used to detect which parts of the file have changed since its previous version. For example, rsync uses a rolling checksum algorithm based on Mark Adler's Adler-32 checksum.
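The following is a simplified sketch of such a weak rolling checksum, written in the spirit of the rsync/Adler-32 scheme but not taken from rsync's source; the constants and layout are illustrative. The point is that sliding the window by one byte requires only subtracting the outgoing byte and adding the incoming one, rather than rescanning the whole window.

#include <stdint.h>

#define MOD 65536u  /* sums kept modulo 2^16 in this sketch */

/* Weak rolling checksum over a window of n bytes. */
struct rolling {
    uint32_t a;  /* sum of the bytes in the window */
    uint32_t b;  /* position-weighted sum of the bytes in the window */
    uint32_t n;  /* window length */
};

static void roll_init(struct rolling *r, const unsigned char *buf, uint32_t n)
{
    r->a = 0;
    r->b = 0;
    r->n = n;
    for (uint32_t i = 0; i < n; i++) {
        r->a = (r->a + buf[i]) % MOD;
        r->b = (r->b + (n - i) * buf[i]) % MOD;
    }
}

/* Slide the window one byte: drop `out`, append `in`.  This is O(1),
   which is what makes checking every offset of a large file practical. */
static void roll_slide(struct rolling *r, unsigned char out, unsigned char in)
{
    r->a = (r->a + MOD - out + in) % MOD;
    r->b = (r->b + MOD - (r->n * out) % MOD + r->a) % MOD;
}

static uint32_t roll_digest(const struct rolling *r)
{
    return (r->b << 16) | r->a;
}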
The following C code performs a simple form of delta encoding and decoding on a sequence of characters:
void delta_encode(unsigned char *buffer, int length)
{
    unsigned char last = 0;
    for (int i = 0; i < length; i++) {
        unsigned char current = buffer[i];
        buffer[i] = current - last;   /* store the difference from the previous byte */
        last = current;
    }
}

void delta_decode(unsigned char *buffer, int length)
{
    unsigned char last = 0;
    for (int i = 0; i < length; i++) {
        unsigned char delta = buffer[i];
        buffer[i] = delta + last;     /* add the difference back to recover the original byte */
        last = buffer[i];
    }
}
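For illustration (this driver is not part of the original snippet), applying these two functions to the sequence 2, 4, 6, 9, 7 from earlier round-trips it back to the original values:

#include <stdio.h>

int main(void)
{
    unsigned char buffer[] = { 2, 4, 6, 9, 7 };
    int length = sizeof buffer;

    delta_encode(buffer, length);   /* buffer is now 2, 2, 2, 3, 254 (−2 modulo 256) */
    delta_decode(buffer, length);   /* buffer is restored to 2, 4, 6, 9, 7 */

    for (int i = 0; i < length; i++)
        printf("%d ", buffer[i]);
    printf("\n");
    return 0;
}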
Another instance of use of delta encoding is RFC 3229, "Delta encoding in HTTP", which proposes that HTTP servers should be able to send updated Web pages in the form of differences between versions (deltas). Because most pages change slowly over time rather than being completely rewritten repeatedly, this should decrease Internet traffic:
This document describes how delta encoding can be supported as a compatible extension to HTTP/1.1.
Many HTTP (Hypertext Transport Protocol) requests cause the retrieval of slightly modified instances of resources for which the client already has a cache entry. Research has shown that such modifying updates are frequent, and that the modifications are typically much smaller than the actual entity. In such cases, HTTP would make more efficient use of network bandwidth if it could transfer a minimal description of the changes, rather than the entire new instance of the resource.
[...] We believe that it might be possible to support rsync using the "instance manipulation" framework described later in this document, but this has not been worked out in any detail.
The suggested rsync-based framework was implemented in the rproxy system as a pair of HTTP proxies. [1] Both this system and the basic vcdiff-based implementation are rarely used.
Delta copying is a fast way of copying a file that is partially changed, when a previous version is present on the destination location. With delta copying, only the changed part of a file is copied. It is usually used in backup or file copying software, often to save bandwidth when copying between computers over a private network or the internet. One notable open-source example is rsync. [2] [3] [4]
Many online backup services adopt this methodology, often known simply as deltas, in order to give their users previous versions of the same file from previous backups. This reduces the associated costs, not only in the amount of data that has to be stored as differing versions (since the whole of each changed version of a file still has to be made available for users to access), but also in the uploading (and sometimes the downloading) of each file that has been updated, since only the smaller delta has to be transferred rather than the whole file.
For large software packages, there is usually little data changed between versions. Many vendors choose to use delta transfers to save time and bandwidth.
Diff is a file comparison program, mainly used for text files. By default, it generates symmetric deltas that are reversible. Two formats used for software patches, context and unified, provide additional context lines that allow for tolerating shifts in line numbers.
The Git source code control system employs delta compression in an auxiliary "git repack" operation. Objects in the repository that have not yet been delta-compressed ("loose objects") are compared against a heuristically chosen subset of all other objects, and the common data and differences are concatenated into a "pack file", which is then compressed using conventional methods. In common use cases, where source or data files are changed incrementally between commits, this can result in significant space savings. The repack operation is typically performed as part of the "git gc" [5] process, which is triggered automatically when the number of loose objects or pack files exceeds configured thresholds.
The format is documented in the pack-format page of the Git documentation. It implements a directed delta. [6]
One general format for directed delta encoding is VCDIFF, described in RFC 3284. Free software implementations include Xdelta and open-vcdiff.
Generic Diff Format (GDIFF) is another directed delta encoding format. It was submitted to W3C in 1997. [7] In many cases, VCDIFF achieves a better compression ratio than GDIFF.
Bsdiff is a binary diff program using suffix sorting. For executables that contain many changes in pointer addresses, it performs better than VCDIFF-type "copy and literal" encodings. The intent is to find a way to generate a small diff without needing to parse assembly code (as in Google's Courgette). Bsdiff achieves this by allowing "copy" matches with errors, which are then corrected using an extra "add" array of bytewise differences. Since this array is mostly either zero or repeated values for offset changes, it takes up little space after compression. [8]
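The core of that trick can be sketched in a few lines of C (a simplified illustration, not bsdiff's actual code): the decoder copies an approximately matching block from the old file and then adds the bytewise "add" array to correct it.

#include <stddef.h>

/* Simplified sketch of the bsdiff idea: a "copy" that is allowed to be
   approximate, followed by an "add" array holding the bytewise difference
   between the old and new regions.  Because most entries of `add` are zero
   (or small, repeating values where pointer offsets changed), the array
   compresses extremely well. */
void apply_approximate_copy(const unsigned char *old_block,
                            const unsigned char *add,
                            unsigned char *new_block,
                            size_t length)
{
    for (size_t i = 0; i < length; i++)
        new_block[i] = (unsigned char)(old_block[i] + add[i]);
}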
Bsdiff is useful for delta updates. Google uses bsdiff in Chromium and Android. The deltarpm feature of the RPM Package Manager is based on a heavily modified bsdiff that can use a hash table for matching. [9] FreeBSD also uses bsdiff for updates. [10]
Since the 4.3 release of bsdiff in 2005, various improvements and fixes have been produced for it. Google maintains multiple versions of the code for each of its products. [11] FreeBSD has adopted many of Google's compatible changes, mainly a vulnerability fix and a switch to the faster divsufsort suffix-sorting routine. [12] Debian has a series of performance tweaks to the program. [13]
ddelta is a rewrite of bsdiff proposed for use in Debian's delta updates. Among other efficiency improvements, it uses a sliding window to reduce memory and CPU cost. [14]
JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. Since its introduction in 1992, JPEG has been the most widely used image compression standard in the world, and the most widely used digital image format, with several billion JPEG images produced every day as of 2015.
Portable Network Graphics is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF)—unofficially, the initials PNG stood for the recursive acronym "PNG's not GIF".
zlib is a software library used for data compression as well as a data format. zlib was written by Jean-loup Gailly and Mark Adler and is an abstraction of the DEFLATE compression algorithm used in their gzip file compression program. zlib is also a crucial component of many software platforms, including Linux, macOS, and iOS. It has also been used in gaming consoles such as the PlayStation 4, PlayStation 3, Wii U, Wii, Xbox One and Xbox 360.
bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.
In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other. The utility displays the changes in one of several standard formats, such that both humans and computers can parse the changes and use them for patching.
rsync is a utility for transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like operating systems and is under the GPL-3.0-or-later license.
OpenEXR is a high-dynamic range, multi-channel raster file format, released as an open standard along with a set of software tools created by Industrial Light & Magic (ILM), under a free software license similar to the BSD license.
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.
ANIM is a file format, used to store digital movies and computer generated animations, and is a variation of the ILBM format, which is a subformat of Interchange File Format.
WavPack is a free and open-source lossless audio compression format and an application implementing the format. It is unique in that it supports hybrid audio compression alongside normal lossless compression, which is otherwise similar to how FLAC works. It also supports compressing a wide variety of lossless source formats, including various variants of PCM and DSD as used in SACDs, and it supports surround audio.
In computer graphics, the X Window System used X BitMap (XBM), a plain-text binary image format, for storing cursor and icon bitmaps used in the X GUI. The XBM format was superseded by XPM, which first appeared for X11 in 1989.
Editing documents, program code, or any data always risks introducing errors. By displaying the differences between two or more sets of data, file comparison tools can make computing simpler and more efficient, focusing attention on new data while ignoring what did not change. Such output is generically known as a diff, after the Unix diff utility, and there is a range of ways to compare data sources and display the results.
HTTP compression is a capability that can be built into web servers and web clients to improve transfer speed and bandwidth utilization.
A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. VLQ is identical to LEB128 except in endianness. See the example below.
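As an illustrative sketch in C (not taken from any particular specification's sample code), encoding an unsigned integer as a VLQ with the most significant 7-bit group first looks like this:

#include <stdint.h>
#include <stddef.h>

/* Encode value as a variable-length quantity: 7 data bits per byte,
   most significant group first, with bit 7 set on every byte except
   the last to mark continuation.  Returns the number of bytes written. */
size_t vlq_encode(uint32_t value, unsigned char *out)
{
    unsigned char groups[5];
    size_t n = 0;

    /* Collect 7-bit groups, least significant first. */
    do {
        groups[n++] = value & 0x7F;
        value >>= 7;
    } while (value != 0);

    /* Emit them most significant first, setting the continuation bit
       on all but the final byte; e.g. 128 encodes as 0x81 0x00. */
    size_t written = 0;
    while (n-- > 0)
        out[written++] = (unsigned char)(groups[n] | (n > 0 ? 0x80 : 0x00));
    return written;
}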
Xdelta is a command line program for delta encoding, which generates the difference between two files. This is similar to diff and patch, but it is targeted for binary files and does not generate human readable output.
In computer science and information theory, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data.
A delta update is a software update that requires the user to download only those parts of the software's code that are new, or have been changed from their previous state, in contrast to having to download the entire program. The use of delta updates can save significant amounts of time and computing bandwidth. The name "delta" derives from the mathematical science use of the Greek letter delta, Δ or δ to denote change.
JPEG XL is a royalty-free open standard for the compressed representation of raster graphics images. It defines a graphics file format and the abstract device for coding JPEG XL bitstreams. It is developed by the Joint Photographic Experts Group (JPEG) and standardized by the International Electrotechnical Commission (IEC) and the International Organization for Standardization (ISO) as the international standard ISO/IEC 18181. It is a superset of JPEG/JFIF encoding, with a compression mode built on a traditional block-based transform coding core and a "modular mode" for synthetic image content and lossless compression; optional lossy quantization enables both lossless and lossy compression.
In data compression, BCJ, short for branch/call/jump, refers to a technique that improves the compression of machine code by replacing relative branch addresses with absolute ones. This allows a Lempel–Ziv compressor to identify duplicate targets and encode them more efficiently. On decompression, the inverse filter restores the original encoding. Different BCJ filters are used for different instruction sets, as each uses different opcodes for branching.
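A heavily simplified sketch of the idea in C, modeled loosely on an x86-style filter (real filters, such as those in xz, handle more opcodes and edge cases): whenever a CALL opcode with a 32-bit relative displacement is seen, the displacement is rewritten as an absolute target, so that two calls to the same function encode the same four bytes.

#include <stdint.h>
#include <stddef.h>

/* Simplified BCJ-style filter sketch for x86 CALL (opcode 0xE8) with a
   32-bit relative displacement.  Converting the displacement to an
   absolute address makes repeated calls to the same target byte-identical,
   which LZ-style compressors can then match.  The decoder applies the
   inverse transform (subtracting the position again). */
void bcj_encode(unsigned char *buf, size_t size)
{
    for (size_t i = 0; i + 5 <= size; i++) {
        if (buf[i] == 0xE8) {
            uint32_t rel = (uint32_t)buf[i + 1]
                         | ((uint32_t)buf[i + 2] << 8)
                         | ((uint32_t)buf[i + 3] << 16)
                         | ((uint32_t)buf[i + 4] << 24);
            uint32_t abs = rel + (uint32_t)(i + 5);  /* target = next-instruction offset + rel */
            buf[i + 1] = (unsigned char)(abs);
            buf[i + 2] = (unsigned char)(abs >> 8);
            buf[i + 3] = (unsigned char)(abs >> 16);
            buf[i + 4] = (unsigned char)(abs >> 24);
            i += 4;  /* skip over the displacement that was just rewritten */
        }
    }
}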
The Quite OK Image Format (QOI) is a specification for lossless image compression of 24-bit or 32-bit color raster (bitmapped) images, invented by Dominic Szablewski and first announced on 24 November 2021.