Data compression ratio

Data compression ratio, also known as compression power, is a measurement of the relative reduction in size of data representation produced by a data compression algorithm. It is typically expressed as the ratio of the uncompressed size to the compressed size.

Definition

Data compression ratio is defined as the ratio between the uncompressed size and the compressed size: [1] [2] [3] [4] [5]

Compression Ratio = Uncompressed Size / Compressed Size

Thus, a representation that compresses a file's storage size from 10 MB to 2 MB has a compression ratio of 10/2 = 5, often notated as an explicit ratio, 5:1 (read "five to one"), or as an implicit ratio, 5/1. This formulation applies equally for compression, where the uncompressed size is that of the original, and for decompression, where the uncompressed size is that of the reproduction.

Sometimes the space saving is given instead, which is defined as the reduction in size relative to the uncompressed size:

Space Saving = 1 − Compressed Size / Uncompressed Size

Thus, a representation that compresses the storage size of a file from 10 MB to 2 MB yields a space saving of 1 - 2/10 = 0.8, often notated as a percentage, 80%.
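
As a minimal illustration of the two definitions above, the following Python sketch (the function names are ours, not part of any standard library) reproduces the 10 MB to 2 MB example:

```python
def compression_ratio(uncompressed_size, compressed_size):
    """Uncompressed size divided by compressed size."""
    return uncompressed_size / compressed_size

def space_saving(uncompressed_size, compressed_size):
    """Reduction in size relative to the uncompressed size."""
    return 1 - compressed_size / uncompressed_size

# The 10 MB -> 2 MB example from the text (sizes in bytes):
print(compression_ratio(10_000_000, 2_000_000))  # 5.0, i.e. a 5:1 ratio
print(space_saving(10_000_000, 2_000_000))       # 0.8, i.e. 80%
```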

For signals of indefinite size, such as streaming audio and video, the compression ratio is defined in terms of uncompressed and compressed data rates instead of data sizes:

Compression Ratio = Uncompressed Data Rate / Compressed Data Rate

and instead of space saving, one speaks of data-rate saving, which is defined as the data-rate reduction relative to the uncompressed data rate:

Data-Rate Saving = 1 − Compressed Data Rate / Uncompressed Data Rate

For example, uncompressed songs in CD format have a data rate of 16 bits/channel × 2 channels × 44.1 kHz ≈ 1.4 Mbit/s, whereas AAC files on an iPod are typically compressed to 128 kbit/s, yielding a compression ratio of about 11, for a data-rate saving of 0.91, or 91%.

When the uncompressed data rate is known, the compression ratio can be inferred from the compressed data rate.
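
The CD/AAC figures above can be checked directly; a short Python sketch, assuming the 128 kbit/s stream rate quoted in the text:

```python
# CD audio: 16 bits/channel x 2 channels x 44,100 samples/s per channel
uncompressed_rate = 16 * 2 * 44_100      # 1,411,200 bit/s, i.e. about 1.4 Mbit/s
compressed_rate = 128_000                # a typical 128 kbit/s AAC stream

ratio = uncompressed_rate / compressed_rate            # about 11
rate_saving = 1 - compressed_rate / uncompressed_rate  # about 0.91
print(f"compression ratio {ratio:.1f}:1, data-rate saving {rate_saving:.0%}")
```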

Lossless vs. Lossy

Lossless compression of digitized data such as video, digitized film, and audio preserves all the information, but it does not generally achieve compression ratios much better than 2:1 because of the intrinsic entropy of the data. Compression algorithms that provide higher ratios either incur very large overheads or work only for specific data sequences (e.g. compressing a file with mostly zeros). In contrast, lossy compression (e.g. JPEG for images, or MP3 and Opus for audio, as used in Bluetooth audio streaming) can achieve much higher compression ratios at the cost of a decrease in quality, since visual or audio compression artifacts from the loss of important information are introduced. A compression ratio of at least 50:1 is needed to get 1080i video into a 20 Mbit/s MPEG transport stream. [1]
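
As a rough sanity check on the 50:1 figure, one can assume 8-bit 4:2:2 sampling of the active 1920×1080 picture at about 30 frames per second (these sampling assumptions are ours; the cited source does not spell them out):

```python
# Approximate uncompressed rate of 1080i video: 1920x1080 active picture,
# 4:2:2 chroma subsampling (two samples per pixel on average), 8 bits per
# sample, ~30 frames (60 interlaced fields) per second.
uncompressed_rate = 1920 * 1080 * 2 * 8 * 30      # ~995 Mbit/s
transport_stream_rate = 20_000_000                # 20 Mbit/s MPEG transport stream
print(uncompressed_rate / transport_stream_rate)  # ~50, hence "at least 50:1"
```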

Uses

The data compression ratio can serve as a measure of the complexity of a data set or signal; in particular, it is used to approximate algorithmic complexity. It is also used to gauge how much a file can be compressed without increasing its original size.
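
For example, the ratio achieved by an off-the-shelf lossless compressor can act as a cheap, computable proxy for how complex (how far from random) a piece of data is; a minimal sketch using Python's zlib module:

```python
import os
import zlib

def compressibility(data: bytes) -> float:
    """Compression ratio under DEFLATE, used as a rough proxy for complexity:
    regular data compresses well, random-looking data hardly at all."""
    return len(data) / len(zlib.compress(data, 9))

print(f"{compressibility(b'abab' * 2500):.1f}")      # highly regular: large ratio
print(f"{compressibility(os.urandom(10_000)):.1f}")  # random bytes: ratio near 1
```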

Related Research Articles

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

Signal-to-noise ratio is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. SNR is defined as the ratio of signal power to noise power, often expressed in decibels. A ratio higher than 1:1 indicates more signal than noise.
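
For reference, the decibel form of that definition is a one-line computation (the values below are purely illustrative):

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio expressed in decibels: 10 * log10(Ps / Pn)."""
    return 10 * math.log10(signal_power / noise_power)

print(snr_db(signal_power=1.0, noise_power=0.001))  # 30.0 dB
```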

A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT, first proposed by Nasir Ahmed in 1972, is a widely used transformation technique in signal processing and data compression. It is used in most digital media, including digital images, digital video, digital audio, digital television, digital radio, and speech coding. DCTs are also important to numerous other applications in science and engineering, such as digital signal processing, telecommunication devices, reducing network bandwidth usage, and spectral methods for the numerical solution of partial differential equations.
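
A naive sketch of the (unnormalized) DCT-II illustrates the energy-compaction property that makes the transform useful for compression; the helper name and test signal are ours, not from any library:

```python
import math

def dct_ii(x):
    """Naive O(N^2) DCT-II (unnormalized): X[k] = sum_n x[n]*cos(pi/N*(n+0.5)*k)."""
    n_samples = len(x)
    return [sum(x[n] * math.cos(math.pi / n_samples * (n + 0.5) * k)
                for n in range(n_samples))
            for k in range(n_samples)]

# For a smooth input, most of the energy concentrates in the lowest-frequency
# coefficients, which then quantize and entropy-code compactly.
print([round(c, 2) for c in dct_ii([10, 11, 12, 13, 14, 15, 16, 17])])
```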

In computing, Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding. It was designed by Phil Katz, for version 2 of his PKZIP archiving tool. Deflate was later specified in RFC 1951 (1996).
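
Python's zlib module implements DEFLATE, so the compression ratio defined in this article can be measured directly; a small sketch with a deliberately repetitive input:

```python
import zlib

text = b"data compression ratio " * 1000          # highly repetitive input
packed = zlib.compress(text, 9)                   # DEFLATE data in a zlib wrapper
print(f"ratio {len(text) / len(packed):.0f}:1, "
      f"saving {1 - len(packed) / len(text):.0%}")
assert zlib.decompress(packed) == text            # lossless: exact round trip
```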

Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Alphabets following a geometric distribution will have a Golomb code as an optimal prefix code, making Golomb coding highly suitable for situations in which the occurrence of small values in the input stream is significantly more likely than large values.
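
A minimal sketch of the Rice special case (a Golomb code whose parameter m is a power of two), which keeps the remainder encoding trivial; the function name is ours:

```python
def rice_encode(n, k):
    """Rice code (Golomb code with m = 2**k) of a non-negative integer n:
    the quotient in unary, a '0' terminator, then the remainder in k bits."""
    q, r = divmod(n, 1 << k)
    return "1" * q + "0" + format(r, f"0{k}b")

# Small values get short codewords, large values longer ones -- efficient when
# small inputs are much more likely, as with geometrically distributed data.
for n in (0, 1, 2, 3, 10, 40):
    print(n, rice_encode(n, k=2))
```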

In telecommunications and computing, bit rate is the number of bits that are conveyed or processed per unit of time.

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. Quantization is involved to some degree in nearly all digital signal processing, as the process of representing a signal in digital form ordinarily involves rounding. Quantization also forms the core of essentially all lossy compression algorithms.
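
A uniform quantizer is only a few lines; the sketch below (with an illustrative step size) shows the basic trade-off behind lossy coding:

```python
def quantize(x, step):
    """Uniform quantizer: map x to the nearest multiple of `step`."""
    return step * round(x / step)

samples = [0.12, 0.49, 0.51, 2.37, -1.86]
print([quantize(s, step=0.5) for s in samples])   # [0.0, 0.5, 0.5, 2.5, -2.0]
# A coarser step means fewer distinct output values (fewer bits to code) but a
# larger rounding error -- the quality-for-bitrate trade of lossy compression.
```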

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

Generation loss is the loss of quality between subsequent copies or transcodes of data. Anything that reduces the quality of the representation when copying, and would cause further reduction in quality on making a copy of the copy, can be considered a form of generation loss. File size increases are a common result of generation loss, as the introduction of artifacts may actually increase the entropy of the data through each generation.

In information theory, redundancy measures the fractional difference between the entropy H(X) of an ensemble X and its maximum possible value (the logarithm of the alphabet size). Informally, it is the amount of wasted "space" used to transmit certain data. Data compression is a way to reduce or eliminate unwanted redundancy, while forward error correction is a way of adding desired redundancy for purposes of error detection and correction when communicating over a noisy channel of limited capacity.
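
As a concrete reading of that definition, the zeroth-order redundancy of a byte string can be estimated from its empirical symbol frequencies; the function below is an illustrative sketch, taking 8 bits per byte as the maximum entropy:

```python
import math
from collections import Counter

def redundancy(data: bytes) -> float:
    """1 - H(X) / Hmax, where H(X) is the empirical per-byte entropy of `data`
    and Hmax = 8 bits is the maximum possible entropy of a byte value."""
    counts = Counter(data)
    n = len(data)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return 1 - entropy / 8

print(redundancy(b"aaaaabbbbbcccccddddd"))  # 0.75: only 2 of 8 bits are "used"
```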

In mathematics, a wavelet series is a representation of a square-integrable function by a certain orthonormal series generated by a wavelet. This article provides a formal, mathematical definition of an orthonormal wavelet and of the integral wavelet transform.

In cryptography, a one-way compression function is a function that transforms two fixed-length inputs into a fixed-length output. The transformation is "one-way", meaning that it is difficult given a particular output to compute inputs which compress to that output. One-way compression functions are not related to conventional data compression algorithms, which instead can be inverted exactly or approximately to the original data.

References

  1. "Pixel grids, bit rate and compression ratio". Broadcast Engineering. 2007-12-01. Archived from the original on 2013-10-10. Retrieved 2013-06-05.
  2. Charles Poynton (2012-02-07). "Digital Video and HD: Algorithms and Interfaces" (2nd ed.). Morgan Kaufmann Publishers. ISBN 9780123919267.
  3. "High Efficiency Video Coding (HEVC) text specification draft 10 (for FDIS & Consent)". JCT-VC. 2013-01-17. Retrieved 2013-06-05.
  4. "The H.264 Advanced Video Coding (AVC) Standard" (PDF). Logitech. Archived (PDF) from the original on 2013-02-19. Retrieved 2013-06-05.
  5. "White Paper on Performance Characteristics of MPEG-2 Long GoP vs AVC-I video compression techniques for Broadcast Applications" (PDF). Sony. Archived (PDF) from the original on 2009-12-29. Retrieved 2013-06-05.