Weissman score

Last updated

The Weissman score is a performance metric for lossless compression applications. It was developed by Tsachy Weissman, a professor at Stanford University, and Vinith Misra, a graduate student, at the request of producers for HBO's television series Silicon Valley , a television show about a fictional tech start-up working on a data compression algorithm. [1] [2] [3] [4] It compares both required time and compression ratio of measured applications, with those of a de facto standard according to the data type.

Contents

The formula is the following; where r is the compression ratio, T is the time required to compress, the overlined ones are the same metrics for a standard compressor, and alpha is a scaling constant. [1]

Weissman score has been used by Daniel Reiter Horn and Mehant Baid of Dropbox to explain real-world work on lossless compression. According to the authors it "favors compression speed over ratio in most cases." [5]

Example

This example shows the score for the data of the Hutter Prize, [6] using the paq8f as a standard and 1 as the scaling constant.

ApplicationCompression ratioCompression time [min]Weissman score
paq8f5.4676003001.000000
raq8g5.5149904200.720477
paq8hkcc5.6825933001.039321
paq8hp15.6925663001.041145
paq8hp25.7502793001.051701
paq8hp35.8000333001.060801
paq8hp45.8688293001.073826
paq8hp55.9177193001.082325
paq8hp65.9766433001.093102
paq8hp126.1042765400.620247
decomp86.2615745400.63623
decomp86.2762955400.637726

Limitations

Although the value is relative to the standards against which it is compared, the unit used to measure the times changes the score (see examples 1 and 2). This is a consequence of the requirement that the argument of the logarithmic function must be dimensionless. The multiplier also can't have a numeric value of 1 or less, because the logarithm of 1 is 0 (examples 3 and 4), and the logarithm of any value less than 1 is negative (examples 5 and 6); that would result in scores of value 0 (even with changes), undefined, or negative (even if better than positive).

Examples

#Standard compressorScored compressorWeissman scoreObservations
Compression ratioCompression timeLog (compression time)Compression ratioCompression timeLog (compression time)
12.12 min0.301033.43 min0.4771211×(3.4/2.1)×(0.30103/0.477121)=1.021506Change in unit or scale, changes the result.
22.1120 sec2.0791813.4180 sec2.2552731×(3.4/2.1)×(2.079181/2.255273)=1.492632
32.21 min03.31.5 min0.1760911×(3.3/2.2)×(0/0.176091)=0If time is 1, its log is 0; then the score can be 0 or infinity.
42.20.667 min−0.1760913.31 min01×(3.3/2.2)×(−0.176091/0)=infinity
51.60.5 h−0.301032.91.1 h0.0413931×(2.9/1.6)×(−0.30103/0.041393)=−13.18138If time is less than 1, its log is negative; then the score can be negative.
61.61.1 h0.0413931.60.9 h−0.0457571×(1.6/1.6)×(0.041393/−0.045757)=−0.904627

See also

Related Research Articles

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field, in applied mathematics, is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.

<span class="mw-page-title-main">Logarithm</span> Inverse of the exponential function

In mathematics, the logarithm is the inverse function to exponentiation. That means that the logarithm of a number x to the base b is the exponent to which b must be raised to produce x. For example, since 1000 = 103, the logarithm base 10 of 1000 is 3, or log10 (1000) = 3. The logarithm of x to base b is denoted as logb (x), or without parentheses, logbx, or even without the explicit base, log x, when no confusion is possible, or when the base does not matter such as in big O notation.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

<span class="mw-page-title-main">PNG</span> Family of lossless compression file formats for image files

Portable Network Graphics is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF)—unofficially, the initials PNG stood for the recursive acronym "PNG's not GIF".

PCX, standing for PiCture eXchange, was an image file format developed by the now-defunct ZSoft Corporation of Marietta, Georgia, United States. It was the native file format for PC Paintbrush and became one of the first widely accepted DOS imaging standards, although it has since been succeeded by more sophisticated image formats, such as BMP, JPEG, and PNG. PCX files commonly stored palette-indexed images ranging from 2 or 4 colors to 16 and 256 colors, although the format has been extended to record true-color (24-bit) images as well.

Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Alphabets following a geometric distribution will have a Golomb code as an optimal prefix code, making Golomb coding highly suitable for situations in which the occurrence of small values in the input stream is significantly more likely than large values.

<span class="mw-page-title-main">Logit</span> Function in statistics

In statistics, the logit function is the quantile function associated with the standard logistic distribution. It has many uses in data analysis and machine learning, especially in data transformations.

<span class="mw-page-title-main">Binary logarithm</span> Exponent of a power of two

In mathematics, the binary logarithm is the power to which the number 2 must be raised to obtain the value n. That is, for any real number x,

WavPack is a free and open-source lossless audio compression format and application implementing the format. It is unique in the way that it supports hybrid audio compression alongside normal compression which is similar to how FLAC works. It also supports compressing a wide variety of lossless formats, including various variants of PCM and also DSD as used in SACDs, together with its support for surround audio.

In information theory, redundancy measures the fractional difference between the entropy H(X) of an ensemble X, and its maximum possible value . Informally, it is the amount of wasted "space" used to transmit certain data. Data compression is a way to reduce or eliminate unwanted redundancy, while forward error correction is a way of adding desired redundancy for purposes of error detection and correction when communicating over a noisy channel of limited capacity.

An image file format is a file format for a digital image. There are many formats that can be used, such as JPEG, PNG, and GIF. Most formats up until 2022 were for storing 2D images, not 3D ones. The data stored in an image file format may be compressed or uncompressed. If the data is compressed, it may be done so using lossy compression or lossless compression. For graphic design applications, vector formats are often used. Some image file formats support transparency.

Lossless JPEG is a 1993 addition to JPEG standard by the Joint Photographic Experts Group to enable lossless compression. However, the term may also be used to refer to all lossless compression schemes developed by the group, including JPEG 2000 and JPEG-LS.

<span class="mw-page-title-main">Phred quality score</span> Measurement in DNA sequencing

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for the computer program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.

<span class="mw-page-title-main">Progressive Graphics File</span> File format

PGF is a wavelet-based bitmapped image format that employs lossless and lossy data compression. PGF was created to improve upon and replace the JPEG format. It was developed at the same time as JPEG 2000 but with a focus on speed over compression ratio.

FASTQ format is a text-based format for storing both a biological sequence and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

<i>Silicon Valley</i> (TV series) 2014–2019 American television series

Silicon Valley is an American comedy television series created by Mike Judge, John Altschuler and Dave Krinsky. It premiered on HBO on April 6, 2014, and concluded on December 8, 2019, running for six seasons for a total of 53 episodes. Parodying the culture of the technology industry in Silicon Valley, the series focuses on Richard Hendricks, a programmer who founds a startup company called Pied Piper, and chronicles his struggles to maintain his company while facing competition from larger entities. Co-stars include T.J. Miller, Josh Brener, Martin Starr, Kumail Nanjiani, Zach Woods, Amanda Crew, Matt Ross, and Jimmy O. Yang.

<span class="mw-page-title-main">Audio coding format</span> Digitally coded format for audio signals

An audio coding format is a content representation format for storage or transmission of digital audio. Examples of audio coding formats include MP3, AAC, Vorbis, FLAC, and Opus. A specific software or hardware implementation capable of audio compression and decompression to/from a specific audio coding format is called an audio codec; an example of an audio codec is LAME, which is one of several different codecs which implements encoding and decoding audio in the MP3 audio coding format in software.

Tsachy (Itschak) Weissman is a professor of Electrical Engineering at Stanford University. He is the founding director of the Stanford Compression Forum. His research interests include information theory, statistical signal processing, their applications, with recent emphasis on biological applications, in genomics in particular, lossless compression, lossy compression, delay-constrained and complexity-constrained compression and communication, network information theory, feedback communications, directed information, the interplay between estimation theory and information theory, entropy, noise reduction (denoising), filtering, prediction, sequential decision making, learning, and connections with probability, statistics, and computer science.

References

  1. 1 2 Perry, Tekla (July 28, 2014). "A Fictional Compression Metric Moves Into the Real World" . Retrieved January 25, 2016.
  2. Perry, Tekla (July 25, 2014). "A Made-For-TV Compression Algorithm" . Retrieved January 25, 2016.
  3. Sandberg, Elise (April 12, 2014). "HBO's 'Silicon Valley' Tech Advisor on Realism, Possible Elon Musk Cameo". The Hollywood Reporter. Retrieved June 10, 2014.
  4. Jurgensen, John; Rusli, Evelyn M. (April 3, 2014). "There's a New Geek in Town: HBO's 'Silicon Valley'". The Wall Street Journal. Retrieved June 10, 2014.
  5. "Lossless compression with Brotli in Rust for a bit of Pied Piper on the backend". Dropbox Tech Blog. Retrieved 2017-06-24.
  6. Hutter, Marcus (July 2016). "Contestants" . Retrieved January 25, 2016.