Canonical Huffman code

In computer science and information theory, a canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner. Rather than storing the structure of the code tree explicitly, canonical Huffman codes are ordered in such a way that it suffices to only store the lengths of the codewords, which reduces the overhead of the codebook.

Motivation

Data compressors generally work in one of two ways. Either the decompressor can infer what codebook the compressor has used from previous context, or the compressor must tell the decompressor what the codebook is. Since a canonical Huffman codebook can be stored especially efficiently, most compressors start by generating a "normal" Huffman codebook, and then convert it to canonical Huffman before using it.

In order for a symbol code scheme such as the Huffman code to be decompressed, the same model that the encoding algorithm used to compress the source data must be provided to the decoding algorithm so that it can use it to decompress the encoded data. In standard Huffman coding this model takes the form of a tree of variable-length codes, with the most frequent symbols located at the top of the structure and being represented by the fewest bits.

However, this code tree introduces two critical inefficiencies into an implementation of the coding scheme. Firstly, each node of the tree must store either references to its child nodes or the symbol that it represents. This is expensive in memory usage and if there is a high proportion of unique symbols in the source data then the size of the code tree can account for a significant amount of the overall encoded data. Secondly, traversing the tree is computationally costly, since it requires the algorithm to jump randomly through the structure in memory as each bit in the encoded data is read in.

Canonical Huffman codes address these two issues by generating the codes in a clear standardized format; all the codes for a given length are assigned their values sequentially. This means that instead of storing the structure of the code tree for decompression only the lengths of the codes are required, reducing the size of the encoded data. Additionally, because the codes are sequential, the decoding algorithm can be dramatically simplified so that it is computationally efficient.

Algorithm

The normal Huffman coding algorithm assigns a variable length code to every symbol in the alphabet. More frequently used symbols will be assigned a shorter code. For example, suppose we have the following non-canonical codebook:

A = 11 B = 0 C = 101 D = 100

Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. To make the code a canonical Huffman code, the codes are renumbered. The bit lengths stay the same, with the code book sorted first by codeword length and then by alphabetical value of the letter:

B = 0 A = 11 C = 101 D = 100

Each of the existing codes is replaced with a new one of the same length, using the following algorithm:

1. The first symbol in the list is assigned a codeword of the same length as its original codeword but consisting entirely of zeros.
2. Each subsequent symbol is assigned the next binary number in sequence, so that later codes are always higher in value.
3. When a longer codeword length is reached, then after incrementing, zeros are appended until the length of the new codeword matches the symbol's assigned length. This can be thought of as a left shift.

By following these three rules, the canonical version of the code book produced will be:

B = 0 A = 10 C = 110 D = 111
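
As an illustration, the following Python sketch follows these rules for the example above; the function name and data layout are assumptions made for this example, not part of the original description:

def canonical_codes(lengths):
    # lengths: mapping of symbol -> codeword bit length (from a normal Huffman code)
    # Sort first by codeword length, then alphabetically by symbol.
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    prev_len = lengths[symbols[0]]
    for sym in symbols:
        # Rule 3: when the length grows, append zeros by shifting left.
        code <<= lengths[sym] - prev_len
        codes[sym] = format(code, "0{}b".format(lengths[sym]))
        prev_len = lengths[sym]
        code += 1          # Rule 2: the next code is the previous code plus one.
    return codes

print(canonical_codes({"A": 2, "B": 1, "C": 3, "D": 3}))
# {'B': '0', 'A': '10', 'C': '110', 'D': '111'}

Rule 1 falls out naturally: the first symbol starts from code 0 and is therefore printed as the all-zero codeword of its length.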

As a fractional binary number

Another perspective on the canonical codewords is that they are the digits past the radix point (binary point) in the binary representation of a certain series. Specifically, suppose the lengths of the codewords are $l_1, \ldots, l_n$. Then the canonical codeword for symbol $i$ is the first $l_i$ binary digits past the radix point in the binary representation of

$$\sum_{j=1}^{i-1} 2^{-l_j}$$

This perspective is particularly useful in light of Kraft's inequality, which says that the sum above will always be less than or equal to 1 (since the lengths come from a prefix-free code). This shows that adding one in the algorithm above never overflows, which would otherwise create a codeword longer than intended.
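
A worked check of this view, using the canonical codebook derived above (symbols in canonical order B, A, C, D with lengths 1, 2, 3, 3):

$$\begin{aligned}
\text{B:}\quad & 0 & {}= 0.000\ldots_2 &\;\to\; \texttt{0} \\
\text{A:}\quad & 2^{-1} = 0.5 & {}= 0.100\ldots_2 &\;\to\; \texttt{10} \\
\text{C:}\quad & 2^{-1} + 2^{-2} = 0.75 & {}= 0.110\ldots_2 &\;\to\; \texttt{110} \\
\text{D:}\quad & 2^{-1} + 2^{-2} + 2^{-3} = 0.875 & {}= 0.111\ldots_2 &\;\to\; \texttt{111}
\end{aligned}$$

In each case the first $l_i$ digits past the radix point are exactly the canonical codeword assigned earlier.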

Encoding the codebook

The advantage of a canonical Huffman tree is that it can be encoded in fewer bits than an arbitrary tree.

Let us take our original Huffman codebook:

A = 11 B = 0 C = 101 D = 100

There are several ways we could encode this Huffman tree. For example, we could write each symbol followed by the number of bits and code:

('A',2,11), ('B',1,0), ('C',3,101), ('D',3,100)

Since we are listing the symbols in sequential alphabetical order, we can omit the symbols themselves, listing just the number of bits and code:

(2,11), (1,0), (3,101), (3,100)

With our canonical version we have the knowledge that the symbols are in sequential alphabetical order and that a later code will always be higher in value than an earlier one. The only parts left to transmit are the bit-lengths (number of bits) for each symbol. Note that our canonical Huffman tree always has higher values for longer bit lengths and that any symbols of the same bit length (C and D) have higher code values for higher symbols:

A = 10    (code value: 2 decimal, bits: 2)
B = 0     (code value: 0 decimal, bits: 1)
C = 110   (code value: 6 decimal, bits: 3)
D = 111   (code value: 7 decimal, bits: 3)

Since two-thirds of the constraints are known, only the number of bits for each symbol need be transmitted:

2, 1, 3, 3

With knowledge of the canonical Huffman algorithm, it is then possible to recreate the entire table (symbol and code values) from just the bit-lengths. Unused symbols are normally transmitted as having zero bit length.

Another efficient way of representing the codebook is to list all symbols in increasing order by their bit-lengths and to record the number of symbols for each bit-length. For the example mentioned above, the encoding becomes:

(1,1,2), ('B','A','C','D')

This means that the first symbol, B, is of length 1, then A is of length 2, and the remaining symbols, C and D, are of length 3. Since the symbols are sorted by bit-length, we can efficiently reconstruct the codebook. Pseudocode describing the reconstruction is given in the next section.

This type of encoding is advantageous when only a few symbols in the alphabet are being compressed. For example, suppose the codebook contains only 4 letters C, O, D and E, each of length 2. To represent the letter O using the previous method, we need to either add a lot of zeros:

0, 0, 2, 2, 2, 0, ... , 2, ...

or record which 4 letters we have used. Either way makes the description longer than:

(0,4), ('C','O','D','E')

The JPEG File Interchange Format uses this method of encoding, because at most 162 symbols of the 8-bit alphabet, which has a size of 256, will be in the codebook.
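
For illustration, a minimal Python sketch that reconstructs the codebook from this counts-plus-symbols form; the function name and argument layout are assumptions made for this example:

def codes_from_counts(counts, symbols):
    # counts[i]: number of symbols whose codewords are (i + 1) bits long
    # symbols:   the symbols actually used, sorted by bit length
    codes = {}
    code = 0
    idx = 0
    for length, n in enumerate(counts, start=1):
        for _ in range(n):
            codes[symbols[idx]] = format(code, "0{}b".format(length))
            code += 1
            idx += 1
        code <<= 1   # move on to the next (longer) codeword length
    return codes

print(codes_from_counts([1, 1, 2], ["B", "A", "C", "D"]))
# {'B': '0', 'A': '10', 'C': '110', 'D': '111'}
print(codes_from_counts([0, 4], ["C", "O", "D", "E"]))
# {'C': '00', 'O': '01', 'D': '10', 'E': '11'}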

Pseudocode

Given a list of symbols sorted by bit-length, the following pseudocode will print a canonical Huffman code book:

code := 0
while more symbols do
    print symbol, code
    code := (code + 1) << ((bit length of the next symbol) − (current bit length))
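
A direct Python transcription of this pseudocode, offered as an illustrative sketch (the function name is invented for this example):

def print_canonical(symbol_lengths):
    # symbol_lengths: list of (symbol, bit length) pairs, already sorted by
    # bit length (and, within a length, in symbol order).
    code = 0
    for i, (symbol, length) in enumerate(symbol_lengths):
        print(symbol, format(code, "0{}b".format(length)))
        if i + 1 < len(symbol_lengths):
            next_length = symbol_lengths[i + 1][1]
            code = (code + 1) << (next_length - length)

print_canonical([("B", 1), ("A", 2), ("C", 3), ("D", 3)])
# B 0
# A 10
# C 110
# D 111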


The codeword lengths themselves are produced by the ordinary Huffman algorithm, stated below for a general code base D:

algorithm compute huffman code is
    input:  message ensemble (set of (message, probability)), base D
    output: code ensemble (set of (message, code))

    1- sort the message ensemble by decreasing probability.
    2- N is the cardinal of the message ensemble (the number of different messages).
    3- compute the integer n₀ such that 2 ≤ n₀ ≤ D and (N − n₀)/(D − 1) is an integer.
    4- select the n₀ least probable messages, and assign them each a digit code.
    5- substitute the selected messages by a composite message summing their probability, and re-order it.
    6- while there remains more than one message, do steps 7 through 8.
    7-    select the D least probable messages, and assign them each a digit code.
    8-    substitute the selected messages by a composite message summing their probability, and re-order it.
    9- the code of each message is given by the concatenation of the code digits of the aggregate they have been put in.

[1] [2]
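
For the common binary case (D = 2), the codeword lengths can be computed along the following lines; this is an illustrative Python sketch using the standard heapq module, not code from the references, and the frequencies are made up so that they reproduce the example lengths used above:

import heapq

def huffman_code_lengths(freqs):
    # freqs: mapping of symbol -> frequency (or probability)
    # Returns a mapping of symbol -> codeword bit length.
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Each heap entry: (total weight, tie-breaker, list of (symbol, depth)).
    heap = [(w, i, [(s, 0)]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = [(s, d + 1) for s, d in a + b]
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return dict(heap[0][2])

lengths = huffman_code_lengths({"A": 10, "B": 15, "C": 5, "D": 4})
# lengths == {'B': 1, 'A': 2, 'C': 3, 'D': 3}, which the canonical procedure
# above then renumbers to B = 0, A = 10, C = 110, D = 111.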

References

  1. Huffman, David A. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the I.R.E. The algorithm above is described in this paper.
  2. Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. Managing Gigabytes. A book with an implementation of canonical Huffman codes for word dictionaries.