Adaptive Huffman coding


Adaptive Huffman coding (also called Dynamic Huffman coding) is an adaptive coding technique based on Huffman coding. It permits building the code as the symbols are being transmitted, with no initial knowledge of the source distribution, which allows one-pass encoding and adaptation to changing conditions in the data. [1]


The benefit of a one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors, since a single loss can ruin the whole code, requiring error detection and correction.

Algorithms

There are a number of implementations of this method, the most notable being FGK (Faller-Gallager-Knuth) and Vitter's algorithm.

FGK Algorithm

It is an online coding technique based on Huffman coding. Having no initial knowledge of occurrence frequencies, it permits dynamically adjusting the Huffman tree as data are being transmitted. In an FGK Huffman tree, a special external node, called the 0-node, is used to identify a newly arriving character: whenever new data is encountered, the encoder outputs the path to the 0-node followed by the data itself. For a character that has been seen before, it simply outputs the path to that character's leaf in the current Huffman tree. Most importantly, the FGK Huffman tree has to be adjusted if necessary, and the frequencies of the related nodes updated. As the frequency of a datum increases, the sibling property of the Huffman tree may be violated, and the adjustment is triggered for this reason. It is accomplished by consecutive swaps of nodes, subtrees, or both: the data node is swapped with the highest-ordered node of the same frequency in the Huffman tree (or with the subtree rooted at that node). All ancestors of the node are then processed in the same manner.
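A minimal Python sketch of the swap rule just described; the flat node list, its field names, and the toy data are illustrative assumptions, not a full FGK implementation:

    # Nodes are kept in a flat list ordered by their implicit numbering
    # (index 0 = lowest-ordered node). Each entry records its weight, whether
    # it is a leaf, and the index of its parent (None for the root).
    def find_swap_partner(nodes, i):
        """Index of the highest-ordered node with the same weight as nodes[i].
        FGK swaps nodes[i] (or the subtree rooted there) with this node before
        incrementing its weight. Ancestors of nodes[i] are excluded, as in
        standard FGK implementations."""
        ancestors = set()
        j = nodes[i]["parent"]
        while j is not None:
            ancestors.add(j)
            j = nodes[j]["parent"]
        w = nodes[i]["weight"]
        candidates = [j for j in range(len(nodes))
                      if nodes[j]["weight"] == w and j != i and j not in ancestors]
        return max(candidates) if candidates else i

    # Toy tree obeying the sibling property: weights never decrease along the list.
    nodes = [
        {"weight": 0, "leaf": True,  "parent": 2},    # 0-node
        {"weight": 1, "leaf": True,  "parent": 2},    # leaf whose count is about to grow
        {"weight": 1, "leaf": False, "parent": 4},    # internal node (parent of 0 and 1)
        {"weight": 1, "leaf": True,  "parent": 4},    # another leaf of the same weight
        {"weight": 2, "leaf": False, "parent": None}, # root
    ]
    print(find_swap_partner(nodes, 1))  # -> 3: swap the two leaves, then increment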

Since the FGK algorithm has some drawbacks concerning its node-or-subtree swapping, Vitter proposed another algorithm to improve on it.

Vitter algorithm

Some important terminology and constraints:

Nodes of the same weight and the same type (leaf or internal) form a block; blocks are linked in increasing order of their weights.

A leaf block always precedes the internal block of the same weight in the implicit numbering, thus maintaining the invariant (see the sketch after this list).

NYT (Not Yet Transferred) is a special node used to represent symbols which are 'not yet transferred'.
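A small Python sketch of this invariant check over the implicit numbering; the list-of-dicts layout is an assumption made purely for illustration:

    # Nodes listed in increasing order of their implicit number. Vitter's
    # invariant: weights never decrease along this order, and among nodes of
    # equal weight every leaf comes before every internal node.
    def satisfies_invariant(nodes):
        for a, b in zip(nodes, nodes[1:]):
            if b["weight"] < a["weight"]:
                return False            # weights must be non-decreasing
            if a["weight"] == b["weight"] and not a["leaf"] and b["leaf"]:
                return False            # a leaf may not follow an internal node of equal weight
        return True

    # The tree reached in step 3 of the example below:
    # 252 = NYT, 253 = "b", 254 = "a", 255 = internal, 256 = root.
    step3 = [
        {"number": 252, "weight": 0, "leaf": True},
        {"number": 253, "weight": 1, "leaf": True},
        {"number": 254, "weight": 1, "leaf": True},
        {"number": 255, "weight": 1, "leaf": False},
        {"number": 256, "weight": 2, "leaf": False},
    ]
    print(satisfies_invariant(step3))  # -> True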

Figure: Slide_And_Increment(leaf node), sliding starts. P is a leaf node.
Figure: Slide_And_Increment(leaf node), sliding step 2. As P is a leaf node, it slides in front of the next block of nodes of equal weight.
Figure: Slide_And_Increment(leaf node), sliding step 3. Here we increase the current weight by 1.
Figure: Slide_And_Increment(leaf node), sliding step 4. The method comes to an end; P is the new parent.
Figure: Slide_And_Increment(internal node), sliding starts. P is an internal node.
Figure: Slide_And_Increment(internal node), sliding step 2. Node P slides in front of the next block of leaf nodes of weight wt+1.
Figure: Slide_And_Increment(internal node), sliding step 3. Now we increase the weight to 9. The invariant is maintained, since the current node is an internal node and, with its increased weight, must occur in front of leaf nodes of equal weight.
Figure: Slide_And_Increment(internal node), sliding step 4. P now points to the former parent (as the algorithm prescribes for an internal node).
algorithm for adding a symbol is
    leaf_to_increment := NULL
    p := pointer to the leaf node containing the next symbol

    if (p is NYT) then
        Extend p by adding two children
        Left child becomes new NYT and right child is the new symbol leaf node
        p := parent of new symbol leaf node
        leaf_to_increment := Right Child of p
    else
        Swap p with leader of its block
        if (new p is sibling to NYT) then
            leaf_to_increment := p
            p := parent of p

    while (p ≠ NULL) do
        Slide_And_Increment(p)

    if (leaf_to_increment != NULL) then
        Slide_And_Increment(leaf_to_increment)

function Slide_And_Increment(p) is
    previous_p := parent of p

    if (p is an internal node) then
        Slide p in the tree higher than the leaf nodes of weight wt + 1
        increase weight of p by 1
        p := previous_p
    else
        Slide p in the tree higher than the internal nodes of weight wt
        increase weight of p by 1
        p := new parent of p

The encoder and decoder start with only the root node, which has the maximum number. In the beginning it is the initial NYT node.

When we transmit a symbol for the first time (an NYT symbol), we have to transmit the code for the NYT node, followed by the generic code for the symbol itself.

For every symbol that is already in the tree, we only have to transmit the code for its leaf node.
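A minimal Python sketch of this framing rule; the hard part of the algorithm, maintaining the tree and therefore the code tables, is deliberately left out, and the tables below are copied by hand from the "abb" example that follows, so they are illustrative only:

    # A first-time symbol is sent as <current NYT code> + its 8-bit ASCII code;
    # an already-seen symbol is sent as the code of its leaf.
    steps = [
        # (symbol, current NYT code, codes of symbols already in the tree)
        ("a", "",   {}),              # empty tree: the root is NYT, its code is empty
        ("b", "0",  {"a": "1"}),      # tree after step 2
        ("b", "10", {"b": "11"}),     # tree after step 3 (only the code used here is listed)
    ]

    output = []
    for symbol, nyt_code, codes in steps:
        if symbol in codes:
            output.append(codes[symbol])                          # known symbol: leaf code
        else:
            output.append(nyt_code + format(ord(symbol), "08b"))  # new symbol: NYT code + raw byte
    print(" ".join(output))  # -> 01100001 001100010 11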

Example


Encoding "abb" gives 01100001 001100010 11.

Step 1:

Start with an empty tree.

For "a" transmit its binary code.

Step 2:

NYT spawns two child nodes: 254 and 255, both with weight 0. Increase weight for root and 255. Code for "a", associated with node 255, is 1.
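Written out from the weights and the code just given (each child labelled with the bit of its branch), the tree is now:

    256, weight 1 (root)
        0: 254, weight 0, NYT
        1: 255, weight 1, leaf "a"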

For "b" transmit 0 (for NYT node) then its binary code.

Step 3:

NYT spawns two child nodes: 252 for NYT and 253 for leaf node, both with weight 0. Increase weights for 253, 254, and root. To maintain Vitter's invariant that all leaves of weight w precede (in the implicit numbering) all internal nodes of weight w, the branch starting with node 254 should be swapped (in terms of symbols and weights, but not number ordering) with node 255. Code for "b" is 11.
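Written out in the same way, the tree after the swap is:

    256, weight 2 (root)
        0: 254, weight 1, leaf "a"
        1: 255, weight 1, internal
            0: 252, weight 0, NYT
            1: 253, weight 1, leaf "b"

Reading off the bits on the path to node 253 gives the code 11 for "b".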

For the second "b" transmit 11.

For the convenience of explanation this step doesn't exactly follow Vitter's algorithm, [2] but the effects are equivalent.

Step 4:

Go to leaf node 253. Notice that we have two blocks with weight 1: nodes 253 and 254 form one block (consisting of leaves), and node 255 forms another (consisting of internal nodes). For node 253, the biggest number in its block is 254, so swap the weights and symbols of nodes 253 and 254. Node 254 and the branch starting from node 255 now satisfy the SlideAndIncrement condition [2] and hence must be swapped. At last, increase the weights of nodes 255 and 256.
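Written out once more, the final tree is:

    256, weight 3 (root)
        0: 254, weight 1, internal
            0: 252, weight 0, NYT
            1: 253, weight 1, leaf "a"
        1: 255, weight 2, leaf "b"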

Future code for "b" is 1, and for "a" is now 01, which reflects their frequencies.
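In total, encoding "abb" therefore costs 8 + 9 + 2 = 19 bits, compared with 24 bits if each symbol were sent as plain 8-bit ASCII.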

Related Research Articles

AVL tree – Self-balancing binary search tree

In computer science, an AVL tree is a self-balancing binary search tree. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

Binary search tree – Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree. The time complexity of operations on the binary search tree is linear with respect to the height of the tree.

Binary tree – Limited form of tree data structure

In computer science, a binary tree is a tree data structure in which each node has at most two children, referred to as the left child and the right child. That is, it is a k-ary tree with k = 2. A recursive definition using set theory is that a binary tree is a tuple (L, S, R), where L and R are binary trees or the empty set and S is a singleton set containing the root.

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

Huffman coding – Technique to compress data

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

In computer science, a red–black tree is a specialised binary search tree data structure noted for fast storage and retrieval of ordered information, and a guarantee that operations will complete within a known time. Compared to other self-balancing binary search trees, the nodes in a red-black tree hold an extra bit called "color" representing "red" and "black" which is used when re-organising the tree to ensure that it is always approximately balanced.

Tree (data structure) – Abstract data type simulating a hierarchical tree structure and represented as a set of linked nodes

In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children, but must be connected to exactly one parent, except for the root node, which has no parent. These constraints mean there are no cycles or "loops", and also that each child can be treated like the root node of its own subtree, making recursion a useful technique for tree traversal. In contrast to linear data structures, many trees cannot be represented by relationships between neighboring nodes in a single straight line.

In the field of data compression, Shannon–Fano coding, named after Claude Shannon and Robert Fano, is a name given to two different but related techniques for constructing a prefix code based on a set of symbols and their probabilities.

bzip2 – File compression software

bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.

Treap

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

In computer science, tree traversal is a form of graph traversal and refers to the process of visiting each node in a tree data structure, exactly once. Such traversals are classified by the order in which the nodes are visited. The following algorithms are described for a binary tree, but they may be generalized to other trees as well.

The Wedderburn–Etherington numbers are an integer sequence named for Ivor Malcolm Haddon Etherington and Joseph Wedderburn that can be used to count certain kinds of binary trees. The first few numbers in the sequence are

In computer science, the Sethi–Ullman algorithm is an algorithm named after Ravi Sethi and Jeffrey D. Ullman, its inventors, for translating abstract syntax trees into machine code that uses as few registers as possible.

In computer science, a scapegoat tree is a self-balancing binary search tree, invented by Arne Andersson in 1989 and again by Igal Galperin and Ronald L. Rivest in 1993. It provides worst-case O(log n) lookup time and O(log n) amortized insertion and deletion time.

An AA tree in computer science is a form of balanced tree used for storing and retrieving ordered data efficiently. AA trees are named after their originator, Swedish computer scientist Arne Andersson.

k-d tree – Multidimensional search tree for points in k-dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. K-dimensional is that which concerns exactly k orthogonal axes or a space of any number of dimensions. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (for example, range searches and nearest-neighbour searches).

In computer science, a leftist tree or leftist heap is a priority queue implemented with a variant of a binary heap. Every node x has an s-value which is the distance to the nearest leaf in subtree rooted at x. In contrast to a binary heap, a leftist tree attempts to be very unbalanced. In addition to the heap property, leftist trees are maintained so the right descendant of each node has the lower s-value.

In computer science, weight-balanced binary trees (WBTs) are a type of self-balancing binary search trees that can be used to implement dynamic sets, dictionaries (maps) and sequences. These trees were introduced by Nievergelt and Reingold in the 1970s as trees of bounded balance, or BB[α] trees. Their more common name is due to Knuth.

In computer science and information theory, a canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner. Rather than storing the structure of the code tree explicitly, canonical Huffman codes are ordered in such a way that it suffices to only store the lengths of the codewords, which reduces the overhead of the codebook.

In computer science, a weak heap is a data structure for priority queues, combining features of the binary heap and binomial heap. It can be stored in an array as an implicit binary tree like a binary heap, and has the efficiency guarantees of binomial heaps.

References

  1. Ze-Nian Li; Mark S. Drew; Jiangchuan Liu (9 April 2014). Fundamentals of Multimedia. Springer Science & Business Media. ISBN 978-3-319-05290-8.
  2. "Adaptive Huffman Coding". Cs.duke.edu. Retrieved 2012-02-26.