Package-merge algorithm

The package-merge algorithm is an O(nL)-time algorithm for finding an optimal length-limited Huffman code for a given distribution on a given alphabet of size n, where no code word is longer than L. It is a greedy algorithm, and a generalization of Huffman's original algorithm. Package-merge works by reducing the code construction problem to the binary coin collector's problem. [1]

The coin collector's problem

Suppose a coin collector has a number of coins of various denominations, each of which has a numismatic value unrelated to its denomination. The coin collector has run out of money and needs to use some of his coin collection to buy something of cost N. He wishes to select a subset of coins of minimum total numismatic value whose denominations sum to N.

The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.

Description of the package-merge algorithm

Assume that the largest denomination is 1 dollar, and that N is an integer. (The algorithm works even if these assumptions do not hold, by trivial modifications.) The coin collector first separates his coins into lists, one for each denomination, sorted by numismatic value. He then packages the smallest denomination coins in pairs, starting from the pair of smallest total numismatic value. If there is one coin left over, it will be the coin of highest numismatic value of that denomination, and it is set aside and ignored henceforth. These packages are then merged into the list of coins of the next smallest denomination, again in order of numismatic value. The items in that list are then packaged in pairs, and merged into the next smallest list, and so forth.

Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the N items of least numismatic value.

Note that the running time of the algorithm is linear in the number of coins.
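
The following is a minimal Python sketch of this procedure for the binary coin collector's problem. The function name binary_coin_collector, the encoding of a denomination 2^-k as an integer level k, and the (value, label) representation of a coin are illustrative choices for this example, not part of the original presentation.

    def binary_coin_collector(coins, N):
        """Select a minimum-value subset of coins whose denominations total N.

        `coins` maps a level k (denomination 2**-k dollars) to a list of
        (numismatic value, label) pairs; the largest denomination is one
        dollar (k = 0) and N is an integer, matching the assumptions above.
        Returns the labels of the chosen coins, or None if N dollars
        cannot be reached.
        """
        items = []  # (total numismatic value, labels of the coins inside)
        for k in range(max(coins), -1, -1):
            # Merge the coins of this denomination in, keeping the list
            # sorted by numismatic value.
            items = sorted(items + [(v, [lab]) for v, lab in coins.get(k, [])])
            if k > 0:
                # Package pairs, cheapest first; an odd coin left over is
                # the most expensive one and is discarded.
                items = [(items[j][0] + items[j + 1][0],
                          items[j][1] + items[j + 1][1])
                         for j in range(0, len(items) - 1, 2)]
        # Every surviving item is worth exactly one dollar; the N of least
        # numismatic value form the optimal selection.
        if len(items) < N:
            return None
        return [lab for _, labels in items[:N] for lab in labels]

For example, with two half-dollar coins of numismatic values 1 and 3 and a one-dollar coin of value 2, binary_coin_collector({1: [(1, 'a'), (3, 'b')], 0: [(2, 'c')]}, 1) returns ['c']: paying with the single dollar coin (value 2) is cheaper than paying with the packaged pair of half dollars (value 4).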

Reduction of length-limited Huffman coding to the coin collector's problem

Let L be the maximum length any code word is permitted to have. Let p1, …, pn be the frequencies of the symbols of the alphabet to be encoded. We first sort the symbols so that pi ≤ pi+1. Create L coins for each symbol, of denominations 2^-1, …, 2^-L, each of numismatic value pi. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total n − 1. Let hi be the number of coins of numismatic value pi selected. The optimal length-limited Huffman code will encode symbol i with a bit string of length hi. The canonical Huffman code can easily be constructed by a simple bottom-up greedy method, given that the hi are known, and this can be the basis for fast data compression. [2]
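
As an illustration, here is a sketch of this reduction in Python, reusing the binary_coin_collector sketch above; labelling each coin with its symbol index is an implementation convenience, since distinct symbols may share a frequency. A bottom-up construction of the canonical code from the resulting lengths is sketched as well. Both function names are invented for this example.

    def length_limited_code_lengths(p, L):
        """Optimal code lengths, none exceeding L, for frequencies p.

        A sketch of the reduction above; requires 2**L >= len(p),
        since otherwise no length-L code exists.
        """
        n = len(p)
        # One coin of value p[i] per symbol i at each denomination
        # 2**-1, ..., 2**-L, labelled with the symbol index.
        coins = {k: [(p[i], i) for i in range(n)] for k in range(1, L + 1)}
        h = [0] * n
        for i in binary_coin_collector(coins, n - 1):
            h[i] += 1  # h[i] = number of selected coins for symbol i
        return h

    def canonical_code(lengths):
        """Canonical codewords, as bit strings, from the code lengths."""
        order = sorted(range(len(lengths)), key=lambda i: (lengths[i], i))
        codes, code, prev = [""] * len(lengths), 0, 0
        for i in order:
            code <<= lengths[i] - prev  # extend the codeword to the new length
            codes[i] = format(code, f"0{lengths[i]}b")
            code, prev = code + 1, lengths[i]
        return codes

For instance, length_limited_code_lengths([1, 1, 2, 4], 3) yields [3, 3, 2, 1], matching the unrestricted Huffman lengths for these frequencies, and canonical_code([3, 3, 2, 1]) then yields ['110', '111', '10', '0'].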

Performance improvements and generalizations

With this reduction, the algorithm is O(nL)-time and O(nL)-space. However, the original paper, "A fast algorithm for optimal length-limited Huffman codes", shows how this can be improved to O(nL)-time and O(n)-space. The idea is to run the algorithm a first time, keeping only enough data to determine two equivalent subproblems that together are half the size of the original problem. This is done recursively, resulting in an algorithm that takes about twice as long but requires only linear space. [1]

Many other improvements have been made to the package-merge algorithm to reduce the multiplicative constant and to make it faster in special cases, such as problems with repeated pi values. [3] The package-merge approach has also been adapted to related problems such as alphabetic coding. [4]

Methods involving graph theory have been shown to have better asymptotic space complexity than the package-merge algorithm, but these have not seen as much practical application.

References

  1. Larmore, Lawrence L.; Hirschberg, Daniel S. (1990). "A fast algorithm for optimal length-limited Huffman codes". Journal of the Association for Computing Machinery. 37 (3): 464–473. doi:10.1145/79147.79150.
  2. Moffat, Alistair; Turpin, Andrew (Oct 1997). "On the implementation of minimum redundancy prefix codes". IEEE Transactions on Communications. 45 (10): 1200–1207. doi:10.1109/26.634683.
  3. Witten, Ian H.; Moffat, Alistair; Bell, Timothy Clinton (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kaufmann Publishers. ISBN 978-1-55860-570-1.
  4. Larmore, Lawrence L.; Przytycka, Teresa M. (1994). "A Fast Algorithm for Optimal Height-Limited Alphabetic Binary-Trees". SIAM Journal on Computing. 23 (6): 1283–1312. doi:10.1137/s0097539792231167.