Package-merge algorithm

The package-merge algorithm is an O(nL)-time algorithm for finding an optimal length-limited Huffman code for a given distribution on a given alphabet of size n, where no code word is longer than L. It is a greedy algorithm, and a generalization of Huffman's original algorithm. Package-merge works by reducing the code construction problem to the binary coin collector's problem. [1]

The coin collector's problem

Suppose a coin collector has a number of coins of various denominations, each of which has a numismatic value unrelated to its denomination. The coin collector has run out of money and needs to use some of his coin collection to buy something of cost N. He wishes to select a subset of coins of minimum total numismatic value whose denominations sum to N.

The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.

Description of the package-merge algorithm

Assume that the largest denomination is 1 dollar, and that N is an integer. (The algorithm works even if these assumptions do not hold, by trivial modifications.) The coin collector first separates his coins into lists, one for each denomination, sorted by numismatic value. He then packages the smallest denomination coins in pairs, starting from the pair of smallest total numismatic value. If there is one coin left over, it will be the coin of highest numismatic value of that denomination, and it is set aside and ignored henceforth. These packages are then merged into the list of coins of the next smallest denomination, again in order of numismatic value. The items in that list are then packaged in pairs, and merged into the next smallest list, and so forth.

Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the N items of least numismatic value.

Note that the running time of the algorithm is linear in the number of coins.
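
The following is a minimal Python sketch of this procedure for the binary coin collector's problem. The function name binary_coin_collector, the encoding of a denomination 2^-k as an integer level k, and the (value, label) representation of a coin are illustrative choices for this example, not part of the original presentation.

    def binary_coin_collector(coins, N):
        """Select a minimum-value subset of coins whose denominations total N.

        `coins` maps a level k (denomination 2**-k dollars) to a list of
        (numismatic value, label) pairs; the largest denomination is one
        dollar (k = 0) and N is an integer, matching the assumptions above.
        Returns the labels of the chosen coins, or None if N dollars
        cannot be reached.
        """
        items = []  # (total numismatic value, labels of the coins inside)
        for k in range(max(coins), -1, -1):
            # Merge the coins of this denomination in, keeping the list
            # sorted by numismatic value.
            items = sorted(items + [(v, [lab]) for v, lab in coins.get(k, [])])
            if k > 0:
                # Package pairs, cheapest first; an odd coin left over is
                # the most expensive one and is discarded.
                items = [(items[j][0] + items[j + 1][0],
                          items[j][1] + items[j + 1][1])
                         for j in range(0, len(items) - 1, 2)]
        # Every surviving item is worth exactly one dollar; the N of least
        # numismatic value form the optimal selection.
        if len(items) < N:
            return None
        return [lab for _, labels in items[:N] for lab in labels]

For example, with two half-dollar coins of numismatic values 1 and 3 and a one-dollar coin of value 2, binary_coin_collector({1: [(1, 'a'), (3, 'b')], 0: [(2, 'c')]}, 1) returns ['c']: paying with the single dollar coin (value 2) is cheaper than paying with the packaged pair of half dollars (value 4).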

Reduction of length-limited Huffman coding to the coin collector's problem

Let L be the maximum length any code word is permitted to have. Let p1, …, pn be the frequencies of the symbols of the alphabet to be encoded. We first sort the symbols so that pi ≤ pi+1. Create L coins for each symbol, of denominations 2^-1, …, 2^-L, each of numismatic value pi. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total n − 1. Let hi be the number of coins of numismatic value pi selected. The optimal length-limited Huffman code will encode symbol i with a bit string of length hi. The canonical Huffman code can easily be constructed by a simple bottom-up greedy method, given that the hi are known, and this can be the basis for fast data compression. [2]
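
As an illustration, here is a sketch of this reduction in Python, reusing the binary_coin_collector sketch above; labelling each coin with its symbol index is an implementation convenience, since distinct symbols may share a frequency. A bottom-up construction of the canonical code from the resulting lengths is sketched as well. Both function names are invented for this example.

    def length_limited_code_lengths(p, L):
        """Optimal code lengths, none exceeding L, for frequencies p.

        A sketch of the reduction above; requires 2**L >= len(p),
        since otherwise no length-L code exists.
        """
        n = len(p)
        # One coin of value p[i] per symbol i at each denomination
        # 2**-1, ..., 2**-L, labelled with the symbol index.
        coins = {k: [(p[i], i) for i in range(n)] for k in range(1, L + 1)}
        h = [0] * n
        for i in binary_coin_collector(coins, n - 1):
            h[i] += 1  # h[i] = number of selected coins for symbol i
        return h

    def canonical_code(lengths):
        """Canonical codewords, as bit strings, from the code lengths."""
        order = sorted(range(len(lengths)), key=lambda i: (lengths[i], i))
        codes, code, prev = [""] * len(lengths), 0, 0
        for i in order:
            code <<= lengths[i] - prev  # extend the codeword to the new length
            codes[i] = format(code, f"0{lengths[i]}b")
            code, prev = code + 1, lengths[i]
        return codes

For instance, length_limited_code_lengths([1, 1, 2, 4], 3) yields [3, 3, 2, 1], matching the unrestricted Huffman lengths for these frequencies, and canonical_code([3, 3, 2, 1]) then yields ['110', '111', '10', '0'].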

Performance improvements and generalizations

With this reduction, the algorithm is O(nL)-time and O(nL)-space. However, the original paper, "A fast algorithm for optimal length-limited Huffman codes", shows how this can be improved to O(nL)-time and O(n)-space. The idea is to run the algorithm a first time, keeping only enough data to determine two equivalent subproblems that together are half the size of the original problem. This is done recursively, resulting in an algorithm that takes about twice as long but requires only linear space. [1]

Many other improvements have been made to the package-merge algorithm to reduce the multiplicative constant and to make it faster in special cases, such as problems with repeated pi values. [3] The package-merge approach has also been adapted to related problems such as alphabetic coding. [4]

Methods involving graph theory have been shown to have better asymptotic space complexity than the package-merge algorithm, but these have not seen as much practical application.

References

  1. Larmore, Lawrence L.; Hirschberg, Daniel S. (1990). "A fast algorithm for optimal length-limited Huffman codes". Journal of the Association for Computing Machinery. 37 (3): 464–473. doi:10.1145/79147.79150.
  2. Moffat, Alistair; Turpin, Andrew (Oct 1997). "On the implementation of minimum redundancy prefix codes". IEEE Transactions on Communications. 45 (10): 1200–1207. doi:10.1109/26.634683.
  3. Witten, Ian H.; Moffat, Alistair; Bell, Timothy Clinton (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kaufmann Publishers. ISBN 978-1-55860-570-1.
  4. Larmore, Lawrence L.; Przytycka, Teresa M. (1994). "A Fast Algorithm for Optimal Height-Limited Alphabetic Binary-Trees". SIAM Journal on Computing. 23 (6): 1283–1312. doi:10.1137/s0097539792231167.