Perfect hash function

Last updated
A perfect hash function for the four names shown Hash table 4 1 1 0 0 0 0 LL.svg
A perfect hash function for the four names shown
A minimal perfect hash function for the four names shown Hash table 4 1 0 0 0 0 0 LL.svg
A minimal perfect hash function for the four names shown

In computer science, a perfect hash functionh for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.

Contents

Perfect hash functions may be used to implement a lookup table with constant worst-case access time. A perfect hash function can, as any hash function, be used to implement hash tables, with the advantage that no collision resolution has to be implemented. In addition, if the keys are not in the data and if it is known that queried keys will be valid, then the keys do not need to be stored in the lookup table, saving space.

Disadvantages of perfect hash functions are that S needs to be known for the construction of the perfect hash function. Non-dynamic perfect hash functions need to be re-constructed if S changes. For frequently changing S dynamic perfect hash functions may be used at the cost of additional space. [1] The space requirement to store the perfect hash function is in O(n) where n is the number of keys in the structure.

The important performance parameters for perfect hash functions are the evaluation time, which should be constant, the construction time, and the representation size.

Application

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case. [2] With perfect hashing, the associated data can be read or written with a single access to the table. [3]

Performance of perfect hash functions

The important performance parameters for perfect hashing are the representation size, the evaluation time, the construction time, and additionally the range requirement . [4] The evaluation time can be as fast as O(1), which is optimal. [2] [4] The construction time needs to be at least O(n), because each element in S needs to be considered, and S contains n elements. This lower bound can be achieved in practice. [4]

The lower bound for the representation size depends on m and n. Let m = (1+ε) n and h a perfect hash function. A good approximation for the lower bound is Bits per element. For minimal perfect hashing, ε = 0, the lower bound is log e ≈ 1.44 bits per element. [4]

Construction

A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values. The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn), and a parameter k, and maps each element x of S to the index

If k is chosen randomly, this step is likely to have collisions, but the number of elements ni that are simultaneously mapped to the same index i is likely to be small. The second level of their construction assigns disjoint ranges of O(ni2) integers to each index i. It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x). [2]

As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n). Additionally, for each value of g(x), there exists a linear modular function that maps the corresponding subset of S into the range associated with that value. Both k, and the second-level functions for each value of g(x), can be found in polynomial time by choosing values randomly until finding one that works. [2]

The hash function itself requires storage space O(n) to store k, p, and all of the second-level linear modular functions. Computing the hash value of a given key x may be performed in constant time by computing g(x), looking up the second-level function associated with g(x), and applying this function to x. A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n). [2]

A more recent method for constructing a perfect hash function is described by Belazzougui, Botelho & Dietzfelbinger (2009) as "hash, displace, and compress". Here a first-level hash function g is also used to map elements onto a range of r integers. An element xS is stored in the Bucket Bg(x). [4]

Then, in descending order of size, each bucket's elements are hashed by a hash function of a sequence of independent fully random hash functions (Φ1, Φ2, Φ3, ...), starting with Φ1. If the hash function does not produce any collisions for the bucket, and the resulting values are not yet occupied by other elements from other buckets, the function is chosen for that bucket. If not, the next hash function in the sequence is tested. [4]

To evaluate the perfect hash function h(x) one only has to save the mapping σ of the bucket index g(x) onto the correct hash function in the sequence, resulting in h(x) = Φσ(g(x)). [4]

Finally, to reduce the representation size, the (σ(i))0 ≤ i < r are compressed into a form that still allows the evaluation in O(1). [4]

This approach needs linear time in n for construction, and constant evaluation time. The representation size is in O(n), and depends on the achieved range. For example, with m = 1.23n Belazzougui, Botelho & Dietzfelbinger (2009) achieved a representation size between 3.03 bits/key and 1.40 bits/key for their given example set of 10 million entries, with lower values needing a higher computation time. The space lower bound in this scenario is 0.88 bits/key. [4]

Pseudocode

algorithmhash, displace, and compressis (1) Split S into buckets Bi := g−1({i})S,0 ≤ i < r (2) Sort buckets Bi in falling order according to size |Bi| (3) Initialize array T[0...m-1] with 0's (4) for all i∈[r], in the order from (2), do (5)     for l1,2,... (6)         repeat forming Ki{Φl(x)|xBi} (6)         until |Ki|=|Bi| and Ki{j|T[j]=1}= (7)     let σ(i):= the successful l (8)     for all jKilet T[j]:=1 (9) Transform (σi)0i<r into compressed form, retaining O(1) access.

Space lower bounds

The use of O(n) words of information to store the function of Fredman, Komlós & Szemerédi (1984) is near-optimal: any perfect hash function that can be calculated in constant time requires at least a number of bits that is proportional to the size of S. [5]

For minimal perfect hash functions the information theoretic space lower bound is

bits/key. [4]

For perfect hash functions, it is first assumed that the range of h is bounded by n as m = (1+ε) n. With the formula given by Belazzougui, Botelho & Dietzfelbinger (2009) and for a universe whose size |U| = u tends towards infinity, the space lower bounds is

bits/key, minus log(n) bits overall. [4]

Extensions

Memory address identity

A trivial but pervasive example of perfect hashing is implicit in the (virtual) memory address space of a computer. Since each byte of virtual memory is a distinct, unique, directly addressable storage location, the value of the starting address where any object is stored in memory can be considered a de facto perfect hash of that object into the entire memory address range. [6]

Dynamic perfect hashing

Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set. Solutions which update the hash function any time the set is modified are known as dynamic perfect hashing, [1] but these methods are relatively complicated to implement.

Minimal perfect hash function

A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. Then h is a minimal perfect hash function if and only if h(j) = h(k) implies j = k (injectivity) and there exists an integer a such that the range of h is a..a + |S| 1. It has been proven that a general purpose minimal perfect hash scheme requires at least lg e ≈ 1.44 bits/key. [4] Although this space bound has been achieved by theoretical works, in practice, the best known minimal perfect hashing schemes require roughly 1.56 bits/key if given enough time. [7]

k-perfect hashing

A hash function is k-perfect if at most k elements from S are mapped onto the same value in the range. The "hash, displace, and compress" algorithm can be used to construct k-perfect hash functions by allowing up to k collisions. The changes necessary to accomplish this are minimal, and are underlined in the adapted pseudocode below:

(4) for all i∈[r], in the order from (2), do (5)     for l1,2,... (6)         repeat forming Ki{Φl(x)|xBi} (6)         until |Ki|=|Bi| and Ki{j|T[j]=k}= (7)     let σ(i):= the successful l (8)     for all jKisetT[j]T[j]+1

Order preservation

A minimal perfect hash function F is order preserving if keys are given in some order a1, a2, ..., an and for any keys aj and ak, j < k implies F(aj) < F(ak). [8] In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function to store a lookup table of the positions of each key. This solution uses bits, which is optimal in the setting where the comparison function for the keys may be arbitrary. [9] However, if the keys a1, a2, ..., an are integers drawn from a universe , then it is possible to construct an order-preserving hash function using only bits of space. [10] Moreover, this bound is known to be optimal. [11]

While well-dimensioned hash tables have amortized average O(1) time (amortized average constant time) for lookups, insertions, and deletion, most hash table algorithms suffer from possible worst-case times that take much longer. A worst-case O(1) time (constant time even in the worst case) would be better for many applications (including network router and memory caches). [12] :41

Few hash table algorithms support worst-case O(1) lookup time (constant lookup time even in the worst case). The few that do include: perfect hashing; dynamic perfect hashing; cuckoo hashing; hopscotch hashing; and extendible hashing. [12] :42–69

A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing. This scheme maps keys to two or more locations within a range (unlike perfect hashing which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time. [13]

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

<span class="mw-page-title-main">Hash table</span> Associative array for storing key-value pairs

In computing, a hash table, also known as a hash map, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

<span class="mw-page-title-main">Distributed hash table</span> Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.

<span class="mw-page-title-main">Cuckoo hashing</span> Data structure hashing scheme

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as brood parasitism; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

In mathematics and computing, universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known, and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

2-choice hashing, also known as 2-choice chaining, is "a variant of a hash table in which keys are added by hashing with two hash functions. The key is put in the array position with the fewer (colliding) keys. Some collision resolution scheme is needed, unless keys are kept in buckets. The average-case cost of a successful search is , where is the number of keys and is the size of the array. The most collisions is with high probability."

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

In computer science, dynamic perfect hashing is a programming technique for resolving collisions in a hash table data structure. While more memory-intensive than its hash table counterparts, this technique is useful for situations where fast queries, insertions, and deletions must be made on a large set of elements.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables. Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

In computer science, integer sorting is the algorithmic problem of sorting a collection of data values by integer keys. Algorithms designed for integer sorting may also often be applied to sorting problems in which the keys are floating point numbers, rational numbers, or text strings. The ability to perform integer arithmetic on the keys allows integer sorting algorithms to be faster than comparison sorting algorithms in many cases, depending on the details of which operations are allowed in the model of computing and how large the integers to be sorted are.

A cuckoo filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set, like a Bloom filter does. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". A cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many items and target moderately low false positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters.

In computer science, a retrieval data structure, also known as static function, is a space-efficient dictionary-like data type composed of a collection of pairs that allows the following operations:

Approximate Membership Query Filter (AMQ-Filter) is a group of space-efficient probabilistic data structures that supports approximate membership queries. An approximate membership query answers if an element is in a set or not with a false positive rate of .

References

  1. 1 2 Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (1994), "Dynamic perfect hashing: upper and lower bounds", SIAM Journal on Computing , 23 (4): 738–761, doi:10.1137/S0097539791194094, MR   1283572 .
  2. 1 2 3 4 5 Fredman, Michael L.; Komlós, János; Szemerédi, Endre (1984), "Storing a Sparse Table with O(1) Worst Case Access Time", Journal of the ACM , 31 (3): 538, doi: 10.1145/828.1884 , MR   0819156, S2CID   5399743
  3. Lu, Yi; Prabhakar, Balaji; Bonomi, Flavio (2006), "Perfect Hashing for Network Applications", 2006 IEEE International Symposium on Information Theory, pp. 2774–2778, doi:10.1109/ISIT.2006.261567, ISBN   1-4244-0505-X, S2CID   1494710
  4. 1 2 3 4 5 6 7 8 9 10 11 12 Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009), "Hash, displace, and compress" (PDF), Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7-9, 2009, Proceedings (PDF), Lecture Notes in Computer Science, vol. 5757, Berlin: Springer, pp. 682–693, CiteSeerX   10.1.1.568.130 , doi:10.1007/978-3-642-04128-0_61, MR   2557794 .
  5. Fredman, Michael L.; Komlós, János (1984), "On the size of separating systems and families of perfect hash functions", SIAM Journal on Algebraic and Discrete Methods , 5 (1): 61–68, doi:10.1137/0605009, MR   0731857 .
  6. Witold Litwin; Tadeusz Morzy; Gottfried Vossen (19 August 1998). Advances in Databases and Information Systems. Springer Science+Business Media. p. 254. ISBN   9783540649243.
  7. Esposito, Emmanuel; Mueller Graf, Thomas; Vigna, Sebastiano (2020), "RecSplit: Minimal Perfect Hashing via Recursive Splitting", 2020 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), Proceedings, pp. 175–185, arXiv: 1910.06416 , doi: 10.1137/1.9781611976007.14 .
  8. Jenkins, Bob (14 April 2009), "order-preserving minimal perfect hashing", in Black, Paul E. (ed.), Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology, retrieved 2013-03-05
  9. Fox, Edward A.; Chen, Qi Fan; Daoud, Amjad M.; Heath, Lenwood S. (July 1991), "Order-preserving minimal perfect hash functions and information retrieval" (PDF), ACM Transactions on Information Systems, New York, NY, USA: ACM, 9 (3): 281–308, doi:10.1145/125187.125200, S2CID   53239140 .
  10. Belazzougui, Djamal; Boldi, Paolo; Pagh, Rasmus; Vigna, Sebastiano (November 2008), "Theory and practice of monotone minimal perfect hashing", Journal of Experimental Algorithmics, 16, Art. no. 3.2, 26pp, doi:10.1145/1963190.2025378, S2CID   2367401 .
  11. Assadi, Sepehr; Farach-Colton, Martín; Kuszmaul, William (January 2023), "Tight Bounds for Monotone Minimal Perfect Hashing", Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Philadelphia, PA: Society for Industrial and Applied Mathematics, pp. 456–476, arXiv: 2207.10556 , doi:10.1137/1.9781611977554.ch20, ISBN   978-1-61197-755-4 , retrieved 2023-04-27
  12. 1 2 Timothy A. Davis. "Chapter 5 Hashing": subsection "Hash Tables with Worst-Case O(1) Access"
  13. Pagh, Rasmus; Rodler, Flemming Friche (2004), "Cuckoo hashing", Journal of Algorithms, 51 (2): 122–144, doi:10.1016/j.jalgor.2003.12.002, MR   2050140 .

Further reading