K-independent hashing

In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal [1] if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

Background

The goal of hashing is usually to map keys from some large domain (universe) $U$ into a smaller range, such as $m$ bins (labelled $[m] = \{0, \dots, m-1\}$). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an independent uniformly random choice in $[m]$, the number of keys per bin could be analyzed using the Chernoff bound. A deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function at random from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of $k$-independent hashing, introduced by Wegman and Carter in 1981, [2] strengthens the guarantees of random behavior to families of $k$ designated keys, and adds a guarantee on the uniform distribution of hash codes.

Definitions

The strictest definition, introduced by Wegman and Carter [2] under the name "strongly universal hash family", is the following. A family of hash functions $H = \{h : U \to [m]\}$ is $k$-independent if for any $k$ distinct keys $(x_1, \dots, x_k) \in U^k$ and any $k$ hash codes (not necessarily distinct) $(y_1, \dots, y_k) \in [m]^k$, we have:

$$\Pr_{h \in H}\bigl[h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k\bigr] = m^{-k}.$$
This definition is equivalent to the following two conditions:

  1. for any fixed $x \in U$, as $h$ is drawn randomly from $H$, $h(x)$ is uniformly distributed in $[m]$.
  2. for any fixed, distinct keys $x_1, \dots, x_k \in U$, as $h$ is drawn randomly from $H$, the values $h(x_1), \dots, h(x_k)$ are independent random variables.
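
For instance, in the pairwise case $k = 2$, the definition requires that for every pair of distinct keys $x_1 \neq x_2$ and every pair of hash codes $y_1, y_2 \in [m]$,

$$\Pr_{h \in H}\bigl[h(x_1) = y_1 \wedge h(x_2) = y_2\bigr] = \frac{1}{m^2},$$

which simultaneously forces each hash code to be uniform on $[m]$ and the two hash codes to be independent of one another.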

Often it is inconvenient to achieve the perfect joint probability of $m^{-k}$ due to rounding issues. Following Siegel, [3] one may define a $(\mu, k)$-independent family to satisfy:

for all distinct $(x_1, \dots, x_k) \in U^k$ and all $(y_1, \dots, y_k) \in [m]^k$,

$$\Pr_{h \in H}\bigl[h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k\bigr] \le \frac{\mu}{m^k}.$$

Observe that, even if $\mu$ is close to 1, the values $h(x_i)$ are no longer independent random variables, which is often a problem in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a $k$-independent family, which allows black-box use of the independence properties.
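
As a concrete illustration (not taken from the article), the small Python check below enumerates the family $h_{a,b}(x) = (ax + b) \bmod p$ over a tiny prime $p$ and verifies that it is 2-independent: every pair of hash codes for two fixed distinct keys is produced by exactly one function in the family.

```python
import itertools
from collections import Counter

# A minimal sketch (illustrative, not from the article): verify empirically that
# the family h_{a,b}(x) = (a*x + b) mod p, with a and b uniform over Z_p and
# range [p], is 2-independent.
p = 7                      # small prime so the whole family can be enumerated
x1, x2 = 2, 5              # any two fixed, distinct keys

counts = Counter()
for a, b in itertools.product(range(p), repeat=2):   # all p^2 functions in the family
    h = lambda x: (a * x + b) % p
    counts[(h(x1), h(x2))] += 1

# Each of the p^2 possible pairs (y1, y2) is hit by exactly one function,
# so Pr[h(x1)=y1 and h(x2)=y2] = 1/p^2, as the definition requires.
assert len(counts) == p * p and all(c == 1 for c in counts.values())
print("2-independent: every pair occurs with probability", 1 / (p * p))
```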

Techniques

Polynomials with random coefficients

The original technique for constructing $k$-independent hash functions, given by Carter and Wegman, was to select a large prime number $p$, choose $k$ random numbers modulo $p$, and use these numbers as the coefficients of a polynomial of degree $k - 1$ whose values modulo $p$ are used as the value of the hash function. All polynomials of the given degree modulo $p$ are equally likely, and any polynomial is uniquely determined by any $k$-tuple of argument-value pairs with distinct arguments, from which it follows that any $k$-tuple of distinct arguments is equally likely to be mapped to any $k$-tuple of hash values. [2]
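
A minimal Python sketch of this construction is shown below. The specific prime $P$, the Horner-rule evaluation, and the final reduction to a range of size $m$ are illustrative assumptions rather than details from the article; the final "mod $m$" step is exactly the source of the rounding issues discussed under Definitions.

```python
import random

# Sketch of the Carter–Wegman polynomial construction: a hash function is a
# random polynomial of degree k-1 over Z_P, evaluated at the key. The prime P
# and the range reduction below are illustrative assumptions.
P = (1 << 61) - 1          # a Mersenne prime, assumed larger than any key

def make_k_independent_hash(k, m):
    """Draw one function from a k-independent family mapping [0, P) to [0, m)."""
    coeffs = [random.randrange(P) for _ in range(k)]   # k random coefficients
    def h(x):
        # Horner's rule: ((c_{k-1} * x + c_{k-2}) * x + ...) * x + c_0, all mod P
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % P
        return acc % m     # reducing to [m] is only approximately uniform
    return h

h = make_k_independent_hash(k=4, m=1024)
print(h(42), h(43))
```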

In general the polynomial can be evaluated in any finite field. Besides the fields modulo a prime, a popular choice is a field of size $2^n$ (a binary field), which supports fast finite-field arithmetic on modern computers. This was the approach taken by Daniel Lemire and Owen Kaser for CLHash. [4]
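
The arithmetic primitive behind such binary fields is carry-less (GF(2) polynomial) multiplication, shown in the Python sketch below; a complete scheme such as CLHash additionally reduces the product modulo an irreducible polynomial and applies further steps described in [4], which are omitted here.

```python
def clmul(a, b):
    """Carry-less multiplication: multiply a and b as polynomials over GF(2),
    where the bits of each integer are the polynomial's coefficients."""
    result = 0
    while b:
        if b & 1:
            result ^= a    # XOR plays the role of addition in GF(2)
        a <<= 1
        b >>= 1
    return result

print(bin(clmul(0b101, 0b11)))   # (x^2 + 1)(x + 1) = x^3 + x^2 + x + 1
```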

Tabulation hashing

Tabulation hashing is a technique for mapping keys to hash values by partitioning each key into bytes, using each byte as the index into a table of random numbers (with a different table for each byte position), and combining the results of these table lookups by a bitwise exclusive or operation. Thus, it requires more randomness in its initialization than the polynomial method, but avoids possibly-slow multiplication operations. It is 3-independent but not 4-independent. [5] Variations of tabulation hashing can achieve higher degrees of independence by performing table lookups based on overlapping combinations of bits from the input key, or by applying simple tabulation hashing iteratively. [6] [7]
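
The following Python sketch illustrates simple tabulation hashing for 32-bit keys split into four bytes; the number of tables, their width, and the 32-bit output are illustrative assumptions.

```python
import random

# Sketch of simple tabulation hashing: one table of 256 random words per byte
# position; the key's bytes index the tables and the entries are XORed together.
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key):
    """Hash a 32-bit key by XORing one random table entry per byte."""
    h = 0
    for i in range(4):
        byte = (key >> (8 * i)) & 0xFF
        h ^= TABLES[i][byte]
    return h

print(hex(tabulation_hash(0xDEADBEEF)))
```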

Independence needed by different types of collision resolution

The notion of $k$-independence can be used to differentiate between collision resolution strategies in hash tables, according to the level of independence required to guarantee constant expected time per operation.

For instance, hash chaining takes constant expected time even with a 2-independent family of hash functions, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve probabilistic events involving two keys, 2-independence is sufficient to ensure that this sum has the same value that it would for a truly random hash function. [2]
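
Concretely, writing $n$ for the number of keys stored and $m$ for the number of bins (notation assumed here, not fixed by the article), the argument is

$$\mathbb{E}\bigl[\#\{y \neq x \text{ in the table} : h(y) = h(x)\}\bigr] \;=\; \sum_{y \neq x} \Pr[h(y) = h(x)] \;\le\; \frac{n}{m},$$

since each term involves only the two keys $x$ and $y$, for which 2-independence already gives $\Pr[h(y) = h(x)] = 1/m$.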

Double hashing is another method of hashing that requires a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both of these are 2-independent, this method gives constant expected time per operation. [8]
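
In the usual notation (assumed here), with hash functions $h_1, h_2$ and table size $m$, the $i$-th position probed for key $x$ is

$$h(x, i) = \bigl(h_1(x) + i \cdot h_2(x)\bigr) \bmod m, \qquad i = 0, 1, 2, \dots$$

so $h_1$ chooses the start of the probe sequence and $h_2$ the step size.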

On the other hand, linear probing, a simpler form of open addressing where the step size is always one, can be guaranteed to work in constant expected time per operation with a 5-independent hash function, [9] and there exist 4-independent hash functions for which it takes logarithmic time per operation. [10]

For cuckoo hashing the required degree of $k$-independence is not known as of 2021. In 2009 it was shown [11] that $O(\log n)$-independence suffices, and at least 6-independence is needed. Another approach is to use tabulation hashing, which is not 6-independent, but was shown in 2012 [12] to have other properties sufficient for cuckoo hashing. A third approach from 2014 [13] is to slightly modify the cuckoo hash table with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.

Other applications

Daniel Kane, Jelani Nelson and David Woodruff improved the Flajolet–Martin algorithm for the Distinct Elements Problem in 2010. [14] To give an $\varepsilon$-approximation to the correct answer, they need a hash function of limited independence, with the required degree of independence depending on $\varepsilon$.

The Count sketch algorithm for dimensionality reduction requires two hash functions, one 2-independent and one 4-independent.

The Karloff–Zwick algorithm for the MAX-3SAT problem can be implemented with 3-independent random variables.

The MinHash algorithm can be implemented using an $O(\log \frac{1}{\varepsilon})$-independent hash function, as was proven by Piotr Indyk in 1999. [15]

References

  1. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN   0-262-03384-4.
  2. Wegman, Mark N.; Carter, J. Lawrence (1981). "New hash functions and their use in authentication and set equality" (PDF). Journal of Computer and System Sciences. 22 (3): 265–279. doi:10.1016/0022-0000(81)90033-7. Conference version in FOCS'79. Retrieved 9 February 2011.
  3. Siegel, Alan (2004). "On universal classes of extremely random constant-time hash functions and their time-space tradeoff" (PDF). SIAM Journal on Computing. 33 (3): 505–543. doi:10.1137/S0097539701386216. Conference version in FOCS'89.
  4. Lemire, Daniel, and Owen Kaser. "Faster 64-bit universal hashing using carry-less multiplications." Journal of Cryptographic Engineering 6.3 (2016): 171-185.
  5. Pătraşcu, Mihai; Thorup, Mikkel (2012), "The power of simple tabulation hashing", Journal of the ACM , 59 (3): Art. 14, arXiv: 1011.5200 , doi:10.1145/2220357.2220361, MR   2946218
  6. Siegel, Alan (2004), "On universal classes of extremely random constant-time hash functions" (PDF), SIAM Journal on Computing , 33 (3): 505–543, doi:10.1137/S0097539701386216, MR   2066640, archived from the original (PDF) on 2019-02-17
  7. Thorup, M. (2013), "Simple tabulation, fast expanders, double tabulation, and high independence", Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013), pp. 90–99, arXiv: 1311.3121 , doi:10.1109/FOCS.2013.18, ISBN   978-0-7695-5135-7, MR   3246210
  8. Bradford, Phillip G.; Katehakis, Michael N. (2007), "A probabilistic study on combinatorial expanders and hashing" (PDF), SIAM Journal on Computing , 37 (1): 83–111, doi:10.1137/S009753970444630X, MR   2306284, archived from the original (PDF) on 2016-01-25, retrieved 2016-01-19
  9. Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing , 39 (3): 1107–1120, arXiv: cs/0612055 , doi:10.1137/070702278, MR   2538852
  10. Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, vol. 6198, Springer, pp. 715–726, arXiv: 1302.5127 , doi:10.1007/978-3-642-14165-2_60, ISBN   978-3-642-14164-5
  11. Cohen, Jeffrey S., and Daniel M. Kane. "Bounds on the independence required for cuckoo hashing." ACM Transactions on Algorithms (2009).
  12. Pǎtraşcu, Mihai, and Mikkel Thorup. "The power of simple tabulation hashing." Journal of the ACM (JACM) 59.3 (2012): 1-50.
  13. Aumüller, Martin, Martin Dietzfelbinger, and Philipp Woelfel. "Explicit and efficient hash families suffice for cuckoo hashing with a stash." Algorithmica 70.3 (2014): 428-456.
  14. Kane, Daniel M., Jelani Nelson, and David P. Woodruff. "An optimal algorithm for the distinct elements problem." Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2010.
  15. Indyk, Piotr. "A small approximately min-wise independent family of hash functions." Journal of Algorithms 38.1 (2001): 84-90.
