Tabulation hashing

Last updated

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.

Contents

Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. However, it is not 4-independent. More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence.

Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including hopscotch hashing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

Method

The basic idea is as follows:

First, divide the key to be hashed into smaller "blocks" of a chosen length. Then, create a set of lookup tables, one for each block, and fill them with random values. Finally, use the tables to compute a hash value for each block, and combine all of these hashes into a final hash value using the bitwise exclusive or operation. [1]

More formally:

Let p be the number of bits in a key to be hashed, and q be the number of bits desired in an output hash function. Choose a block size rp; the choice of block size controls the tradeoff between time and memory usage, so it should be made so that the tables are not too large, e.g., so that the tables fit into the computer's cache memory. [2] Smaller blocks use less memory but slow down the hash function. Compute t = ceil(p/r), the number of r-bit blocks needed to represent a key.

Create a two-dimensional 2r × t array, T, and fill it with random q-bit numbers. Now T can be used to compute the hash value h(x) of any given key x. To do so, partition x into r-bit values, where x0 consists of the lowest r bits of x, x1 consists of the next r bits, etc. For example, if r = 8, then xi is just the ith byte of x. Then, use these r-bit and position values as indices into T, and combine the results using the exclusive or operation: [1]

h(x) = T[0][x0] ⊕ T[1][x1] ⊕ T[2][x2] ⊕ ... ⊕ T[t-1][xt-1].

Note that it is not valid to use the same table (e.g. T[0]) for each xi, since then the hash function would not be able to distinguish between strings with the same xis, but permuted differently.

Code for a typical example with r = t = 8 and q = p = 64 is given below.

// Secret table of random numbersuint64_tT[8][256];for(inti=0;i<8;i++)for(intj=0;j<256;j++)T[i][j]=getRandomUInt64();// Simple Tabulation Hash functionuint64_thash(uint64_tx){uint64_tres=0;for(inti=0;i<8;i++)res^=T[i][(char)(x>>8*i)];returnres;}

History

The first instance of tabulation hashing is Zobrist hashing, a method for hashing positions in abstract board games such as chess named after Albert Lindsey Zobrist, who published it in 1970. [3] In this method, a random bitstring is generated for each game feature such as a combination of a chess piece and a square of the chessboard. Then, to hash any game position, the bitstrings for the features of that position are combined by a bitwise exclusive or. The resulting hash value can then be used as an index into a transposition table. Because each move typically changes only a small number of game features, the Zobrist value of the position after a move can be updated quickly from the value of the position before the move, without needing to loop over all of the features of the position. [4]

Tabulation hashing in greater generality, for arbitrary binary values, was later rediscovered by Carter & Wegman (1979) and studied in more detail by Pătraşcu & Thorup (2012).

Universality

Carter & Wegman (1979) define a randomized scheme for generating hash functions to be universal if, for any two keys, the probability that they collide (that is, they are mapped to the same value as each other) is 1/m, where m is the number of values that the keys can take on. They defined a stronger property in the subsequent paper Wegman & Carter (1981): a randomized scheme for generating hash functions is k-independent if, for every k-tuple of keys, and each possible k-tuple of values, the probability that those keys are mapped to those values is 1/mk. 2-independent hashing schemes are automatically universal, and any universal hashing scheme can be converted into a 2-independent scheme by storing a random number x as part of the initialization phase of the algorithm and adding x to each hash value. Thus, universality is essentially the same as 2-independence. However, k-independence for larger values of k is a stronger property, held by fewer hashing algorithms.

As Pătraşcu & Thorup (2012) observe, tabulation hashing is 3-independent but not 4-independent. For any single key x, T[x0,0] is equally likely to take on any hash value, and the exclusive or of T[x0,0] with the remaining table values does not change this property. For any two keys x and y, x is equally likely to be mapped to any hash value as before, and there is at least one position i where xi  yi; the table value T[yi,i] is used in the calculation of h(y) but not in the calculation of h(x), so even after the value of h(x) has been determined, h(y) is equally likely to be any valid hash value. Similarly, for any three keys x, y, and z, at least one of the three keys has a position i where its value zi differs from the other two, so that even after the values of h(x) and h(y) are determined, h(z) is equally likely to be any valid hash value. [5]

However, this reasoning breaks down for four keys because there are sets of keys w, x, y, and z where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and w, x, y, and z are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation h(w) ⊕ h(x) ⊕ h(y) ⊕ h(z) = 0, whereas for a 4-independent hashing scheme the same equation would only be satisfied with probability 1/m. Therefore, tabulation hashing is not 4-independent. [5]

Application

Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which universality is sufficient. For instance, in hash chaining, the expected time per operation is proportional to the sum of collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions, and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation. [6]

However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms. For instance, for linear probing, 5-independent hash functions are strong enough to guarantee constant time operation, but there are 4-independent hash functions that fail. [7] Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing. [8]

Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used. [8]

Extensions

Although tabulation hashing as described above ("simple tabulation hashing") is only 3-independent, variations of this method can be used to obtain hash functions with much higher degrees of independence. Siegel (2004) uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive. Thorup (2013) provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way. He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter r, the number of bits per block in the partition of the keys into blocks.

Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys. Lemire (2012) studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key. However, the values from this table may be combined by a more complicated function than bitwise exclusive or. Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values T[xi] (where xi is, as before, the ith block of the input) as the coefficients of a polynomial over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial, gives a 2-independent hash function.

Mixed Tabulation

Mixed Tabulation hashing (and the less general Twisted Tabulation) were introduced by Dahlgaard and Thorup [9] as a way to strengthen the properties of Tabulation hashing while keeping nearly the same performance. Mixed tabulation can be seen as a xor'ing a "Double Tabulation" Thorup (2013) hash function with a simple tabulation hash function. This turns out to have many nice properties even when parameters are chosen to make the mixed tabulation much faster than double tabulation [10]

The idea is to pick a number and hash to bits rather than just . This gives new "derived characters" which are hashed by a second hash function and the two values are xor'ed together. Formally we have and , both simple tabulation functions. If , then the mixed tabulation hash is defined to be

The following example shows the algorithm with , and :

intD=2;uint128_tT1[8][256];uint64_tT2[D][256];// Fill tables with random valuesfor(intj=0;j<256;j++){for(inti=0;i<8;i++)T1[i][j]=getRandomUInt128();for(inti=0;i<D;i++)T2[i][j]=getRandomUInt64();}// Compute Mixed Tabulation of x with D derived charactersuint64_thash(uint64_tx){uint128_tv1v2=0;for(inti=0;i<8;i++)v1v2^=T1[i][(char)(x>>8*i)];uint64_tv1=v1v2>>64;// Take v1 from low bitsuint64_th=(uint64_t)v1v2;// Take v2 from high bitsfor(inti=0;i<D;i++)h^=T2[i][(char)(v1>>8*i)];returnh;}

Mixed Tabulation was shown in 2016 [11] to have strong concentration with regards to k-partitions, which are useful in algorithms for counting distinct elements, such as the classical method by Flajolet and Martin.

Notes

  1. 1 2 Morin (2014); Mitzenmacher & Upfal (2014).
  2. Mitzenmacher & Upfal (2014).
  3. Thorup (2013).
  4. Zobrist (1970).
  5. 1 2 Pătraşcu & Thorup (2012); Mitzenmacher & Upfal (2014).
  6. Carter & Wegman (1979).
  7. For the sufficiency of 5-independent hashing for linear probing, see Pagh, Pagh & Ružić (2009). For examples of weaker hashing schemes that fail, see Pătraşcu & Thorup (2010).
  8. 1 2 Pătraşcu & Thorup (2012).
  9. Dahlgaard, Søren, and Mikkel Thorup. "Approximately minwise independence with twisted tabulation." Scandinavian Workshop on Algorithm Theory. Springer, Cham, 2014.
  10. Aamand, Anders, Jakob Bæk Tejs Knudsen, Mathias Bæk Tejs Knudsen, Peter Michael Reichstein Rasmussen, Mikkel Thorup. "Fast hashing with strong concentration bounds." Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 2020.
  11. Dahlgaard, Søren, et al. "Hashing for statistics over k-partitions." 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015.

Related Research Articles

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter-storage addressing.

<span class="mw-page-title-main">Hash table</span> Associative array for storing key-value pairs

In computing, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a hash map.

A birthday attack is a bruteforce collision attack that exploits the mathematics behind the birthday problem in probability theory. This attack can be used to abuse communication between two or more parties. The attack depends on the higher likelihood of collisions found between random attack attempts and a fixed degree of permutations (pigeonholes). With a birthday attack, it is possible to find a collision of a hash function with chance in , with being the classical preimage resistance security with the same probability. There is a general result that quantum computers can perform birthday attacks, thus breaking collision resistance, in .

<span class="mw-page-title-main">Perfect hash function</span> Hash function without any collisions

In computer science, a perfect hash functionh for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.

A commitment scheme is a cryptographic primitive that allows one to commit to a chosen value while keeping it hidden to others, with the ability to reveal the committed value later. Commitment schemes are designed so that a party cannot change the value or statement after they have committed to it: that is, commitment schemes are binding. Commitment schemes have important applications in a number of cryptographic protocols including secure coin flipping, zero-knowledge proofs, and secure computation.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.

Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table .

<span class="mw-page-title-main">Linear probing</span> Computer programming method for hashing

Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel and first analyzed in 1963 by Donald Knuth.

Zobrist hashing is a hash function construction used in computer programs that play abstract board games, such as chess and Go, to implement transposition tables, a special kind of hash table that is indexed by a board position and used to avoid analyzing the same position more than once. Zobrist hashing is named for its inventor, Albert Lindsey Zobrist. It has also been applied as a method for recognizing substitutional alloy configurations in simulations of crystalline materials. Zobrist hashing is the first known instance of the generally useful underlying technique called tabulation hashing.

<span class="mw-page-title-main">One-way compression function</span> Cryptographic primitive

In cryptography, a one-way compression function is a function that transforms two fixed-length inputs into a fixed-length output. The transformation is "one-way", meaning that it is difficult given a particular output to compute inputs which compress to that output. One-way compression functions are not related to conventional data compression algorithms, which instead can be inverted exactly or approximately to the original data.

<span class="mw-page-title-main">Cuckoo hashing</span> Data structure hashing scheme

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as brood parasitism; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

In mathematics and computing, universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known, and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

A rolling hash is a hash function where the input is hashed in a window that moves through the input.

The leftover hash lemma is a lemma in cryptography first stated by Russell Impagliazzo, Leonid Levin, and Michael Luby.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables. Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

Mikkel Thorup is a Danish computer scientist working at University of Copenhagen. He completed his undergraduate education at Technical University of Denmark and his doctoral studies at Oxford University in 1993. From 1993 to 1998, he was at University of Copenhagen and from 1998 to 2013 he was at AT&T Labs-Research in New Jersey. Since 2013 he has been at the University of Copenhagen as a Professor and Head of Center for Efficient Algorithms and Data Structures (EADS).

Static hashing is a form of hashing where lookups are performed on a finalized dictionary set.

References

Secondary sources
Primary sources