2-choice hashing

Last updated January 13, 2022

2-choice hashing, also known as 2-choice chaining, is "a variant of a hash table in which keys are added by hashing with two hash functions. The key is put in the array position with the fewer (colliding) keys. Some collision resolution scheme is needed, unless keys are kept in buckets. The average-case cost of a successful search is $O(2+(m-1)/n)$, where $m$ is the number of keys and $n$ is the size of the array. The most collisions is $\log_{2}\ln n+\theta(m/n)$ with high probability."^[1]

How it works

2-choice hashing uses two hash functions, h₁(x) and h₂(x), each of which behaves as an ordinary hash function (i.e. it maps keys from the universe into a specified range). The two hash functions should be independent of each other. With two hash functions, any key x has up to two candidate locations, given by the outputs h₁(x) and h₂(x). Although there are two hash functions, there is only one table; both hash functions map to locations in that table.

Implementation

The two key operations of an implementation are insertion and search.

Insertion: When inserting an object, the values of both hash functions are computed for it. The object is then placed in whichever of the two buckets contains fewer objects. If the buckets are equal in size, the default location is the one given by h₁(x).
Search: A search looks in both buckets (the locations given by h₁(x) and h₂(x)) for the desired value.
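
The insertion and search rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class name `TwoChoiceHashTable`, the use of Python lists as buckets, and the derivation of h₁ and h₂ by salting BLAKE2b are all assumptions made for the example.

```python
import hashlib


class TwoChoiceHashTable:
    """Minimal sketch of 2-choice hashing: one table, two hash functions."""

    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        # A single table; each slot is a bucket (a list of keys).
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key, salt):
        # Assumption for illustration: h1 and h2 are obtained by salting
        # one strong hash, so they behave as if independent.
        data = repr(key).encode()
        digest = hashlib.blake2b(data, salt=salt, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.num_buckets

    def _candidates(self, key):
        # Both hash functions map into the same table.
        return self._index(key, b"h1"), self._index(key, b"h2")

    def insert(self, key):
        i1, i2 = self._candidates(key)
        if key in self.buckets[i1] or key in self.buckets[i2]:
            return  # key already stored
        # Place the key in the bucket with fewer keys; ties go to h1.
        if len(self.buckets[i1]) <= len(self.buckets[i2]):
            self.buckets[i1].append(key)
        else:
            self.buckets[i2].append(key)

    def contains(self, key):
        # A search must inspect both candidate buckets.
        i1, i2 = self._candidates(key)
        return key in self.buckets[i1] or key in self.buckets[i2]


table = TwoChoiceHashTable(num_buckets=8)
for word in ["apple", "banana", "cherry"]:
    table.insert(word)
print(table.contains("banana"), table.contains("durian"))
```

Because a key may sit in either candidate bucket, a lookup costs two bucket scans instead of one; the payoff is that both buckets tend to be short.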

Performance

As with any hash table that stores colliding keys in buckets, worst-case performance is governed by the size of the largest bucket. Large buckets can arise for particular combinations of keys and hash functions, but this is rare. Having two hash functions, and therefore two possible locations for each key, makes large buckets even less likely.

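When n keys are stored in a table of n buckets, the expected maximum bucket size under 2-choice hashing is $\Theta(\log \log n)$, compared with $\Theta(\log n / \log \log n)$ for a single hash function. This improvement is due to the randomized load-balancing principle known as the power of two choices.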
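
The effect can be observed with a small balls-into-bins simulation. The sketch below is illustrative only: it models the hash functions as fully random choices drawn from a seeded `random.Random`, and the function name `max_load` and the parameter values are assumptions chosen for the example.

```python
import random


def max_load(num_bins, num_balls, choices, rng):
    """Return the largest bin load after placing each ball in the
    least-loaded of `choices` uniformly random bins."""
    loads = [0] * num_bins
    for _ in range(num_balls):
        candidates = [rng.randrange(num_bins) for _ in range(choices)]
        # min() returns the first minimum, so ties go to the first choice.
        best = min(candidates, key=loads.__getitem__)
        loads[best] += 1
    return max(loads) if loads else 0


rng = random.Random(2022)  # fixed seed so runs are repeatable
n = 10_000
one_choice = max_load(n, n, choices=1, rng=rng)
two_choice = max_load(n, n, choices=2, rng=rng)
print("max load with one choice:", one_choice)
print("max load with two choices:", two_choice)
```

With these parameters the two-choice maximum should come out noticeably smaller than the one-choice maximum, although the exact values depend on the seed.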

Using two hash functions offers substantial benefits over a single hash function. There is little improvement (and no change to the expected order statistics) if more than two hash functions are used: "Additional hash functions only decrease the maximum by a constant factor."^[2]
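
As a rough worked illustration of that constant factor, one can use the standard balls-into-bins bound for d choices (brought in here from that literature as an assumption; it is not stated in this form above). With d ≥ 2 hash functions, the maximum load is, with high probability,

$$\frac{\ln \ln n}{\ln d} + \Theta(m/n)$$

For d = 2 the first term equals $\log_{2}\ln n$, matching the bound quoted in the introduction. Taking n = 10^6 as an example, $\ln \ln n \approx 2.63$, so the first term is about 3.8 for d = 2, 2.4 for d = 3, and 1.9 for d = 4. Going from one hash function to two changes the order of growth, whereas each further hash function only divides an already small term by a constant.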

A form of 2-choice hashing called the two-way skewed-associative cache has been proposed for use in some CPU caches.^[3]

2-left hashing, which uses two hash tables of equal size n/2 and resolves ties asymmetrically by putting the key in the left hash table, has fewer collisions and therefore better performance than 2-choice hashing with one large hash table of size n.^[4]
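
The 2-left variant can be sketched in the same way. This is a hedged illustration under stated assumptions: the class name `TwoLeftHashTable`, the helper `_salted_hash`, the list-based buckets, and the choice that the first hash function indexes the left table while the second indexes the right table are all made up for the example.

```python
import hashlib


def _salted_hash(key, salt, size):
    # Hypothetical helper: a salted BLAKE2b digest reduced modulo the table size.
    digest = hashlib.blake2b(repr(key).encode(), salt=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big") % size


class TwoLeftHashTable:
    """Minimal sketch of 2-left hashing: two half-size tables, ties go left."""

    def __init__(self, total_buckets=16):
        half = total_buckets // 2
        self.left = [[] for _ in range(half)]
        self.right = [[] for _ in range(half)]

    def _candidate_buckets(self, key):
        # One candidate bucket in each table.
        left_bucket = self.left[_salted_hash(key, b"left", len(self.left))]
        right_bucket = self.right[_salted_hash(key, b"right", len(self.right))]
        return left_bucket, right_bucket

    def insert(self, key):
        left_bucket, right_bucket = self._candidate_buckets(key)
        if key in left_bucket or key in right_bucket:
            return  # key already stored
        # Fewer keys wins; on a tie the key always goes to the left table.
        if len(left_bucket) <= len(right_bucket):
            left_bucket.append(key)
        else:
            right_bucket.append(key)

    def contains(self, key):
        left_bucket, right_bucket = self._candidate_buckets(key)
        return key in left_bucket or key in right_bucket
```

The only differences from the single-table sketch are that each hash function has its own half-size table and that ties are always broken toward the left table.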

References

[1] This article incorporates public domain material from the NIST document: Black, Paul E. "2-choice hashing". Dictionary of Algorithms and Data Structures. 2008. (accessed 2016-07-28).
[2] Paul E. Black, DADS, retrieved 29 January 2015.
[3] "Micro-Architecture".
[4] This article incorporates public domain material from the NIST document: Black, Paul E. "2-left hashing". Dictionary of Algorithms and Data Structures. 19 December 2012. (accessed 2015-09-15).
