Primary clustering

Last updated August 13, 2023

In computer programming, primary clustering is a phenomenon that causes performance degradation in linear-probing hash tables. The phenomenon states that, as elements are added to a linear probing hash table, they have a tendency to cluster together into long runs (i.e., long contiguous regions of the hash table that contain no free slots). If the hash table is at a load factor of $1-1/x$ for some parameter $x\geq 2$ , then the expected length of the run containing a given element $u$ is $\Theta (x^{2})$ . This causes insertions and negative queries to take expected time $\Theta (x^{2})$ in a linear-probing hash table.

Causes of primary clustering

Primary clustering has two causes:

Winner keeps winning: The longer that a run becomes, the more likely it is to accrue additional elements. This causes a positive feedback loop that contributes to the clustering effect. However, this alone would not cause the quadratic blowup.^[1]^[2]
Joining of runs: A single insertion may not only increase the length of the run that it is in by one, but may instead connect together two runs that were already relatively long. This is what causes the quadratic blowup in expected run length.^[1]

Another way to understand primary clustering is by examining the standard deviation on the number of items that hash to a given region within the hash table.^[2] Consider a sub-region of the hash table of size $x^{2}$ . The expected number of items that hash into the region is $(1-1/x)x^{2}=x^{2}-x$ . On the other hand, the standard deviation on the number of such items is $\Theta (x)$ . It follows that, with probability $\Omega (1)$ , the number of items that hash into the region will exceed the size $x^{2}$ of the region. Intuitively, this means that regions of size $\Theta (x^{2})$ will often overflow, while larger regions typically will not. This intuition is often used as the starting point for formal analyses of primary clustering.^[2]^[3]^[4]

Effect on performance

Primary clustering causes performance degradation for both insertions and queries in a linear probing hash table. Insertions must travel to the end of a run, and therefore take expected time $\Theta (x^{2})$ .^[1] Negative queries (i.e., queries that are searching for an element that turns out not to be present) must also travel to the end of a run, and thus also take expected time $\Theta (x^{2})$ .^[1] Positive queries can terminate as soon as they find the element that they are searching for. As a result, the expected time to query a random element in the hash table is $\Theta (x)$ .^[1] However, positive queries to recently-inserted elements (e.g., an element that was just inserted) take expected time $\Theta (x^{2})$ .^[1]

These bounds also hold for linear probing with lazy deletions (i.e., using tombstones for deletions), as long as the hash table is rebuilt (and the tombstones are cleared out) semi-frequently. It suffices to perform such a rebuild at least once every $n/(2x)$ insertions.^[2]

Common misconceptions

Many textbooks describe the winner-keeps-winning effect (in which the longer a run becomes, the more likely it is to accrue additional elements) as the sole cause of primary clustering.^[5]^[6]^[7]^[8]^[9]^[10]^[11] However, as noted by Knuth,^[1] this is not the main cause of primary clustering.

Some textbooks state that the expected time for a positive query is $\Theta (x)$ ,^[11]^[12] typically citing Knuth.^[1] This is true for a query to a random element. Some positive queries may have much larger expected running times, however. For example, if one inserts an element and then immediately queries that element, the query will take the same amount of time as did the insertion, which is $\Theta (x^{2})$ in expectation.

Techniques for avoiding primary clustering

Ordered linear probing^[13] (often referred to as Robin Hood hashing^[14]) is a technique for reducing the effects of primary clustering on queries. Ordered linear probing sorts the elements within each run by their hash. Thus, a query can terminate as soon as it encounters any element whose hash is larger than that of the element being queried. This results in both positive and negative queries taking expected time $O(x)$ .

Graveyard hashing is a variant of ordered linear probing that eliminates the asymptotic effects of primary clustering for all operations.^[2] Graveyard hashing strategically leaves gaps within runs that future insertions can make use of. These gaps, which can be thought of as tombstones (like those created by lazy deletions), are inserted into the table during semi-regular rebuilds. The gaps then speed up the insertions that take place until the next semi-regular rebuild occurs. Every operation in a graveyard hash table takes expected time $O(x)$ .

Many sources recommend the use of quadratic probing as an alternative to linear probing that empirically avoids the effects of primary clustering.

Related Research Articles

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

In algebra, a quadratic equation is any equation that can be rearranged in standard form as

In mathematics, a square root of a number $x$ is a number $y$ such that $; in other words, a number y whose square is x . For example, 4 and -4 are square roots of 16 because .$

In computer science, selection sort is an in-place comparison sorting algorithm. It has an O(n²) time complexity, which makes it inefficient on large lists, and generally performs worse than the similar insertion sort. Selection sort is noted for its simplicity and has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

In mathematical optimization and decision theory, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite, in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.

A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random determined by the random bits; thus either the running time, or the output are random variables.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

<span class="mw-page-title-main">Linear probing</span> Computer programming method for hashing

Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel and first analyzed in 1963 by Donald Knuth.

Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions in hash tables. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.

In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only $keys need to be remapped on average where is the number of keys and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.$

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as brood parasitism; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

2-choice hashing, also known as 2-choice chaining, is "a variant of a hash table in which keys are added by hashing with two hash functions. The key is put in the array position with the fewer (colliding) keys. Some collision resolution scheme is needed, unless keys are kept in buckets. The average-case cost of a successful search is , where $is the number of keys and is the size of the array. The most collisions is with high probability."$

In computer science, dynamic perfect hashing is a programming technique for resolving collisions in a hash table data structure. While more memory-intensive than its hash table counterparts, this technique is useful for situations where fast queries, insertions, and deletions must be made on a large set of elements.

A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set. There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and "optionally delete". The more elements are added to the set, the larger the probability of false positives.

A counting Bloom filter is a generalized data structure of Bloom filter, that is used to test whether a count number of a given element is smaller than a given threshold when a sequence of elements is given. As a generalized form of Bloom filter, false positive matches are possible, but false negatives are not – in other words, a query returns either "possibly bigger or equal than the threshold" or "definitely smaller than the threshold".

In computer science, a retrieval data structure, also known as static function, is a space-efficient dictionary-like data type composed of a collection of pairs that allows the following operations:

References

1 2 3 4 5 6 7 8 Knuth, Donald Ervin (1997). The art of computer programming, volume 3, sorting and searching. Reading, Mass.: Addison-Wesley. pp. 527–528. ISBN 0-201-89683-4. OCLC 36241708.
1 2 3 4 5 Bender, Michael A.; Kuszmaul, Bradley C.; Kuszmaul, William (February 2022). "Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering". 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). IEEE. pp. 1171–1182. doi:10.1109/focs52979.2021.00115. ISBN 978-1-6654-2055-6. S2CID 235731820.
↑ Pagh, Anna; Pagh, Rasmus; Ruzic, Milan (2007-06-11). "Linear probing with constant independence". Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. New York, NY, USA: ACM. pp. 318–327. doi:10.1145/1250790.1250839. ISBN 9781595936318. S2CID 7523004.
↑ Thorup, Mikkel; Zhang, Yin (January 2012). "Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation". SIAM Journal on Computing. 41 (2): 293–331. doi:10.1137/100800774. ISSN 0097-5397.
↑ Cormen, Thomas H. (2022). Introduction to algorithms. Charles Eric Leiserson, Ronald L. Rivest, Clifford Stein (Fourth ed.). Cambridge, Massachusetts. ISBN 978-0-262-04630-5. OCLC 1264174621.
↑ Drozdek, Adam (1995). Data structures in C. PWS Pub. Co. ISBN 0-534-93495-1. OCLC 31077222.
↑ Kruse, Robert L. (1987). Data structures and program design (2nd ed.). Englewood Cliffs, N.J.: Prentice-Hall. ISBN 0-13-195884-4. OCLC 13823328.
↑ McMillan, Michael (2014). Data structures and algorithms with JavaScript. Sebastopol, CA: O'Reilly. ISBN 978-1-4493-6493-9. OCLC 876268837.
↑ Smith, Peter, February 1- (2004). Applied data structures with C++. Sudbury, Mass.: Jones and Bartlett Publishers. ISBN 0-7637-2562-5. OCLC 53138521.
↑ Tremblay, Jean-Paul (1976). An introduction to data structures with applications. P. G. Sorenson. New York: McGraw-Hill. ISBN 0-07-065150-7. OCLC 1858301.
1 2 Handbook of Data Structures and Applications. [S.l.]: CRC PRESS. 2020. ISBN 978-0-367-57200-6. OCLC 1156995269.
↑ Sedgewick, Robert (1998). Algorithms in C (Third ed.). Reading, Massachusetts. ISBN 0-201-31452-5. OCLC 37141168.
↑ Amble, Knuth (1974). "Ordered hash tables". The Computer Journal. 17 (2): 135–142. doi: 10.1093/comjnl/17.2.135 .
↑ Celis, Pedro, Per-Ake Larson, and J. Ian Munro. "Robin hood hashing." 26th Annual Symposium on Foundations of Computer Science (sfcs 1985). IEEE, 1985.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:0-1] 1 2 3 4 5 6 7 8 Knuth, Donald Ervin (1997). The art of computer programming, volume 3, sorting and searching. Reading, Mass.: Addison-Wesley. pp. 527–528. ISBN 0-201-89683-4. OCLC 36241708.

[:1-2] 1 2 3 4 5 Bender, Michael A.; Kuszmaul, Bradley C.; Kuszmaul, William (February 2022). "Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering". 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). IEEE. pp. 1171–1182. doi:10.1109/focs52979.2021.00115. ISBN 978-1-6654-2055-6. S2CID 235731820.

[3] Pagh, Anna; Pagh, Rasmus; Ruzic, Milan (2007-06-11). "Linear probing with constant independence". Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. New York, NY, USA: ACM. pp. 318–327. doi:10.1145/1250790.1250839. ISBN 9781595936318. S2CID 7523004.

[4] Thorup, Mikkel; Zhang, Yin (January 2012). "Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation". SIAM Journal on Computing. 41 (2): 293–331. doi:10.1137/100800774. ISSN 0097-5397.

[5] Cormen, Thomas H. (2022). Introduction to algorithms. Charles Eric Leiserson, Ronald L. Rivest, Clifford Stein (Fourth ed.). Cambridge, Massachusetts. ISBN 978-0-262-04630-5. OCLC 1264174621.

[6] Drozdek, Adam (1995). Data structures in C. PWS Pub. Co. ISBN 0-534-93495-1. OCLC 31077222.

[7] Kruse, Robert L. (1987). Data structures and program design (2nd ed.). Englewood Cliffs, N.J.: Prentice-Hall. ISBN 0-13-195884-4. OCLC 13823328.

[8] McMillan, Michael (2014). Data structures and algorithms with JavaScript. Sebastopol, CA: O'Reilly. ISBN 978-1-4493-6493-9. OCLC 876268837.

[9] Smith, Peter, February 1- (2004). Applied data structures with C++. Sudbury, Mass.: Jones and Bartlett Publishers. ISBN 0-7637-2562-5. OCLC 53138521.

[10] Tremblay, Jean-Paul (1976). An introduction to data structures with applications. P. G. Sorenson. New York: McGraw-Hill. ISBN 0-07-065150-7. OCLC 1858301.

[:2-11] 1 2 Handbook of Data Structures and Applications. [S.l.]: CRC PRESS. 2020. ISBN 978-0-367-57200-6. OCLC 1156995269.

[12] Sedgewick, Robert (1998). Algorithms in C (Third ed.). Reading, Massachusetts. ISBN 0-201-31452-5. OCLC 37141168.

[13] Amble, Knuth (1974). "Ordered hash tables". The Computer Journal. 17 (2): 135–142. doi: 10.1093/comjnl/17.2.135 .

[14] Celis, Pedro, Per-Ake Larson, and J. Ian Munro. "Robin hood hashing." 26th Annual Symposium on Foundations of Computer Science (sfcs 1985). IEEE, 1985.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]