Universal hashing

Last updated April 19, 2024

In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

Introduction

Assume we want to map keys from some universe $U$ into $m$ bins (labelled $[m]=\{0,\dots ,m-1\}$ ). The algorithm will have to handle some data set $S\subseteq U$ of $|S|=n$ keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from $S$ that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if $|U|>m\cdot n$ , since the adversary may choose $S$ to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions $H=\{h:U\to [m]\}$ is called a universal family if, $\forall x,y\in U,~x\neq y:~~|\{h\in H:h(x)=h(y)\}|\leq {\frac {|H|}{m}}$ .

In other words, any two different keys of the universe collide with probability at most $1/m$ when the hash function $h$ is drawn uniformly at random from $H$ . This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key.

Sometimes, the definition is relaxed by a constant factor, only requiring collision probability $O(1/m)$ rather than $\leq 1/m$ . This concept was introduced by Carter and Wegman^[1] in 1977, and has found numerous applications in computer science (see, for example^[2]).

If we have an upper bound of $\epsilon <1$ on the collision probability, we say that we have $\epsilon$ -almost universality. So for example, a universal family has $1/m$ -almost universality.

Many, but not all, universal families have the following stronger uniform difference property: $\forall x,y\in U,~x\neq y$ , when $h$ is drawn randomly from the family $H$ , the difference $h(x)-h(y)~{\bmod {~}}m$ is uniformly distributed in $[m]$ .

Note that the definition of universality is only concerned with whether $h(x)-h(y)=0$ , which counts collisions. The uniform difference property is stronger.

(Similarly, a universal family can be XOR universal if $\forall x,y\in U,~x\neq y$ , the value $h(x)\oplus h(y)~{\bmod {~}}m$ is uniformly distributed in $[m]$ where $\oplus$ is the bitwise exclusive or operation. This is only possible if $m$ is a power of two.)

An even stronger condition is pairwise independence: a family has this property when, $\forall x,y\in U,~x\neq y$ , the keys $x,y$ hash to any pair of hash values $z_{1},z_{2}$ with the same probability as if the hash values were perfectly random: $P(h(x)=z_{1}\land h(y)=z_{2})=1/m^{2}$ . Pairwise independence is sometimes called strong universality.

Another property is uniformity. We say that a family is uniform if all hash values are equally likely: $P(h(x)=z)=1/m$ for any hash value $z$ . Universality does not imply uniformity. However, strong universality does imply uniformity.

Given a family with the uniform difference property, one can produce a pairwise independent (strongly universal) hash family by adding a uniformly distributed random constant with values in $[m]$ to the hash functions. (Similarly, if $m$ is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform difference property and pairwise independence is sometimes not made.^[3]

For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if $H$ is a strongly universal family with $m=2^{L}$ , then the family made of the functions $h{\bmod {2^{L'}}}$ for all $h\in H$ is also strongly universal for $L'\leq L$ . Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function $h(x)=x$ is clearly universal, but the family made of the function $h(x)=x{\bmod {2^{L'}}}$ fails to be universal.

UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.^[4]^[5] In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.

Several hash table implementations are based on universal hashing. In such applications, the software typically chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function.) A survey of the fastest known universal and strongly universal hash functions for integers, vectors, and strings is given by Thorup.^[6]

Mathematical guarantees

For any fixed set $S$ of $n$ keys, using a universal family guarantees the following properties.

1. For any fixed $x$ in $S$ , the expected number of keys in the bin $h(x)$ is $n/m$ . When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key $x$ (for example a query, insertion or deletion).
2. The expected number of pairs of keys $x,y$ in $S$ with $x\neq y$ that collide ( $h(x)=h(y)$ ) is bounded above by $n(n-1)/(2m)$ , which is of order $O(n^{2}/m)$ . When the number of bins, $m$ , is chosen linear in $n$ (i.e., is determined by a function in $\Omega (n)$ ), the expected number of collisions is $O(n)$ . When hashing into $n^{2}$ bins, there are no collisions at all with probability at least a half.
3. The expected number of keys in bins with at least $t$ keys in them is bounded above by $2n/(t-2(n/m)+1)$ .^[7] Thus, if the capacity of each bin is capped to three times the average size ( $t=3n/m$ ), the total number of keys in overflowing bins is at most $O(m)$ . This only holds with a hash family whose collision probability is bounded above by $1/m$ . If a weaker definition is used, bounding it by $O(1/m)$ , this result is no longer true.^[7]
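The bound on the expected number of colliding pairs can be checked exactly for a toy instance. The following Python sketch is illustrative only: it assumes the family $h_{a,b}(x)=((ax+b)~{\bmod {~}}p)~{\bmod {~}}m$ described under "Hashing integers", and the key set, `p` and `m` are hypothetical values chosen to keep the enumeration small.

```python
from fractions import Fraction
from itertools import combinations

def expected_colliding_pairs(keys, p, m):
    """Exact expected number of colliding pairs among `keys`, averaged over
    every h_{a,b}(x) = ((a*x + b) mod p) mod m with 1 <= a < p, 0 <= b < p."""
    total = 0
    for a in range(1, p):
        for b in range(p):
            values = [((a * x + b) % p) % m for x in keys]
            total += sum(1 for u, v in combinations(values, 2) if u == v)
    return Fraction(total, (p - 1) * p)

# Hypothetical adversarial input: every key is a multiple of m.
keys = [0, 5, 10, 15, 20, 25]
p, m = 31, 5                      # p is a prime larger than every key
n = len(keys)
expected = expected_colliding_pairs(keys, p, m)
bound = Fraction(n * (n - 1), 2 * m)
print(expected, "<=", bound)
```

A fixed function such as $x~{\bmod {~}}m$ would place all six of these keys in one bin; averaging over the whole family keeps the expectation within the stated bound.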

As the above guarantees hold for any fixed set $S$ , they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.

The second and third guarantees are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some $O(n)$ number of collisions. If it observes too many collisions, it chooses another random $h$ from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.
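The rehashing loop can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the family $h_{a,b}(x)=((ax+b)~{\bmod {~}}p)~{\bmod {~}}m$ from the section on hashing integers, and the helper names, the key set and the threshold are hypothetical. The threshold is twice the bound on the expected number of colliding pairs, so by Markov's inequality each draw succeeds with probability at least one half.

```python
import random
from itertools import combinations

def count_collisions(h, keys):
    """Number of pairs of distinct keys that h maps to the same bin."""
    values = [h(x) for x in keys]
    return sum(1 for u, v in combinations(values, 2) if u == v)

def pick_hash_with_few_collisions(keys, p, m, max_collisions, rng):
    """Draw h_{a,b} at random until `keys` has at most `max_collisions`
    colliding pairs; return the function and the number of draws."""
    attempts = 0
    while True:
        attempts += 1
        a = rng.randrange(1, p)
        b = rng.randrange(0, p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % m
        if count_collisions(h, keys) <= max_collisions:
            return h, attempts

keys = list(range(0, 1000, 10))   # n = 100 hypothetical keys
p, m = 1009, 128                  # 1009 is prime and exceeds every key
n = len(keys)
max_collisions = n * (n - 1) // m # twice the bound n(n-1)/(2m), rounded down
h, attempts = pick_hash_with_few_collisions(
    keys, p, m, max_collisions, random.Random(0))
print(attempts, count_collisions(h, keys))
```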

Constructions

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").

Hashing integers

This section refers to the case of hashing integers that fit in machine words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be $\{0,\dots ,|U|-1\}$ .

The original proposal of Carter and Wegman^[1] was to pick a prime $p\geq |U|$ and define

$$h_{a,b}(x)=((ax+b)~{\bmod {~}}p)~{\bmod {~}}m$$

where $a,b$ are randomly chosen integers modulo $p$ with $a\neq 0$ . (This is a single iteration of a linear congruential generator.)
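A minimal Python sketch of this construction is given below. The function name `make_carter_wegman` and the `rng` parameter are illustrative; the caller is assumed to supply a prime `p` at least as large as the universe.

```python
import random

def make_carter_wegman(p, m, rng=random):
    """Return a random h_{a,b}(x) = ((a*x + b) mod p) mod m.
    Assumes p is a prime with p >= |U|; this is not checked here."""
    a = rng.randrange(1, p)   # a is drawn from {1, ..., p-1}, so a != 0
    b = rng.randrange(0, p)   # b is drawn from {0, ..., p-1}
    def h(x):
        return ((a * x + b) % p) % m
    return h

# Example: hash keys below 2^61 - 1 (a Mersenne prime) into 1024 bins.
h = make_carter_wegman(2**61 - 1, 1024, random.Random(1))
print([h(x) for x in range(5)])
```

Drawing a fresh pair `(a, b)` corresponds to selecting a new function from the family, which is what rehashing does.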

To see that $H=\{h_{a,b}\}$ is a universal family, note that $h(x)=h(y)$ only holds when

$$ax+b\equiv ay+b+i\cdot m{\pmod {p}}$$

for some integer $i$ between $0$ and $(p-1)/m$ . Since $p\geq |U|$ , if $x\neq y$ their difference $x-y$ is nonzero and has an inverse modulo $p$ . Solving for $a$ yields

$$a\equiv i\cdot m\cdot (x-y)^{-1}{\pmod {p}}.$$

There are $p-1$ possible choices for $a$ (since $a=0$ is excluded) and, varying $i$ in the allowed range, $\lfloor (p-1)/m\rfloor$ possible non-zero values for the right hand side. Thus the collision probability is

$$\lfloor (p-1)/m\rfloor /(p-1)\leq ((p-1)/m)/(p-1)=1/m.$$
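For small parameters the universality condition $|\{h\in H:h(x)=h(y)\}|\leq |H|/m$ can also be verified by brute force. The sketch below is only a sanity check under assumed toy values of `p` and `m`; it enumerates every pair of keys and every function $h_{a,b}$ .

```python
from itertools import combinations

def max_collision_count(p, m, universe_size):
    """Largest number of functions h_{a,b} under which some pair x != y
    of keys in {0, ..., universe_size-1} collides."""
    worst = 0
    for x, y in combinations(range(universe_size), 2):
        count = sum(
            1
            for a in range(1, p)
            for b in range(p)
            if ((a * x + b) % p) % m == ((a * y + b) % p) % m
        )
        worst = max(worst, count)
    return worst

p, m = 11, 4                       # toy values; p is prime
family_size = (p - 1) * p          # number of pairs (a, b) with a != 0
worst = max_collision_count(p, m, universe_size=p)
print(worst, "<=", family_size / m)
```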

Another way to see $H$ is a universal family is via the notion of statistical distance. Write the difference $h(x)-h(y)$ as

$$h(x)-h(y)\equiv (a(x-y)~{\bmod {~}}p){\pmod {m}}.$$

Since $x-y$ is nonzero and $a$ is uniformly distributed in $\{1,\dots ,p-1\}$ , it follows that $a(x-y)$ modulo $p$ is also uniformly distributed in $\{1,\dots ,p-1\}$ . The distribution of $(h(x)-h(y))~{\bmod {~}}m$ is thus almost uniform, up to a difference in probability of $\pm 1/p$ between the samples. As a result, the statistical distance to a uniform family is $O(m/p)$ , which becomes negligible when $p\gg m$ .

The family of simpler hash functions

$$h_{a}(x)=(ax~{\bmod {~}}p)~{\bmod {~}}m$$

is only approximately universal: $\Pr\{h_{a}(x)=h_{a}(y)\}\leq 2/m$ for all $x\neq y$ .^[1] Moreover, this analysis is nearly tight; Carter and Wegman^[1] show that $\Pr\{h_{a}(1)=h_{a}(m+1)\}\geq 2/(m+1)$ whenever $(p-1)~{\bmod {~}}m=1$ .

Avoiding modular arithmetic

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.^[8] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four^[9]). The scheme assumes the number of bins is a power of two, $m=2^{M}$ . Let $w$ be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers $a<2^{w}$ (that fit in a word of $w$ bits). To evaluate $h_{a}(x)$ , multiply $x$ by $a$ modulo $2^{w}$ and then keep the high order $M$ bits as the hash code. In mathematical notation, this is

$$h_{a}(x)=(a\cdot x\,\,{\bmod {\,}}2^{w})\,\,\mathrm {div} \,\,2^{w-M}.$$

This scheme does not satisfy the uniform difference property and is only $2/m$ -almost-universal; for any $x\neq y$ , $\Pr\{h_{a}(x)=h_{a}(y)\}\leq 2/m$ .

To understand the behavior of the hash function, notice that, if $ax{\bmod {2^{w}}}$ and $ay{\bmod {2^{w}}}$ have the same highest-order $M$ bits, then $a(x-y){\bmod {2^{w}}}$ has either all 1's or all 0's as its highest-order $M$ bits (depending on whether $ax{\bmod {2^{w}}}$ or $ay{\bmod {2^{w}}}$ is larger). Assume that the least significant set bit of $x-y$ appears on position $w-c$ . Since $a$ is a random odd integer and odd integers have inverses in the ring $Z_{2^{w}}$ , it follows that $a(x-y){\bmod {2^{w}}}$ will be uniformly distributed among $w$ -bit integers with the least significant set bit on position $w-c$ . If $c>M$ , the highest-order $M$ bits all lie above that position and are therefore uniformly distributed, so the probability that they are all 0's or all 1's is at most $2/2^{M}=2/m$ . On the other hand, if $c<M$ , then the highest-order $M$ bits of $a(x-y){\bmod {2^{w}}}$ contain both 0's and 1's, so it is certain that $h(x)\neq h(y)$ . Finally, if $c=M$ then bit $w-M$ of $a(x-y){\bmod {2^{w}}}$ is 1 and $h_{a}(x)=h_{a}(y)$ if and only if bits $w-1,\ldots ,w-M+1$ are also 1, which happens with probability $1/2^{M-1}=2/m$ .
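The multiply-shift function and its collision probability can be explored with a short Python sketch. The word size below is a toy value chosen so that all odd multipliers can be enumerated; the helper names are illustrative. The pair used is $x=2^{w-M-2}$ , $y=3x$ , for which the $2/m$ bound is attained.

```python
from fractions import Fraction

def multiply_shift(a, x, w, M):
    """h_a(x) = (a*x mod 2^w) div 2^(w-M), for an odd multiplier a < 2^w."""
    return ((a * x) & ((1 << w) - 1)) >> (w - M)

def collision_probability(x, y, w, M):
    """Exact Pr[h_a(x) == h_a(y)] over all odd multipliers a < 2^w."""
    odd_multipliers = range(1, 1 << w, 2)
    hits = sum(
        1 for a in odd_multipliers
        if multiply_shift(a, x, w, M) == multiply_shift(a, y, w, M)
    )
    return Fraction(hits, len(odd_multipliers))

w, M = 8, 3                 # toy word size; m = 2^M = 8 bins
x = 1 << (w - M - 2)        # x = 2^(w-M-2) = 8
y = 3 * x                   # y = 24
print(collision_probability(x, y, w, M))   # prints 1/4, which equals 2/m
```

On real hardware with `w = 64`, the masking step is implicit in unsigned overflow, so an evaluation costs one multiplication and one shift.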

This analysis is tight, as can be shown with the example $x=2^{w-M-2}$ and $y=3x$ . To obtain a truly universal hash function, one can use the multiply-add-shift scheme that picks higher-order bits

$$h_{a,b}(x)=((ax+b)~{\bmod {~}}2^{w+M})\,\mathrm {div} \,2^{w},$$

where $a$ is a random positive integer with $a<2^{2w}$ and $b$ is a random non-negative integer with $b<2^{2w}$ . This requires doing arithmetic on $2w$ -bit unsigned integers. This version of multiply-shift is due to Dietzfelbinger, and was later analyzed more precisely by Woelfel.^[10]
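A Python sketch of multiply-add-shift under the parameter ranges stated above follows. Python integers are unbounded, so the $2w$ -bit arithmetic is emulated with a mask; the function name and the `rng` parameter are illustrative assumptions.

```python
import random

def make_multiply_add_shift(w, M, rng=random):
    """Return a random h_{a,b}(x) = ((a*x + b) mod 2^(w+M)) div 2^w
    for w-bit keys x and m = 2^M bins, assuming M <= w."""
    a = rng.randrange(1, 1 << (2 * w))   # random positive integer, a < 2^(2w)
    b = rng.randrange(0, 1 << (2 * w))   # random non-negative integer, b < 2^(2w)
    mask = (1 << (w + M)) - 1            # keeps the low w+M bits
    def h(x):
        return ((a * x + b) & mask) >> w
    return h

# Example: 32-bit keys hashed into 2^10 = 1024 bins.
h = make_multiply_add_shift(32, 10, random.Random(3))
print(h(0), h(1), h(2**32 - 1))
```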

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector ${\bar {x}}=(x_{0},\dots ,x_{k-1})$ of $k$ machine words (integers of $w$ bits each). If $H$ is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman^[1]) also has the uniform difference property (and hence is universal):

$$h({\bar {x}})=\left(\sum _{i=0}^{k-1}h_{i}(x_{i})\right)\,{\bmod {~}}m,$$

where each $h_{i}\in H$ is chosen independently at random.

If $m$ is a power of two, one may replace summation by exclusive or.^[11]

In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift family of hash functions.^[12] Initialize the hash function with a vector ${\bar {a}}=(a_{0},\dots ,a_{k-1})$ of random odd integers on $2w$ bits each. Then if the number of bins is $m=2^{M}$ for $M\leq w$ :

$$h_{\bar {a}}({\bar {x}})=\left({\big (}\sum _{i=0}^{k-1}x_{i}\cdot a_{i}{\big )}~{\bmod {~}}2^{2w}\right)\,\,\mathrm {div} \,\,2^{2w-M}.$$
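As a rough illustration, the vector variant of multiply-shift can be written in Python as below. The function name and the `rng` parameter are assumptions of this sketch, and the $2w$ -bit accumulator is emulated with a mask.

```python
import random

def make_vector_multiply_shift(k, w, M, rng=random):
    """Hash vectors of k w-bit words into 2^M bins (M <= w) using
    h(x) = ((sum_i x_i * a_i) mod 2^(2w)) div 2^(2w-M)."""
    # One random odd 2w-bit multiplier per vector position.
    a = [rng.randrange(1, 1 << (2 * w), 2) for _ in range(k)]
    mask = (1 << (2 * w)) - 1
    def h(xs):
        if len(xs) != k:
            raise ValueError("expected a vector of length %d" % k)
        total = sum(x_i * a_i for x_i, a_i in zip(xs, a))
        return (total & mask) >> (2 * w - M)
    return h

h = make_vector_multiply_shift(k=4, w=32, M=16, rng=random.Random(4))
print(h([1, 2, 3, 4]), h([4, 3, 2, 1]))
```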

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.^[11] Initialize the hash function with a vector ${\bar {a}}=(a_{0},\dots ,a_{k-1})$ of random odd integers on $2w$ bits each. The following hash family is universal:^[13]

$$h_{\bar {a}}({\bar {x}})=\left({\Big (}\sum _{i=0}^{\lceil k/2\rceil -1}(x_{2i}+a_{2i})\cdot (x_{2i+1}+a_{2i+1}){\Big )}{\bmod {~}}2^{2w}\right)\,\,\mathrm {div} \,\,2^{2w-M}.$$
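The pairing trick can be sketched in Python as follows. This is an illustration under simplifying assumptions: the vector length `k` is taken to be even so that the words pair up exactly, giving $k/2$ products, and the names are hypothetical.

```python
import random

def make_pair_multiply_shift(k, w, M, rng=random):
    """Hash vectors of k w-bit words (k even) into 2^M bins with one
    multiplication per pair of words:
    h(x) = ((sum_i (x_2i + a_2i) * (x_2i+1 + a_2i+1)) mod 2^(2w)) div 2^(2w-M)."""
    if k % 2 != 0:
        raise ValueError("this sketch assumes an even vector length")
    a = [rng.randrange(1, 1 << (2 * w), 2) for _ in range(k)]  # odd, 2w bits
    mask = (1 << (2 * w)) - 1
    def h(xs):
        total = sum(
            (xs[i] + a[i]) * (xs[i + 1] + a[i + 1])
            for i in range(0, k, 2)       # k/2 products in total
        )
        return (total & mask) >> (2 * w - M)
    return h

h = make_pair_multiply_shift(k=4, w=32, M=16, rng=random.Random(5))
print(h([1, 2, 3, 4]))
```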

If double-precision operations are not available, one can interpret the input as a vector of half-words ( $w/2$ -bit integers). The algorithm will then use $\lceil k/2\rceil$ multiplications, where $k$ was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.

The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.^[14]

Strong universality at high speed is also possible.^[15] Initialize the hash function with a vector ${\bar {a}}=(a_{0},\dots ,a_{k})$ of random integers on $2w$ bits. Compute

$$h_{\bar {a}}({\bar {x}})^{\mathrm {strong} }=\left({\Big (}a_{0}+\sum _{i=0}^{k-1}a_{i+1}x_{i}{\Big )}{\bmod {~}}2^{2w}\right)\,\,\mathrm {div} \,\,2^{w}.$$

The result is strongly universal on $w$ bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for $w=32$ .
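A Python sketch of the strongly universal variant is shown below. It is not tuned for speed and says nothing about the cycle counts quoted above; the function name and `rng` parameter are illustrative.

```python
import random

def make_strong_vector_hash(k, w, rng=random):
    """Hash vectors of k w-bit words to w-bit values using
    h(x) = ((a_0 + sum_i a_{i+1} * x_i) mod 2^(2w)) div 2^w."""
    # k + 1 random 2w-bit coefficients a_0, ..., a_k
    a = [rng.randrange(0, 1 << (2 * w)) for _ in range(k + 1)]
    mask = (1 << (2 * w)) - 1
    def h(xs):
        if len(xs) != k:
            raise ValueError("expected a vector of length %d" % k)
        total = a[0] + sum(a[i + 1] * xs[i] for i in range(k))
        return (total & mask) >> w
    return h

h = make_strong_vector_hash(k=3, w=32, rng=random.Random(6))
print(h([10, 20, 30]))
```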

Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate $h(s)$ is just the length of $s$ . As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.^[11] Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.^[15]

Now assume we want to hash ${\bar {x}}=(x_{0},\dots ,x_{\ell })$ , where a good bound on $\ell$ is not known a priori. A universal family proposed by Dietzfelbinger et al.^[12] treats the string ${\bar {x}}$ as the coefficients of a polynomial modulo a large prime. If $x_{i}\in [u]$ , let $p\geq \max\{u,m\}$ be a prime and define:

$$h_{a}({\bar {x}})=h_{\mathrm {int} }\left({\big (}\sum _{i=0}^{\ell }x_{i}\cdot a^{\ell -i}{\big )}{\bmod {~}}p\right),$$

where $a\in [p]$ is uniformly random and $h_{\mathrm {int} }$ is chosen randomly from a universal family mapping the integer domain $[p]\mapsto [m]$ .

Using properties of modular arithmetic, the above can be computed without producing large numbers for large strings as follows:^[16]

    uint hash(String x, int a, int p)
        uint h = INITIAL_VALUE
        for (uint i = 0; i < x.length; ++i)
            h = ((h * a) + x[i]) mod p
        return h

This Rabin-Karp rolling hash is based on a linear congruential generator.^[17] The above algorithm is also known as a multiplicative hash function.^[18] In practice, the mod operator and the parameter p can be avoided altogether by simply allowing the integer to overflow, because overflow is equivalent to mod (Max-Int-Value + 1) in many programming languages. The table below shows the values chosen to initialize h and a in some popular implementations.

Implementation	INITIAL_VALUE	a
Bernstein's hash function djb2^[19]	5381	33
STLPort 4.6.2	0	5
Kernighan and Ritchie's hash function^[20]	0	31
`java.lang.String.hashCode()`^[21]	0	31
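The overflow-based variants in the table can be reproduced with a few lines of Python. Since Python integers do not overflow, the sketch masks to 32 bits explicitly; the 32-bit width and the use of `ord` on ASCII characters are assumptions made for illustration. With a fixed multiplier these functions are deterministic, so no random choice from a family is involved.

```python
def multiplicative_hash(s, initial_value, a, bits=32):
    """Compute h = h*a + c over the character codes c of s,
    wrapping around at `bits` bits to mimic unsigned overflow."""
    mask = (1 << bits) - 1
    h = initial_value
    for ch in s:
        h = (h * a + ord(ch)) & mask
    return h

print(multiplicative_hash("abc", 5381, 33))  # djb2 parameters: 193485963
print(multiplicative_hash("abc", 0, 31))     # K&R / Java parameters: 96354
```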

Consider two strings ${\bar {x}},{\bar {y}}$ and let $\ell$ be the length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length $\ell$ . A collision before applying $h_{\mathrm {int} }$ implies that $a$ is a root of the polynomial with coefficients ${\bar {x}}-{\bar {y}}$ . This polynomial has at most $\ell$ roots modulo $p$ , so the collision probability is at most $\ell /p$ . The probability of collision through the random $h_{\mathrm {int} }$ brings the total collision probability to at most ${\frac {1}{m}}+{\frac {\ell }{p}}$ . Thus, if the prime $p$ is sufficiently large compared to the length of strings hashed, the family is very close to universal (in statistical distance).
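Putting the pieces together, a randomized version of the polynomial hash might look like the following Python sketch. The choice of $h_{\mathrm {int} }$ as a Carter and Wegman function $((cv+d)~{\bmod {~}}p)~{\bmod {~}}m$ over the same prime, and the names `c` and `d`, are assumptions of this example; any universal family on $[p]$ could be substituted.

```python
import random

def make_polynomial_string_hash(p, m, rng=random):
    """Return h(x) = h_int((sum_i x_i * a^(l-i)) mod p) for sequences of
    symbols x_i < p, with the polynomial evaluated by Horner's rule."""
    a = rng.randrange(0, p)       # evaluation point, uniform in [p]
    c = rng.randrange(1, p)       # h_int(v) = ((c*v + d) mod p) mod m
    d = rng.randrange(0, p)
    def h(xs):
        acc = 0
        for x_i in xs:
            acc = (acc * a + x_i) % p
        return ((c * acc + d) % p) % m
    return h

p = 2**61 - 1                     # a Mersenne prime, far larger than any byte
h = make_polynomial_string_hash(p, 1024, random.Random(7))
print(h(b"universal"), h(b"hashing"))
```

Iterating over a `bytes` object yields integers below 256, so byte strings can be passed directly as symbol sequences.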

Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.

Avoiding modular arithmetic

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:^[11]

1. One chooses the prime $p$ to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo $p$ to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with $p=2^{61}-1$ , while the $x_{i}$ are 32-bit values.
2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the $\lceil k/16\rceil$ results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.
3. One chooses a power-of-two as the divisor, allowing arithmetic modulo $2^{w}$ to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach.
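The first trick relies on the identity $2^{61}\equiv 1{\pmod {2^{61}-1}}$ , which lets the high bits of a product be folded onto the low bits. A Python sketch of this division-free reduction is given below; the function names are illustrative, and in a language with fixed-width integers the product would need a 128-bit intermediate.

```python
MERSENNE_61 = (1 << 61) - 1   # the prime p = 2^61 - 1

def mod_mersenne_61(x):
    """Reduce 0 <= x < 2^122 modulo p = 2^61 - 1 without division."""
    # Because 2^61 = 1 (mod p), x = hi * 2^61 + lo is congruent to hi + lo.
    x = (x & MERSENNE_61) + (x >> 61)   # now x < 2^62
    x = (x & MERSENNE_61) + (x >> 61)   # now x <= 2^61
    if x >= MERSENNE_61:
        x -= MERSENNE_61
    return x

def mul_mod_mersenne_61(a, x):
    """Compute (a * x) mod p for 0 <= a, x < 2^61."""
    return mod_mersenne_61(a * x)

a, x = 1234567890123456789, 4294967295   # hypothetical multiplier and 32-bit key
print(mul_mod_mersenne_61(a, x) == (a * x) % MERSENNE_61)   # prints True
```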

References

[1] Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences. 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.
[2] Miltersen, Peter Bro. "Universal Hashing" (PDF). Archived from the original (PDF) on 24 May 2011. Retrieved 24 June 2009.
[3] Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.
[4] David Wagner, ed. "Advances in Cryptology - CRYPTO 2008". p. 145.
[5] Jean-Philippe Aumasson, Willi Meier, Raphael Phan, Luca Henzen. "The Hash Function BLAKE". 2014. p. 10.
[6] Thorup, Mikkel (2015). "High Speed Hashing for Integers and Strings". arXiv:1504.06804 [cs.DS].
[7] Baran, Ilya; Demaine, Erik D.; Pătraşcu, Mihai (2008). "Subquadratic Algorithms for 3SUM" (PDF). Algorithmica. 50 (4): 584–596. doi:10.1007/s00453-007-9036-3. S2CID 9855995.
[8] Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). "A Reliable Randomized Algorithm for the Closest-Pair Problem" (Postscript). Journal of Algorithms. 25 (1): 19–51. doi:10.1006/jagm.1997.0873. Retrieved 10 February 2011.
[9] Thorup, Mikkel (18 December 2009). "Text-book algorithms at SODA".
[10] Woelfel, Philipp (1999). Efficient Strongly Universal and Optimally Universal Hashing. Mathematical Foundations of Computer Science 1999. LNCS. Vol. 1672. pp. 262–272. doi:10.1007/3-540-48340-3_24.
[11] Thorup, Mikkel (2009). String hashing for linear probing. Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. CiteSeerX 10.1.1.215.4253. doi:10.1137/1.9781611973068.72. ISBN 978-0-89871-680-1. Section 5.3.
[12] Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). Polynomial Hash Functions Are Reliable (Extended Abstract). Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.
[13] Black, J.; Halevi, S.; Krawczyk, H.; Krovetz, T. (1999). UMAC: Fast and Secure Message Authentication (PDF). Advances in Cryptology (CRYPTO '99). Equation 1.
[14] Pătraşcu, Mihai; Thorup, Mikkel (2011). The power of simple tabulation hashing. Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11). pp. 1–10. arXiv:1011.5200. doi:10.1145/1993636.1993638. ISBN 9781450306911.
[15] Kaser, Owen; Lemire, Daniel (2013). "Strongly universal string hashing is fast". Computer Journal. 57 (11). Oxford University Press: 1624–1638. arXiv:1202.4961. doi:10.1093/comjnl/bxt070.
[16] "Hebrew University Course Slides" (PDF).
[17] Robert Uzgalis. "Library Hash Functions". 1996.
[18] Kankowski, Peter. "Hash functions: An empirical comparison".
[19] Yigit, Ozan. "String hash functions".
[20] Kernighan; Ritchie (1988). "6". The C Programming Language (2nd ed.). Prentice Hall. p. 118. ISBN 0-13-110362-8.
[21] "String (Java Platform SE 6)". docs.oracle.com. Retrieved 2015-06-10.

External links

Open Data Structures - Section 5.1.1 - Multiplicative Hashing, Pat Morin

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
