A birthday attack is a brute-force collision attack that exploits the mathematics behind the birthday problem in probability theory. This attack can be used to abuse communication between two or more parties. The attack depends on the higher likelihood of collisions found between random attack attempts and a fixed degree of permutations (pigeonholes). Let $H$ be the number of possible values of a hash function, with $H = 2^l$. With a birthday attack, it is possible to find a collision of a hash function with $50\%$ chance in about $\sqrt{2^l} = 2^{l/2}$ evaluations, where $l$ is the bit length of the hash output, [1] [2] and with $2^{l-1}$ being the classical preimage resistance security with the same probability. [2] There is a general (though disputed [3] ) result that quantum computers can perform birthday attacks, thus breaking collision resistance, in $\sqrt[3]{2^l} = 2^{l/3}$. [4]
Although there are some digital signature vulnerabilities associated with the birthday attack, it cannot be used to break an encryption scheme any faster than a brute-force attack. [5] : 36
As an example, consider the scenario in which a teacher with a class of 30 students (n = 30) asks for everybody's birthday (for simplicity, ignore leap years) to determine whether any two students have the same birthday (corresponding to a hash collision as described further). Intuitively, this chance may seem small. Counter-intuitively, the probability that at least one student has the same birthday as any other student on any day is around 70% for n = 30, from the formula $1 - \frac{365!}{(365-n)! \cdot 365^n}$. [6]
If the teacher had picked a specific day (say, 16 September), then the chance that at least one student was born on that specific day is $1 - (364/365)^{30}$, about 7.9%.
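The two classroom figures can be checked numerically. The following is a minimal sketch, assuming 365 equally likely birthdays; the function names are illustrative and not part of any library.

```python
from math import prod

def p_shared_birthday(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    # Probability that all n birthdays are distinct, built as a running product.
    p_distinct = prod((days - i) / days for i in range(n))
    return 1.0 - p_distinct

def p_specific_day(n: int, days: int = 365) -> float:
    """Probability that at least one of n people was born on one fixed day."""
    return 1.0 - ((days - 1) / days) ** n

print(f"any shared birthday, n=30: {p_shared_birthday(30):.3f}")
print(f"one specific day,    n=30: {p_specific_day(30):.3f}")
```

The first value corresponds to a collision between any two inputs, the second to hitting one fixed target, which is why the two probabilities differ so widely.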
In a birthday attack, the attacker prepares many different variants of benign and malicious contracts, each having a digital signature. A pair of benign and malicious contracts with the same signature is sought. In this fictional example, suppose that the digital signature of a string is the first byte of its SHA-256 hash. The pair found is indicated in green – note that finding a pair of benign contracts (blue) or a pair of malicious contracts (red) is useless. After the victim accepts the benign contract, the attacker substitutes it with the malicious one and claims the victim signed it, as proven by the digital signature.
In the context of the birthday attack, the key variables are related to the well-known balls and bins problem in probability theory as follows.
The variable n represents the number of inputs (or attempts) being made. In the analogy of the balls and bins problem, n refers to the number of balls that are randomly thrown into H bins. Each input corresponds to throwing a ball into one of the bins (hash values).
The variable H represents the total number of possible outputs of the hash function. This is the number of unique "bins" that the balls can land in. The total number of hash outputs is often expressed as $H = 2^l$, where l is the bit length of the hash output. In the balls and bins analogy, H represents the number of bins, each corresponding to a unique hash value.
The variable l refers to the bit length of the hash function's output. Since a hash function of bit length l can produce $2^l$ unique outputs, the number of possible hash values (or bins) is $H = 2^l$.
The variable p represents the probability that a collision will occur—that is, the probability that two or more inputs (balls) will be assigned the same output (bin). In a birthday attack, p is often set to 0.5 (50%) to estimate how many inputs are needed to have a 50% chance of a collision.
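With these variables, the birthday attack can be modeled as a variation of the balls and bins problem in which n balls are thrown at random into H bins and a collision is any bin that receives more than one ball.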
Given a function $f$, the goal of the attack is to find two different inputs $x_1, x_2$ such that $f(x_1) = f(x_2)$. Such a pair $x_1, x_2$ is called a collision. The method used to find a collision is simply to evaluate the function $f$ for different input values that may be chosen randomly or pseudorandomly until the same result is found more than once. Because of the birthday problem, this method can be rather efficient. Specifically, if a function $f(x)$ yields any of $H$ different outputs with equal probability and $H$ is sufficiently large, then we expect to obtain a pair of different arguments $x_1$ and $x_2$ with $f(x_1) = f(x_2)$ after evaluating the function for about $1.25\sqrt{H}$ different arguments on average.
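The search procedure can be sketched as follows. This is an illustration only: the "hash function" here is assumed to be SHA-256 truncated to 3 bytes (so $H = 2^{24}$), and the inputs are simply the decimal strings 0, 1, 2, and so on rather than random values.

```python
import hashlib
import math
from itertools import count

def truncated_hash(data: bytes, n_bytes: int = 3) -> bytes:
    """Toy hash (assumption for illustration): first n_bytes of SHA-256."""
    return hashlib.sha256(data).digest()[:n_bytes]

def find_collision(n_bytes: int = 3):
    """Evaluate the toy hash on successive inputs until an output repeats."""
    seen = {}  # maps hash output -> the input that produced it
    for i in count():
        x = str(i).encode()
        h = truncated_hash(x, n_bytes)
        if h in seen:
            # Two different inputs with the same output, plus evaluations used.
            return seen[h], x, i + 1
        seen[h] = x

x1, x2, evaluations = find_collision()
H = 2 ** 24
print("colliding inputs:", x1, x2)
print("evaluations used:", evaluations)
print("1.25 * sqrt(H):  ", round(1.25 * math.sqrt(H)))
```

A single run gives one sample, so the number of evaluations will only be of the same order as $1.25\sqrt{H}$, not equal to it. The dictionary stores every output seen so far, so memory use grows with the number of evaluations.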
We consider the following experiment. From a set of H values we choose n values uniformly at random thereby allowing repetitions. Let p(n; H) be the probability that during this experiment at least one value is chosen more than once. This probability can be approximated as
$$p(n;H) \approx 1 - e^{-n(n-1)/(2H)} \approx 1 - e^{-n^2/(2H)}$$
where $n$ is the number of chosen values (inputs) and $H$ is the number of possible outcomes (possible hash outputs).
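To see how close the exponential approximation is, it can be compared with the exact product formula. The sketch below assumes the form $1 - e^{-n(n-1)/(2H)}$ for the approximation; the function names are illustrative.

```python
import math

def p_exact(n: int, H: int) -> float:
    """Exact probability of at least one repeat among n uniform draws from H values."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (H - i) / H
    return 1.0 - p_no_collision

def p_approx(n: int, H: int) -> float:
    """Approximation 1 - exp(-n(n-1)/(2H)); expm1 keeps precision for tiny values."""
    return -math.expm1(-n * (n - 1) / (2 * H))

for n, H in [(23, 365), (30, 365), (77_000, 2 ** 32)]:
    print(n, H, round(p_exact(n, H), 4), round(p_approx(n, H), 4))
```

The approximation improves as H grows relative to n, which is the regime that matters for hash functions.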
Let n(p; H) be the smallest number of values we have to choose, such that the probability of finding a collision is at least p. By inverting the expression above, we find the following approximation
$$n(p;H) \approx \sqrt{2H \ln\frac{1}{1-p}}$$
and assigning a 0.5 probability of collision we arrive at
$$n(0.5;H) \approx 1.1774\sqrt{H}$$
Let Q(H) be the expected number of values we have to choose before finding the first collision. This number can be approximated by
$$Q(H) \approx \sqrt{\frac{\pi}{2}H}$$
As an example, if a 64-bit hash is used, there are approximately $1.8 \times 10^{19}$ different outputs. If these are all equally probable (the best case), then it would take 'only' approximately 5 billion attempts ($5.38 \times 10^{9}$) to generate a collision using brute force. [8] This value is called the birthday bound [9] and, for l-bit codes, it can be approximated as $2^{l/2}$. [10] Other examples are as follows:
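Number of hashed elements needed for the desired probability of random collision p (2 s.f.):

| Bits | Possible outputs (H) | $p = 10^{-18}$ | $10^{-15}$ | $10^{-12}$ | $10^{-9}$ | $10^{-6}$ | 0.1% | 1% | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | $2^{16}$ (~$6.5 \times 10^{4}$) | <2 | <2 | <2 | <2 | <2 | 11 | 36 | 190 | 300 | 430 |
| 32 | $2^{32}$ (~$4.3 \times 10^{9}$) | <2 | <2 | <2 | 3 | 93 | 2900 | 9300 | 50,000 | 77,000 | 110,000 |
| 64 | $2^{64}$ (~$1.8 \times 10^{19}$) | 6 | 190 | 6100 | 190,000 | 6,100,000 | $1.9 \times 10^{8}$ | $6.1 \times 10^{8}$ | $3.3 \times 10^{9}$ | $5.1 \times 10^{9}$ | $7.2 \times 10^{9}$ |
| 96 | $2^{96}$ (~$7.9 \times 10^{28}$) | $4.0 \times 10^{5}$ | $1.3 \times 10^{7}$ | $4.0 \times 10^{8}$ | $1.3 \times 10^{10}$ | $4.0 \times 10^{11}$ | $1.3 \times 10^{13}$ | $4.0 \times 10^{13}$ | $2.1 \times 10^{14}$ | $3.3 \times 10^{14}$ | $4.7 \times 10^{14}$ |
| 128 | $2^{128}$ (~$3.4 \times 10^{38}$) | $2.6 \times 10^{10}$ | $8.2 \times 10^{11}$ | $2.6 \times 10^{13}$ | $8.2 \times 10^{14}$ | $2.6 \times 10^{16}$ | $8.3 \times 10^{17}$ | $2.6 \times 10^{18}$ | $1.4 \times 10^{19}$ | $2.2 \times 10^{19}$ | $3.1 \times 10^{19}$ |
| 192 | $2^{192}$ (~$6.3 \times 10^{57}$) | $1.1 \times 10^{20}$ | $3.7 \times 10^{21}$ | $1.1 \times 10^{23}$ | $3.5 \times 10^{24}$ | $1.1 \times 10^{26}$ | $3.5 \times 10^{27}$ | $1.1 \times 10^{28}$ | $6.0 \times 10^{28}$ | $9.3 \times 10^{28}$ | $1.3 \times 10^{29}$ |
| 256 | $2^{256}$ (~$1.2 \times 10^{77}$) | $4.8 \times 10^{29}$ | $1.5 \times 10^{31}$ | $4.8 \times 10^{32}$ | $1.5 \times 10^{34}$ | $4.8 \times 10^{35}$ | $1.5 \times 10^{37}$ | $4.8 \times 10^{37}$ | $2.6 \times 10^{38}$ | $4.0 \times 10^{38}$ | $5.7 \times 10^{38}$ |
| 384 | $2^{384}$ (~$3.9 \times 10^{115}$) | $8.9 \times 10^{48}$ | $2.8 \times 10^{50}$ | $8.9 \times 10^{51}$ | $2.8 \times 10^{53}$ | $8.9 \times 10^{54}$ | $2.8 \times 10^{56}$ | $8.9 \times 10^{56}$ | $4.8 \times 10^{57}$ | $7.4 \times 10^{57}$ | $1.0 \times 10^{58}$ |
| 512 | $2^{512}$ (~$1.3 \times 10^{154}$) | $1.6 \times 10^{68}$ | $5.2 \times 10^{69}$ | $1.6 \times 10^{71}$ | $5.2 \times 10^{72}$ | $1.6 \times 10^{74}$ | $5.2 \times 10^{75}$ | $1.6 \times 10^{76}$ | $8.8 \times 10^{76}$ | $1.4 \times 10^{77}$ | $1.9 \times 10^{77}$ |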
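Entries of this kind can be reproduced from the approximations $n(p;H) \approx \sqrt{2H \ln\frac{1}{1-p}}$ and $Q(H) \approx \sqrt{\frac{\pi}{2}H}$. The following is a small sketch with illustrative function names; it assumes uniformly distributed outputs and uses the numerically stable form `-log1p(-p)` of the logarithm.

```python
import math

def n_for_probability(p: float, H: int) -> float:
    """Approximate number of values needed for collision probability p."""
    # -log1p(-p) equals ln(1 / (1 - p)) but stays accurate for tiny p.
    return math.sqrt(2 * H * -math.log1p(-p))

def expected_tries(H: int) -> float:
    """Approximate expected number of values drawn before the first collision."""
    return math.sqrt(math.pi / 2 * H)

# 32-bit row of the table, for a few of the probability columns.
for p in (1e-6, 0.001, 0.01, 0.25, 0.5, 0.75):
    print(f"p={p:<6} n ~ {n_for_probability(p, 2 ** 32):,.0f}")

# Birthday bound for a 64-bit hash.
print(f"Q(2^64) ~ {expected_tries(2 ** 64):.3g}")
```

Rounding the printed values to two significant figures gives the corresponding table cells.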
It is easy to see that if the outputs of the function are distributed unevenly, then a collision could be found even faster. The notion of 'balance' of a hash function quantifies the resistance of the function to birthday attacks (exploiting uneven key distribution). However, determining the balance of a hash function will typically require all possible inputs to be calculated and thus is infeasible for popular hash functions such as the MD and SHA families. [12]
The subexpression $\ln\frac{1}{1-p}$ in the equation for $n(p;H)$ is not computed accurately for small $p$ when directly translated into common programming languages as `log(1/(1-p))` due to loss of significance. When `log1p` is available (as it is in C99), for example, the equivalent expression `-log1p(-p)` should be used instead. [13] If this is not done, the first column of the above table is computed as zero, and several items in the second column do not have even one correct significant digit.
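The loss of significance is easy to observe in double-precision arithmetic. The sketch below uses Python's `math.log` and `math.log1p`; the function names `n_naive` and `n_stable` are illustrative.

```python
import math

def n_naive(p: float, H: int) -> float:
    """Direct translation of the formula; loses precision when p is tiny."""
    return math.sqrt(2 * H * math.log(1 / (1 - p)))

def n_stable(p: float, H: int) -> float:
    """Same quantity using -log1p(-p), accurate for small p."""
    return math.sqrt(2 * H * -math.log1p(-p))

H = 2 ** 64
for p in (1e-18, 1e-15, 1e-12):
    print(f"p={p:g}  naive={n_naive(p, H):.4g}  stable={n_stable(p, H):.4g}")
```

For $p = 10^{-18}$ the subtraction `1 - p` rounds to exactly 1.0 in double precision, so the naive version returns zero, while the stable version gives about 6, in agreement with the 64-bit row of the table.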
A good rule of thumb which can be used for mental calculation is the relation
$$p(n) \approx \frac{n^2}{2H}$$
which can also be written as
$$H \approx \frac{n^2}{2p}$$
or
$$n \approx \sqrt{2H \times p}$$
This works well for probabilities less than or equal to 0.5.
This approximation scheme is especially easy to use when working with exponents. For instance, suppose we are building 32-bit hashes ($H = 2^{32}$) and want the chance of a collision to be at most one in a million ($p \approx 2^{-20}$). How many documents could we have at most?
$$n \approx \sqrt{2 \times 2^{32} \times 2^{-20}} = \sqrt{2^{13}} = 2^{6.5} \approx 90.5$$
which is close to the correct answer of 93.
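The same arithmetic can be checked in a few lines, assuming the rule of thumb $n \approx \sqrt{2Hp}$. The function name is illustrative.

```python
import math

def n_rule_of_thumb(p: float, H: int) -> float:
    """Rule-of-thumb estimate n ~ sqrt(2 * H * p), intended for p <= 0.5."""
    return math.sqrt(2 * H * p)

H = 2 ** 32
print(round(n_rule_of_thumb(2 ** -20, H), 1))  # power-of-two shortcut for "one in a million"
print(round(n_rule_of_thumb(1e-6, H), 1))      # using p = 10^-6 directly
```

Most of the gap between 90.5 and 93 comes from rounding one in a million to $2^{-20}$ rather than from the rule of thumb itself.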
The birthday attack is a method that exploits the mathematics of collisions in hash functions. Below, we provide upper and lower bounds for the probability of a collision, based on the analogy of the balls and bins problem, and derive key equations.
The birthday attack can be modeled as throwing n balls (inputs) into H bins (possible hash outputs). The probability of a collision is bounded above as follows:
$$\Pr[\text{collision}] \le \frac{n(n-1)}{2H}$$
This follows from the union bound, which gives an upper bound on the probability that at least one collision occurs. We denote the event that the i-th ball collides with one of the previous balls as $E_i$. Since the previous $i-1$ balls occupy at most $i-1$ bins, the probability of a collision for the i-th ball is:
$$\Pr[E_i] \le \frac{i-1}{H}$$
Thus, the total probability of a collision after throwing all n balls is bounded by:
$$\Pr[\text{collision}] \le \sum_{i=1}^{n} \Pr[E_i] \le \sum_{i=1}^{n} \frac{i-1}{H} = \frac{n(n-1)}{2H}$$
This gives the upper bound for the probability of a collision in a hash function.
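The lower bound for the probability of a collision can be derived by assuming no collision after throwing i balls, which must then all occupy different bins. The probability that the (i+1)-st ball also causes no collision is:
$$1 - \frac{i}{H}$$
The total probability of no collision after throwing all n balls is the product of these terms:
$$\Pr[\text{no collision}] = \prod_{i=1}^{n-1} \left(1 - \frac{i}{H}\right)$$
By using the inequality $1 - x \le e^{-x}$, we can bound this as:
$$\prod_{i=1}^{n-1} \left(1 - \frac{i}{H}\right) \le \prod_{i=1}^{n-1} e^{-i/H} = e^{-\frac{n(n-1)}{2H}}$$
Thus, the probability of at least one collision is bounded below by:
$$\Pr[\text{collision}] \ge 1 - e^{-\frac{n(n-1)}{2H}}$$
This provides the lower bound for the probability of a collision.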
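It follows from the above argument that the probability of at least one collision is bounded between:
$$1 - e^{-\frac{n(n-1)}{2H}} \le \Pr[\text{collision}] \le \frac{n(n-1)}{2H}$$
Letting $H = 2^l$, a collision occurs with constant probability once the number of trials, n, is on the order of:
$$n = \Theta\left(\sqrt{H}\right) = \Theta\left(2^{l/2}\right)$$
This illustrates how the number of inputs required for a collision grows as a function of the bit length of the hash output.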
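The two bounds can be checked against the exact collision probability for small parameters. The following sketch assumes the bounds $1 - e^{-n(n-1)/(2H)}$ and $n(n-1)/(2H)$; the function names are illustrative, and the upper bound is capped at 1 since it is only informative while it stays below 1.

```python
import math

def p_collision_exact(n: int, H: int) -> float:
    """Exact probability that n balls thrown into H bins produce a collision."""
    p_no_collision = 1.0
    for i in range(1, n):
        p_no_collision *= 1 - i / H
    return 1.0 - p_no_collision

def collision_bounds(n: int, H: int):
    """Return (lower, upper) bounds on the collision probability."""
    x = n * (n - 1) / (2 * H)
    lower = -math.expm1(-x)   # 1 - exp(-x), from 1 - t <= exp(-t)
    upper = min(1.0, x)       # union bound, capped at 1
    return lower, upper

for n, H in [(23, 365), (30, 365), (100, 2 ** 16)]:
    lower, upper = collision_bounds(n, H)
    print(n, H, round(lower, 4), round(p_collision_exact(n, H), 4), round(upper, 4))
```

When $n(n-1)/(2H)$ is small the two bounds nearly coincide, which is why the simple quadratic estimate works well for low collision probabilities.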
Digital signatures can be susceptible to a birthday attack or more precisely a chosen-prefix collision attack. A message $m$ is typically signed by first computing $f(m)$, where $f$ is a cryptographic hash function, and then using some secret key to sign $f(m)$. Suppose Mallory wants to trick Bob into signing a fraudulent contract. Mallory prepares a fair contract $m$ and a fraudulent one $m'$. She then finds a number of positions where $m$ can be changed without changing the meaning, such as inserting commas, empty lines, one versus two spaces after a sentence, replacing synonyms, etc. By combining these changes, she can create a huge number of variations on $m$ which are all fair contracts.
In a similar manner, Mallory also creates a huge number of variations on the fraudulent contract $m'$. She then applies the hash function to all these variations until she finds a version of the fair contract and a version of the fraudulent contract which have the same hash value, $f(m) = f(m')$. She presents the fair version to Bob for signing. After Bob has signed, Mallory takes the signature and attaches it to the fraudulent contract. This signature then "proves" that Bob signed the fraudulent contract.
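Mallory's search can be sketched on a toy scale. Everything in this example is an assumption made for illustration: the "hash" is SHA-256 truncated to 2 bytes, the contract texts are invented, and the only variation used is one versus two spaces in the first 11 gaps between words, giving $2^{11}$ variants of each contract.

```python
import hashlib
from itertools import product

def toy_hash(text: str, n_bytes: int = 2) -> bytes:
    """Toy hash (illustration only): first n_bytes of SHA-256."""
    return hashlib.sha256(text.encode()).digest()[:n_bytes]

def variants(words, n_slots: int = 11):
    """Yield versions of a text that differ only in spacing.

    Each of the first n_slots gaps between words uses one or two spaces.
    """
    for choice in product((" ", "  "), repeat=n_slots):
        parts = []
        for i, word in enumerate(words[:-1]):
            gap = choice[i] if i < n_slots else " "
            parts.append(word + gap)
        parts.append(words[-1])
        yield "".join(parts)

def find_cross_collision(fair: str, fraud: str, n_slots: int = 11):
    """Find a fair variant and a fraudulent variant with the same toy hash."""
    table = {toy_hash(v): v for v in variants(fair.split(), n_slots)}
    for v in variants(fraud.split(), n_slots):
        h = toy_hash(v)
        if h in table:
            return table[h], v
    return None  # no pair found with this many variants

FAIR = "Mallory will pay Bob one hundred dollars for the used bicycle on the first of May"
FRAUD = "Bob will pay Mallory nine thousand dollars for the used bicycle on the first of May"

pair = find_cross_collision(FAIR, FRAUD)
if pair is not None:
    fair_version, fraud_version = pair
    print(repr(fair_version))
    print(repr(fraud_version))
    print(toy_hash(fair_version).hex(), toy_hash(fraud_version).hex())
```

Only matches between the two families are accepted; two fair variants that collide with each other are of no use to Mallory. With a 16-bit toy hash and about two thousand variants on each side a match is overwhelmingly likely, whereas a real hash length would make the same search infeasible.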
The probabilities differ slightly from the original birthday problem, as Mallory gains nothing by finding two fair or two fraudulent contracts with the same hash. Mallory's strategy is to generate pairs of one fair and one fraudulent contract, so the birthday problem equations do not exactly apply here. For a given hash function, $2^l$ is the number of possible hashes, where $l$ is the bit length of the hash output. For a 50% chance of a collision, Mallory would need to generate approximately $2^{(l+1)/2}$ hashes, which is about $\sqrt{2}$ times the number required for a simple collision under the classical birthday problem.
To avoid this attack, the output length of the hash function used for a signature scheme can be chosen large enough so that the birthday attack becomes computationally infeasible, i.e. about twice as many bits as are needed to prevent an ordinary brute-force attack.
Besides using a larger bit length, the signer (Bob) can protect himself by making some random, inoffensive changes to the document before signing it, and by keeping a copy of the contract he signed in his own possession, so that he can at least demonstrate in court that his signature matches that contract, not just the fraudulent one.
Pollard's rho algorithm for logarithms is an example of an algorithm that uses a birthday attack to compute discrete logarithms.
The same fraud is possible if the signer is Mallory, not Bob. Bob could suggest a contract to Mallory for a signature. Mallory could find an inoffensively modified version of this fair contract that has the same signature as a fraudulent contract, and then provide the modified fair contract and signature to Bob. Later, Mallory could produce the fraudulent copy. If Bob does not have the inoffensively modified version of the contract (perhaps having kept only his original proposal), Mallory's fraud is perfect. If Bob does have it, Mallory can at least claim that it is Bob who is the fraudster.
Here are additional details:
1. Original Contract Proposal: Bob proposes a fair contract to Mallory, expecting her to sign it.
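2. Mallory's Modified and Fraudulent Contracts: Instead of signing Bob's contract directly, Mallory creates two versions of the contract that share the same signature: an inoffensively modified version of Bob's fair contract and a fraudulent one.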
3. Mallory Provides the Modified Contract: Mallory signs the modified version and gives it to Bob. The signature on this version is the same as what would appear on the fraudulent contract.
4. Bob’s Risk: If Bob does not keep a copy of the modified version Mallory signed, but only retains the original proposal, he will not have proof of what Mallory agreed to. Later, Mallory can present the fraudulent contract (which carries the same signature) and claim it was the one that was signed.
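5. Outcomes: If Bob has kept only his original proposal, Mallory's fraud is perfect. If Bob has kept the modified version, Mallory can at least claim that it is Bob who is the fraudster.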