Birthday attack

Last updated

A birthday attack is a bruteforce collision attack that exploits the mathematics behind the birthday problem in probability theory. This attack can be used to abuse communication between two or more parties. The attack depends on the higher likelihood of collisions found between random attack attempts and a fixed degree of permutations (pigeonholes). With a birthday attack, it is possible to find a collision of a hash function with chance in [1] [2] , with being the classical preimage resistance security with the same probability [2] . There is a general (though disputed [3] ) result that quantum computers can perform birthday attacks, thus breaking collision resistance, in . [4]

Contents

Although there are some digital signature vulnerabilities associated with the birthday attack, it cannot be used to break an encryption scheme any faster than a brute-force attack. [5] :36

Understanding the problem

Comparison of the birthday problem (1) and birthday attack (2):
In (1), collisions are found within one set, in this case, 3 out of 276 pairings of the 24 lunar astronauts.
In (2), collisions are found between two sets, in this case, 1 out of 256 pairings of only the first bytes of SHA-256 hashes of 16 variants each of benign and malicious contracts. Birthday attack vs paradox.svg
Comparison of the birthday problem (1) and birthday attack (2):
In (1), collisions are found within one set, in this case, 3 out of 276 pairings of the 24 lunar astronauts.
In (2), collisions are found between two sets, in this case, 1 out of 256 pairings of only the first bytes of SHA-256 hashes of 16 variants each of benign and malicious contracts.

As an example, consider the scenario in which a teacher with a class of 30 students (n = 30) asks for everybody's birthday (for simplicity, ignore leap years) to determine whether any two students have the same birthday (corresponding to a hash collision as described further). Intuitively, this chance may seem small. Counter-intuitively, the probability that at least one student has the same birthday as any other student on any day is around 70% (for n = 30), from the formula . [6]

If the teacher had picked a specific day (say, 16 September), then the chance that at least one student was born on that specific day is , about 7.9%.

In a birthday attack, the attacker prepares many different variants of benign and malicious contracts, each having a digital signature. A pair of benign and malicious contracts with the same signature is sought. In this fictional example, suppose that the digital signature of a string is the first byte of its SHA-256 hash. The pair found is indicated in green note that finding a pair of benign contracts (blue) or a pair of malicious contracts (red) is useless. After the victim accepts the benign contract, the attacker substitutes it with the malicious one and claims the victim signed it, as proven by the digital signature.

Mathematics

Given a function , the goal of the attack is to find two different inputs such that . Such a pair is called a collision. The method used to find a collision is simply to evaluate the function for different input values that may be chosen randomly or pseudorandomly until the same result is found more than once. Because of the birthday problem, this method can be rather efficient. Specifically, if a function yields any of different outputs with equal probability and is sufficiently large, then we expect to obtain a pair of different arguments and with after evaluating the function for about different arguments on average.

We consider the following experiment. From a set of H values we choose n values uniformly at random thereby allowing repetitions. Let p(n; H) be the probability that during this experiment at least one value is chosen more than once. This probability can be approximated as

[7]

Let n(p; H) be the smallest number of values we have to choose, such that the probability for finding a collision is at least p. By inverting this expression above, we find the following approximation

and assigning a 0.5 probability of collision we arrive at

Let Q(H) be the expected number of values we have to choose before finding the first collision. This number can be approximated by

As an example, if a 64-bit hash is used, there are approximately 1.8×1019 different outputs. If these are all equally probable (the best case), then it would take 'only' approximately 5 billion attempts (5.38×109) to generate a collision using brute force. [8] This value is called birthday bound [9] and for n-bit codes it could be approximated as 2n/2. [10] Other examples are as follows:

BitsPossible outputs (H)Desired probability of random collision
(2 s.f.) (p)
10−1810−1510−1210−910−60.1%1%25%50%75%
16216 (~6.5 x 104)<2<2<2<2<21136190300430
32232 (~4.3×109)<2<2<23932900930050,00077,000110,000
64264 (~1.8×1019)61906100190,0006,100,0001.9×1086.1×1083.3×1095.1×1097.2×109
1282128 (~3.4×1038)2.6×10108.2×10112.6×10138.2×10142.6×10168.3×10172.6×10181.4×10192.2×10193.1×1019
2562256 (~1.2×1077)4.8×10291.5×10314.8×10321.5×10344.8×10351.5×10374.8×10372.6×10384.0×10385.7×1038
3842384 (~3.9×10115)8.9×10482.8×10508.9×10512.8×10538.9×10542.8×10568.9×10564.8×10577.4×10571.0×1058
5122512 (~1.3×10154)1.6×10685.2×10691.6×10715.2×10721.6×10745.2×10751.6×10768.8×10761.4×10771.9×1077
Table shows number of hashes n(p) needed to achieve the given probability of success, assuming all hashes are equally likely. For comparison, 10−18 to 10−15 is the uncorrectable bit error rate of a typical hard disk. [11] In theory, MD5 hashes or UUIDs, being roughly 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.

It is easy to see that if the outputs of the function are distributed unevenly, then a collision could be found even faster. The notion of 'balance' of a hash function quantifies the resistance of the function to birthday attacks (exploiting uneven key distribution.) However, determining the balance of a hash function will typically require all possible inputs to be calculated and thus is infeasible for popular hash functions such as the MD and SHA families. [12] The subexpression in the equation for is not computed accurately for small when directly translated into common programming languages as log(1/(1-p)) due to loss of significance. When log1p is available (as it is in C99) for example, the equivalent expression -log1p(-p) should be used instead. [13] If this is not done, the first column of the above table is computed as zero, and several items in the second column do not have even one correct significant digit.

Simple approximation

A good rule of thumb which can be used for mental calculation is the relation

which can also be written as

.

or

.

This works well for probabilities less than or equal to 0.5.

This approximation scheme is especially easy to use when working with exponents. For instance, suppose you are building 32-bit hashes () and want the chance of a collision to be at most one in a million (), how many documents could we have at the most?

which is close to the correct answer of 93.

Digital signature susceptibility

Digital signatures can be susceptible to a birthday attack or more precisely a chosen-prefix collision attack. A message is typically signed by first computing , where is a cryptographic hash function, and then using some secret key to sign . Suppose Mallory wants to trick Bob into signing a fraudulent contract. Mallory prepares a fair contract and a fraudulent one . She then finds a number of positions where can be changed without changing the meaning, such as inserting commas, empty lines, one versus two spaces after a sentence, replacing synonyms, etc. By combining these changes, she can create a huge number of variations on which are all fair contracts.

In a similar manner, Mallory also creates a huge number of variations on the fraudulent contract . She then applies the hash function to all these variations until she finds a version of the fair contract and a version of the fraudulent contract which have the same hash value, . She presents the fair version to Bob for signing. After Bob has signed, Mallory takes the signature and attaches it to the fraudulent contract. This signature then "proves" that Bob signed the fraudulent contract.

The probabilities differ slightly from the original birthday problem, as Mallory gains nothing by finding two fair or two fraudulent contracts with the same hash. Mallory's strategy is to generate pairs of one fair and one fraudulent contract. For a given hash function is the number of possible hashes. The birthday problem equations does not exactly apply here, the number of hashes Mallory actually generates for a chance is twice as much as required for a simple collision which corresponds to .

To avoid this attack, the output length of the hash function used for a signature scheme can be chosen large enough so that the birthday attack becomes computationally infeasible, i.e. about twice as many bits as are needed to prevent an ordinary brute-force attack.

Besides using a larger bit length, the signer (Bob) can protect himself by making some random, inoffensive changes to the document before signing it, and by keeping a copy of the contract he signed in his own possession, so that he can at least demonstrate in court that his signature matches that contract, not just the fraudulent one.

Pollard's rho algorithm for logarithms is an example for an algorithm using a birthday attack for the computation of discrete logarithms.

Reverse attack

The same fraud is possible if the signer is Mallory, not Bob. Bob could suggest a contract to Mallory for a signature. Mallory could find both an inoffensively-modified version of this fair contract that has the same signature as a fraudulent contract, and Mallory could provide the modified fair contract and signature to Bob. Later, Mallory could produce the fraudulent copy. If Bob doesn't have the inoffensively-modified version contract (perhaps only finding their original proposal), Mallory's fraud is perfect. If Bob does have it, Mallory can at least claim that it is Bob who is the fraudster.

See also

Notes

  1. "Avoiding collisions, Cryptographic hash functions" (PDF). Foundations of Cryptography, Computer Science Department, Wellesley College.
  2. 1 2 Dang, Q H (2012). Recommendation for applications using approved hash algorithms (Report). Gaithersburg, MD: National Institute of Standards and Technology.
  3. Daniel J. Bernstein. "Cost analysis of hash collisions : Will quantum computers make SHARCS obsolete?" (PDF). Cr.yp.to. Retrieved 29 October 2017.
  4. Brassard, Gilles; HØyer, Peter; Tapp, Alain (20 April 1998). "Quantum cryptanalysis of hash and claw-free functions". LATIN'98: Theoretical Informatics. Lecture Notes in Computer Science. Vol. 1380. Springer, Berlin, Heidelberg. pp. 163–169. arXiv: quant-ph/9705002 . doi:10.1007/BFb0054319. ISBN   978-3-540-64275-6. S2CID   118940551.
  5. R. Shirey (August 2007). Internet Security Glossary, Version 2. Network Working Group. doi: 10.17487/RFC4949 . RFC 4949.Informational.
  6. "Birthday Problem". Brilliant.org. Brilliant_(website) . Retrieved 28 July 2023.
  7. Bellare, Mihir; Rogaway, Phillip (2005). "The Birthday Problem". Introduction to Modern Cryptography (PDF). pp. 273–274. Retrieved 2023-03-31.
  8. Flajolet, Philippe; Odlyzko, Andrew M. (1990). "Random Mapping Statistics". In Quisquater, Jean-Jacques; Vandewalle, Joos (eds.). Advances in Cryptology — EUROCRYPT '89. Lecture Notes in Computer Science. Vol. 434. Berlin, Heidelberg: Springer. pp. 329–354. doi:10.1007/3-540-46885-4_34. ISBN   978-3-540-46885-1.
  9. See upper and lower bounds.
  10. Jacques Patarin, Audrey Montreuil (2005). "Benes and Butterfly schemes revisited" (PostScript, PDF). Université de Versailles. Retrieved 2007-03-15.{{cite journal}}: Cite journal requires |journal= (help)
  11. Gray, Jim; van Ingen, Catharine (25 January 2007). "Empirical Measurements of Disk Failure Rates and Error Rates". arXiv: cs/0701166 .
  12. "CiteSeerX". Archived from the original on 2008-02-23. Retrieved 2006-05-02.
  13. "Compute log(1+x) accurately for small values of x". Mathworks.com. Retrieved 29 October 2017.

Related Research Articles

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

<span class="mw-page-title-main">Maxwell–Boltzmann distribution</span> Specific probability distribution function, important in physics

In physics, the Maxwell–Boltzmann distribution, or Maxwell(ian) distribution, is a particular probability distribution named after James Clerk Maxwell and Ludwig Boltzmann.

In quantum computing, Grover's algorithm, also known as the quantum search algorithm, is a quantum algorithm for unstructured search that finds with high probability the unique input to a black box function that produces a particular output value, using just evaluations of the function, where is the size of the function's domain. It was devised by Lov Grover in 1996.

<span class="mw-page-title-main">Birthday problem</span> Mathematical problem

In probability theory, the birthday problem asks for the probability that, in a set of n randomly chosen people, at least two will share a birthday. The birthday paradox refers to the counterintuitive fact that only 23 people are needed for that probability to exceed 50%.

In computer science, a one-way function is a function that is easy to compute on every input, but hard to invert given the image of a random input. Here, "easy" and "hard" are to be understood in the sense of computational complexity theory, specifically the theory of polynomial time problems. Not being one-to-one is not considered sufficient for a function to be called one-way.

In cryptography, a collision attack on a cryptographic hash tries to find two inputs producing the same hash value, i.e. a hash collision. This is in contrast to a preimage attack where a specific target hash value is specified.

In cryptography, collision resistance is a property of cryptographic hash functions: a hash function H is collision-resistant if it is hard to find two inputs that hash to the same output; that is, two inputs a and b where ab but H(a) = H(b). The pigeonhole principle means that any hash function with more inputs than outputs will necessarily have such collisions; the harder they are to find, the more cryptographically secure the hash function is.

In cryptography, a Lamport signature or Lamport one-time signature scheme is a method for constructing a digital signature. Lamport signatures can be built from any cryptographically secure one-way function; usually a cryptographic hash function is used.

<span class="mw-page-title-main">One-way compression function</span> Cryptographic primitive

In cryptography, a one-way compression function is a function that transforms two fixed-length inputs into a fixed-length output. The transformation is "one-way", meaning that it is difficult given a particular output to compute inputs which compress to that output. One-way compression functions are not related to conventional data compression algorithms, which instead can be inverted exactly or approximately to the original data.

In mathematics and computing, universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known, and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

<span class="mw-page-title-main">Merkle–Damgård construction</span> Method of building collision-resistant cryptographic hash functions

In cryptography, the Merkle–Damgård construction or Merkle–Damgård hash function is a method of building collision-resistant cryptographic hash functions from collision-resistant one-way compression functions. This construction was used in the design of many popular hash algorithms such as MD5, SHA-1 and SHA-2.

In cryptography, the Rabin signature algorithm is a method of digital signature originally proposed by Michael O. Rabin in 1978.

A Quantum Digital Signature (QDS) refers to the quantum mechanical equivalent of either a classical digital signature or, more generally, a handwritten signature on a paper document. Like a handwritten signature, a digital signature is used to protect a document, such as a digital contract, against forgery by another party or by one of the participating parties.

In cryptography, Very Smooth Hash (VSH) is a provably secure cryptographic hash function invented in 2005 by Scott Contini, Arjen Lenstra and Ron Steinfeld. Provably secure means that finding collisions is as difficult as some known hard mathematical problem. Unlike other provably secure collision-resistant hashes, VSH is efficient and usable in practice. Asymptotically, it only requires a single multiplication per log(n) message-bits and uses RSA-type arithmetic. Therefore, VSH can be useful in embedded environments where code space is limited.

In cryptography, cryptographic hash functions can be divided into two main categories. In the first category are those functions whose designs are based on mathematical problems, and whose security thus follows from rigorous mathematical proofs, complexity theory and formal reduction. These functions are called Provably Secure Cryptographic Hash Functions. To construct these is very difficult, and few examples have been introduced. Their practical use is limited.

The elliptic curve only hash (ECOH) algorithm was submitted as a candidate for SHA-3 in the NIST hash function competition. However, it was rejected in the beginning of the competition since a second pre-image attack was found.

In cryptography, SWIFFT is a collection of provably secure hash functions. It is based on the concept of the fast Fourier transform (FFT). SWIFFT is not the first hash function based on FFT, but it sets itself apart by providing a mathematical proof of its security. It also uses the LLL basis reduction algorithm. It can be shown that finding collisions in SWIFFT is at least as difficult as finding short vectors in cyclic/ideal lattices in the worst case. By giving a security reduction to the worst-case scenario of a difficult mathematical problem, SWIFFT gives a much stronger security guarantee than most other cryptographic hash functions.

In discrete mathematics, ideal lattices are a special class of lattices and a generalization of cyclic lattices. Ideal lattices naturally occur in many parts of number theory, but also in other areas. In particular, they have a significant place in cryptography. Micciancio defined a generalization of cyclic lattices as ideal lattices. They can be used in cryptosystems to decrease by a square root the number of parameters necessary to describe a lattice, making them more efficient. Ideal lattices are a new concept, but similar lattice classes have been used for a long time. For example, cyclic lattices, a special case of ideal lattices, are used in NTRUEncrypt and NTRUSign.

Badger is a Message Authentication Code (MAC) based on the idea of universal hashing and was developed by Boesgaard, Scavenius, Pedersen, Christensen, and Zenner. It is constructed by strengthening the ∆-universal hash family MMH using an ϵ-almost strongly universal (ASU) hash function family after the application of ENH, where the value of ϵ is . Since Badger is a MAC function based on the universal hash function approach, the conditions needed for the security of Badger are the same as those for other universal hash functions such as UMAC.

Network coding has been shown to optimally use bandwidth in a network, maximizing information flow but the scheme is very inherently vulnerable to pollution attacks by malicious nodes in the network. A node injecting garbage can quickly affect many receivers. The pollution of network packets spreads quickly since the output of honest node is corrupted if at least one of the incoming packets is corrupted.

References