Rolling hash

A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters; and similar to the way a Zobrist hash can be rapidly updated from the old hash value.

One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the rsync program, which uses a checksum based on Mark Adler's Adler-32 as its rolling hash. The Low Bandwidth Network Filesystem (LBFS) uses a Rabin fingerprint as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.

At best, rolling hash values are pairwise independent [1] or strongly universal. They cannot be 3-wise independent, for example.

Polynomial rolling hash

The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:

$H = c_1 a^{k-1} + c_2 a^{k-2} + \cdots + c_{k-1} a + c_k$,

where $a$ is a constant, and $c_1, \ldots, c_k$ are the input characters (but this function is not a Rabin fingerprint, see below).

In order to avoid manipulating huge $H$ values, all math is done modulo $n$. The choice of $a$ and $n$ is critical to get good hashing; in particular, the modulus $n$ is typically a prime number. See linear congruential generator for more discussion.

Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum $H$ by $a$. Shifting all characters by one position to the right requires dividing the entire sum $H$ by $a$. Note that in modulo arithmetic, $a$ can be chosen to have a multiplicative inverse $a^{-1}$ by which $H$ can be multiplied to get the result of the division without actually performing a division.
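As an illustration, the following minimal Python sketch rolls this polynomial hash over a fixed-size window; the base $a = 256$ and the prime modulus are arbitrary example choices, not values prescribed by the algorithm.

# Minimal sketch of the polynomial rolling hash (Rabin–Karp style).
# The base `a` and the prime modulus `n` below are illustrative choices.
a = 256
n = 1_000_000_007

def hash_window(chars: str) -> int:
    """Compute c_1*a^(k-1) + c_2*a^(k-2) + ... + c_k, all modulo n."""
    H = 0
    for c in chars:
        H = (H * a + ord(c)) % n
    return H

def roll(H: int, old_char: str, new_char: str, a_pow: int) -> int:
    """Slide the window one position: drop old_char, append new_char.
    a_pow must equal pow(a, k - 1, n) for a window of length k."""
    H = (H - ord(old_char) * a_pow) % n   # subtract the leading term
    return (H * a + ord(new_char)) % n    # shift left and add the new term

text, k = "abcdefg", 3
a_pow = pow(a, k - 1, n)
H = hash_window(text[:k])
for i in range(1, len(text) - k + 1):
    H = roll(H, text[i - 1], text[i + k - 1], a_pow)
    assert H == hash_window(text[i:i + k])   # rolling matches recomputation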

Rabin fingerprint

The Rabin fingerprint is another hash, which also interprets the input as a polynomial, but over the Galois field GF(2). Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to CRC32). The hash is the result of the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.
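The following is only a Python sketch of the underlying arithmetic: it reduces the input, read bit by bit, modulo a polynomial over GF(2). The degree and the modulus POLY are placeholders chosen for illustration (a real Rabin fingerprint requires an irreducible polynomial), and a practical rolling implementation would use precomputed tables to add the entering byte and remove the leaving byte.

# Sketch: interpret the input as a polynomial of bits over GF(2) and compute
# its remainder modulo POLY.  DEGREE and POLY are placeholder values; a real
# Rabin fingerprint requires an irreducible polynomial of the chosen degree.
DEGREE = 16
POLY = (1 << DEGREE) | 0x100B     # placeholder modulus: x^16 plus lower-order terms

def gf2_fingerprint(data: bytes) -> int:
    fp = 0
    for byte in data:
        for shift in range(7, -1, -1):          # feed bits most-significant first
            fp = (fp << 1) | ((byte >> shift) & 1)
            if fp & (1 << DEGREE):              # degree overflowed: subtract (XOR) the modulus
                fp ^= POLY
    return fp

print(hex(gf2_fingerprint(b"hello world")))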

Because it shares the same author as the Rabin–Karp string search algorithm, which is often explained with another, simpler rolling hash, and because that simpler rolling hash is also a polynomial, the two rolling hashes are often mistaken for each other. The backup software restic uses a Rabin fingerprint for splitting files, with blob sizes varying between 512 KiB and 8 MiB. [2]

Cyclic polynomial

Hashing by cyclic polynomial [3] (sometimes called Buzhash) is also simple, but it has the benefit of avoiding multiplications, using barrel shifts instead. It is a form of tabulation hashing: it presumes that there is some hash function $h$ from characters to integers in the interval $[0, 2^L)$. This hash function might be simply an array or a hash table mapping characters to random integers. Let the function $s$ be a cyclic binary rotation (or circular shift): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., $s(10011) = 00111$. Let $\oplus$ be the bitwise exclusive or. The hash values are defined as

$H = s^{k-1}(h(c_1)) \oplus s^{k-2}(h(c_2)) \oplus \cdots \oplus s(h(c_{k-1})) \oplus h(c_k),$

where the multiplications by powers of two can be implemented by binary shifts. The result is a number in $[0, 2^L)$.

Computing the hash values in a rolling fashion is done as follows. Let $H$ be the previous hash value. Rotate $H$ once: $H \leftarrow s(H)$. If $c_1$ is the character to be removed, rotate it $k$ times: $s^k(h(c_1))$. Then simply set

$H \leftarrow s(H) \oplus s^k(h(c_1)) \oplus h(c_{k+1}),$

where $c_{k+1}$ is the new character.

Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first $L - k + 1$ bits. That is, take the result $H$ and dismiss any $k - 1$ consecutive bits. [1] In practice, this can be achieved by an integer division $H \rightarrow H \div 2^{k-1}$.
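A minimal Python sketch of this scheme, using $L = 32$ and a table of random integers in place of $h$ (the width, the seed, and the window length are illustrative assumptions):

import random

L = 32                        # hash width in bits (illustrative choice)
MASK = (1 << L) - 1
random.seed(0)
h = [random.getrandbits(L) for _ in range(256)]   # random table playing the role of h

def s(x: int, r: int = 1) -> int:
    """Cyclic left rotation of an L-bit word by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK

def buzhash(window: bytes) -> int:
    """H = s^(k-1)(h(c_1)) xor ... xor s(h(c_(k-1))) xor h(c_k)."""
    H = 0
    for c in window:
        H = s(H) ^ h[c]
    return H

def roll(H: int, k: int, old: int, new: int) -> int:
    """Slide the window: remove character `old`, append character `new`."""
    return s(H) ^ s(h[old], k) ^ h[new]

data, k = b"abracadabra", 4
H = buzhash(data[:k])
for i in range(1, len(data) - k + 1):
    H = roll(H, k, data[i - 1], data[i + k - 1])
    assert H == buzhash(data[i:i + k])   # rolling matches recomputation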

Content-based slicing using a rolling hash

One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network: a simple byte addition at the front of the file would normally cause all fixed size windows to become updated, while in reality, only the first "chunk" has been modified. [4]

A simple approach to making dynamic chunks is to calculate a rolling hash, and if the hash value matches an arbitrary pattern (e.g. all zeroes) in the lower $N$ bits (with a probability of $2^{-N}$, given the hash has a uniform probability distribution), then it is chosen to be a chunk boundary. Each chunk will thus have an average size of $2^N$ bytes. This approach ensures that unmodified data (more than a window size away from the changes) will have the same boundaries.
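A minimal Python sketch of this idea, using a simple polynomial rolling hash as the window hash; the window size, the number of low bits $N$, and the hash parameters are all illustrative choices:

# Sketch of content-defined chunking: cut wherever the low N bits of a rolling
# hash over the last WINDOW bytes are zero.  All constants are illustrative.
N = 12                         # expected average chunk size is about 2**N bytes
WINDOW = 48                    # number of bytes covered by the rolling hash
A, MOD = 256, (1 << 61) - 1    # base and modulus of a simple polynomial hash

def chunk_boundaries(data: bytes) -> list[int]:
    boundaries = []
    H = 0
    a_pow = pow(A, WINDOW - 1, MOD)
    for i, byte in enumerate(data):
        if i >= WINDOW:                           # drop the byte leaving the window
            H = (H - data[i - WINDOW] * a_pow) % MOD
        H = (H * A + byte) % MOD
        if i + 1 >= WINDOW and H & ((1 << N) - 1) == 0:
            boundaries.append(i + 1)              # cut after byte i
    return boundaries

Because each boundary decision depends only on the bytes inside the window, an insertion near the start of the file can only move boundaries within a window of the change; later boundaries are unchanged.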

Once the boundaries are known, the chunks need to be compared by cryptographic hash value to detect changes. [5] The backup software Borg uses the Buzhash algorithm with a customizable chunk size range for splitting file streams. [6]

Such content-defined chunking is often used for data deduplication. [4] [6]

Content-based slicing using moving sum

Several programs, including gzip (with the --rsyncable option) and rsyncrypto, do content-based slicing based on this specific (unweighted) moving sum: [7]

$S(n) = \sum_{i=n-8195}^{n} c_i,$

$A(n) = S(n) \bmod 4096,$

where

- $S(n)$ is the sum of 8196 consecutive bytes ending with byte $n$ (the window size),
- $c_i$ is byte $i$ of the file,
- $A(n)$ is the hash value, consisting of the lowest 12 bits of $S(n)$.

Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.

For every $n$ where $A(n) = 0$, these programs cut the file between $n$ and $n + 1$. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but no other chunk.
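A minimal Python sketch of this kind of slicing; the window size and modulus follow the values above, but should be treated as illustrative, since each program picks its own constants:

# Sketch of moving-sum slicing: keep a running sum of the last WINDOW bytes and
# cut wherever the sum is divisible by MODULUS.  Constants are illustrative.
WINDOW = 8196
MODULUS = 4096

def moving_sum_cuts(data: bytes) -> list[int]:
    cuts = []
    s = 0
    for n, byte in enumerate(data):
        s += byte
        if n >= WINDOW:
            s -= data[n - WINDOW]         # subtract the byte that left the window
        if s % MODULUS == 0:
            cuts.append(n + 1)            # cut between byte n and byte n + 1
    return cuts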

Gear fingerprint and content-based chunking algorithm FastCDC

Chunking is a technique to divide a data stream into a set of blocks, also called chunks. Content-defined chunking (CDC) is a chunking technique in which the division of the data stream is not based on fixed chunk size, as in fixed-size chunking, but on its content.

The content-defined chunking algorithm needs to compute the hash value of a data stream byte by byte and split the data stream into chunks when the hash value meets a predefined value. However, computing the hash byte by byte introduces heavy computation overhead. FastCDC [8] proposes a new and efficient content-defined chunking approach. It uses a fast rolling Gear hash algorithm, [9] skips the minimum chunk length, normalizes the chunk-size distribution, and, last but not least, rolls two bytes each time to speed up the CDC algorithm, which can achieve about 10× higher throughput than the Rabin-based CDC approach. [10]

The basic version pseudocode is provided as follows:

algorithm FastCDC
    input: data buffer src,
           data length n
    output: cut point i

    MinSize ← 2KB     // split minimum chunk size is 2 KB
    MaxSize ← 64KB    // split maximum chunk size is 64 KB
    Mask ← 0x0000d93003530000LL
    fp ← 0
    i ← 0

    // buffer size is less than the minimum chunk size
    if n ≤ MinSize then
        return n
    if n ≥ MaxSize then
        n ← MaxSize

    // Skip the first MinSize bytes, and kickstart the hash
    while i < MinSize do
        fp ← (fp << 1) + Gear[src[i]]
        i ← i + 1

    while i < n do
        fp ← (fp << 1) + Gear[src[i]]
        if !(fp & Mask) then
            return i
        i ← i + 1

    return i

Where the Gear array is a pre-calculated hashing array. Here FastCDC uses the Gear hashing algorithm, which can calculate the rolling hash results quickly while keeping them as uniformly distributed as Rabin hashing. Compared with the traditional Rabin hashing algorithm, it achieves a much faster speed. Experiments suggest that it can generate nearly the same chunk-size distribution in a much shorter time (about 1/10 of that of Rabin-based chunking [10]) when segmenting the data stream.
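As a sketch of the Gear construction in Python: the Gear table is just 256 precomputed random machine-word values, and the per-byte update is one shift, one table lookup and one addition, matching the inner loop of the pseudocode above (the 64-bit width and the fixed seed are illustrative assumptions):

import random

# Sketch of a Gear rolling hash: one shift, one table lookup and one addition
# per input byte.  The 64-bit width and the fixed seed are illustrative.
random.seed(1)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def gear_hashes(data: bytes):
    """Yield the Gear fingerprint after each input byte."""
    fp = 0
    for byte in data:
        fp = ((fp << 1) + GEAR[byte]) & MASK64
        yield fp

Because the left shift ages every table entry out of the fingerprint after 64 steps, the fingerprint implicitly covers a sliding window of the most recent bytes without an explicit subtraction step.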

Computational complexity

All rolling hash functions can be computed in time linear in the number of characters and updated in constant time when characters are shifted by one position. In particular, computing the Rabin–Karp rolling hash of a string of length $k$ requires $O(k)$ modular arithmetic operations, and hashing by cyclic polynomials requires $O(k)$ bitwise exclusive ors and circular shifts. [1]

See also

Footnotes

  1. Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698–710, 2010. arXiv:0705.4676.
  2. "References — restic 0.9.0 documentation". restic.readthedocs.io. Retrieved 2018-05-24.
  3. Jonathan D. Cohen, Recursive Hashing Functions for n-Grams, ACM Trans. Inf. Syst. 15 (3), 1997.
  4. "Foundation - Introducing Content Defined Chunking (CDC)". 2015.
  5. Horvath, Adam (October 24, 2012). "Rabin Karp rolling hash - dynamic sized chunks based on hashed content".
  6. "Data structures and file formats — Borg - Deduplicating Archiver 1.1.5 documentation". borgbackup.readthedocs.io. Retrieved 2018-05-24.
  7. "Rsyncrypto Algorithm".
  8. Xia, Wen; Zhou, Yukun; Jiang, Hong; Feng, Dan; Hua, Yu; Hu, Yuchong; Liu, Qing; Zhang, Yucheng (2016). FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication. ISBN 9781931971300. Retrieved 2020-07-24.
  9. Xia, Wen; Jiang, Hong; Feng, Dan; Tian, Lei; Fu, Min; Zhou, Yukun (2014). "Ddelta: A deduplication-inspired fast delta compression approach". Performance Evaluation. 79: 258–272. doi:10.1016/j.peva.2014.07.016. ISSN   0166-5316.
  10. Xia, Wen; Zou, Xiangyu; Jiang, Hong; Zhou, Yukun; Liu, Chuanyi; Feng, Dan; Hua, Yu; Hu, Yuchong; Zhang, Yucheng (2020-06-16). "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems". IEEE Transactions on Parallel and Distributed Systems. 31 (9): 2017–2031. doi:10.1109/TPDS.2020.2984632. S2CID 215817722. Retrieved 2020-07-22.
