MurmurHash

Last updated

MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. [1] [2] [3] It was created by Austin Appleby in 2008 [4] and is currently hosted on GitHub along with its test suite named 'SMHasher'. It also exists in a number of variants, [5] all of which have been released into the public domain. The name comes from two basic operations, multiply (MU) and rotate (R), used in its inner loop.

Contents

Unlike cryptographic hash functions, it is not specifically designed to be difficult to reverse by an adversary, making it unsuitable for cryptographic purposes.

Variants

MurmurHash1

The original MurmurHash was created as an attempt to make a faster function than Lookup3. [6] Although successful, it hadn't been tested thoroughly and wasn't capable of providing 64-bit hashes as in Lookup3. It had a rather elegant design, that would be later built upon in MurmurHash2, combining a multiplicative hash (similar to Fowler–Noll–Vo hash function) with a Xorshift.

MurmurHash2

MurmurHash2 [7] yields a 32-bit or 64-bit value. It came in multiple variants, including some that allowed incremental hashing and aligned or neutral versions.

The person who originally found the flaw[ clarification needed ] in MurmurHash2 created an unofficial 160-bit version of MurmurHash2 called MurmurHash2_160. [10]

MurmurHash3

The current version is MurmurHash3, [11] [12] which yields a 32-bit or 128-bit hash value. When using 128-bits, the x86 and x64 versions do not produce the same values, as the algorithms are optimized for their respective platforms. MurmurHash3 was released alongside SMHasher—a hash function test suite.

Implementations

The canonical implementation is in C++, but there are efficient ports for a variety of popular languages, including Python, [13] C, [14] Go, [15] C#, [12] [16] D, [17] Lua, Perl, [18] Ruby, [19] Rust, [20] PHP, [21] [22] Common Lisp, [23] Haskell, [24] Elm, [25] Clojure, [26] Scala, [27] Java, [28] [29] Erlang, [30] Swift, [31] Object Pascal, [32] Kotlin, [33] JavaScript. [34] and OCaml, [35]

It has been adopted into a number of open-source projects, most notably libstdc++ (ver 4.6), nginx (ver 1.0.1), [36] Rubinius, [37] libmemcached (the C driver for Memcached), [38] npm (nodejs package manager), [39] maatkit, [40] Hadoop, [1] Kyoto Cabinet, [41] Cassandra, [42] [43] Solr, [44] vowpal wabbit, [45] Elasticsearch, [46] Guava, [47] Kafka [48] and RedHat Virtual Data Optimizer (VDO). [49]

Vulnerabilities

Hash functions can be vulnerable to collision attacks, where a user can choose input data in such a way so as to intentionally cause hash collisions. Jean-Philippe Aumasson and Daniel J. Bernstein were able to show that even implementations of MurmurHash using a randomized seed are vulnerable to so-called HashDoS attacks. [50] With the use of differential cryptanalysis they were able to generate inputs that would lead to a hash collision. The authors of the attack recommend to use their own SipHash instead.

Algorithm

algorithm Murmur3_32 is// Note: In this version, all arithmetic is performed with unsigned 32-bit integers.//       In the case of overflow, the result is reduced modulo 232.input:key, len, seed      c1 ← 0xcc9e2d51     c2 ← 0x1b873593     r1 ← 15     r2 ← 13     m ← 5     n ← 0xe6546b64      hash ← seedfor each fourByteChunk of keydo         k ← fourByteChunk          k ← k × c1         k ← k ROL r1         k ← k × c2          hash ← hash XOR k         hash ← hash ROL r2         hash ← (hash × m) + n      with any remainingBytesInKey do         remainingBytes ← SwapToLittleEndian(remainingBytesInKey)         // Note: Endian swapping is only necessary on big-endian machines.//       The purpose is to place the meaningful digits towards the low end of the value,//       so that these digits have the greatest potential to affect the low range digits//       in the subsequent multiplication.  Consider that locating the meaningful digits//       in the high range would produce a greater effect upon the high digits of the//       multiplication, and notably, that such high digits are likely to be discarded//       by the modulo arithmetic under overflow.  We don't want that.          remainingBytes ← remainingBytes × c1         remainingBytes ← remainingBytes ROL r1         remainingBytes ← remainingBytes × c2          hash ← hash XOR remainingBytes      hash ← hash XOR len      hash ← hash XOR (hash >> 16)     hash ← hash × 0x85ebca6b     hash ← hash XOR (hash >> 13)     hash ← hash × 0xc2b2ae35     hash ← hash XOR (hash >> 16)

A sample C implementation follows (for little-endian CPUs):

staticinlineuint32_tmurmur_32_scramble(uint32_tk){k*=0xcc9e2d51;k=(k<<15)|(k>>17);k*=0x1b873593;returnk;}uint32_tmurmur3_32(constuint8_t*key,size_tlen,uint32_tseed){uint32_th=seed;uint32_tk;/* Read in groups of 4. */for(size_ti=len>>2;i;i--){// Here is a source of differing results across endiannesses.// A swap here has no effects on hash properties though.memcpy(&k,key,sizeof(uint32_t));key+=sizeof(uint32_t);h^=murmur_32_scramble(k);h=(h<<13)|(h>>19);h=h*5+0xe6546b64;}/* Read the rest. */k=0;for(size_ti=len&3;i;i--){k<<=8;k|=key[i-1];}// A swap is *not* necessary here because the preceding loop already// places the low bytes in the low places according to whatever endianness// we use. Swaps only apply when the memory is copied in a chunk.h^=murmur_32_scramble(k);/* Finalize. */h^=len;h^=h>>16;h*=0x85ebca6b;h^=h>>13;h*=0xc2b2ae35;h^=h>>16;returnh;}
Tests
Test stringSeed valueHash value (hexadecimal)Hash value (decimal)
0x000000000x000000000
0x000000010x514E28B71,364,076,727
0xffffffff0x81F16F392,180,083,513
test0x000000000xba6bd2133,127,628,307
test0x9747b28c0x704b81dc1,883,996,636
Hello, world!0x000000000xc0363e433,224,780,355
Hello, world!0x9747b28c0x24884CBA612,912,314
The quick brown fox jumps over the lazy dog0x000000000x2e4ff723776,992,547
The quick brown fox jumps over the lazy dog0x9747b28c0x2FA826CD799,549,133

See also

Related Research Articles

Blowfish is a symmetric-key block cipher, designed in 1993 by Bruce Schneier and included in many cipher suites and encryption products. Blowfish provides a good encryption rate in software, and no effective cryptanalysis of it has been found to date. However, the Advanced Encryption Standard (AES) now receives more attention, and Schneier recommends Twofish for modern applications.

The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4, and was specified in 1992 as RFC 1321.

In cryptography, SHA-1 is a hash function which takes an input and produces a 160-bit (20-byte) hash value known as a message digest – typically rendered as 40 hexadecimal digits. It was designed by the United States National Security Agency, and is a U.S. Federal Information Processing Standard. The algorithm has been cryptographically broken but is still widely used.

<span class="mw-page-title-main">Endianness</span> Order of bytes in a computer word

In computing, endianness is the order in which bytes within a word of digital data are transmitted over a data communication medium or addressed in computer memory, counting only byte significance compared to earliness. Endianness is primarily expressed as big-endian (BE) or little-endian (LE), terms introduced by Danny Cohen into computer science for data ordering in an Internet Experiment Note published in 1980. The adjective endian has its origin in the writings of 18th century Anglo-Irish writer Jonathan Swift. In the 1726 novel Gulliver's Travels, he portrays the conflict between sects of Lilliputians divided into those breaking the shell of a boiled egg from the big end or from the little end. By analogy, a CPU may read a digital word big end first, or little end first.

<span class="mw-page-title-main">Netwide Assembler</span> Assembler for the Intel x86 architecture

The Netwide Assembler (NASM) is an assembler and disassembler for the Intel x86 architecture. It can be used to write 16-bit, 32-bit (IA-32) and 64-bit (x86-64) programs. It is considered one of the most popular assemblers for Linux and x86 chips.

In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operations and directly supported by the processor. Most bitwise operations are presented as two-operand instructions where the result replaces one of the input operands.

Adler-32 is a checksum algorithm written by Mark Adler in 1995, modifying Fletcher's checksum. Compared to a cyclic redundancy check of the same length, it trades reliability for speed. Adler-32 is more reliable than Fletcher-16, and slightly less reliable than Fletcher-32.

SHA-2 is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001. They are built using the Merkle–Damgård construction, from a one-way compression function itself built using the Davies–Meyer structure from a specialized block cipher.

In cryptography, a message authentication code based on universal hashing, or UMAC, is a type of message authentication code (MAC) calculated choosing a hash function from a class of hash functions according to some secret (random) process and applying it to the message. The resulting digest or fingerprint is then encrypted to hide the identity of the hash function used. As with any MAC, it may be used to simultaneously verify both the data integrity and the authenticity of a message. In contrast to traditional MACs, which are serializable, UMAC can be executed in parallel. Thus as machines continue to offer more parallel processing capabilities, the speed of implementing UMAC will increase.

<span class="mw-page-title-main">GUID Partition Table</span> Computer disk partitioning standard

The GUID Partition Table (GPT) is a standard for the layout of partition tables of a physical computer storage device, such as a hard disk drive or solid-state drive, using universally unique identifiers (UUIDs), which are also known as globally unique identifiers (GUIDs). Forming a part of the Unified Extensible Firmware Interface (UEFI) standard, it is nevertheless also used for some BIOSs, because of the limitations of master boot record (MBR) partition tables, which use 32 bits for logical block addressing (LBA) of traditional 512-byte disk sectors.

<span class="mw-page-title-main">Salsa20</span> Stream ciphers

Salsa20 and the closely related ChaCha are stream ciphers developed by Daniel J. Bernstein. Salsa20, the original cipher, was designed in 2005, then later submitted to the eSTREAM European Union cryptographic validation process by Bernstein. ChaCha is a modification of Salsa20 published in 2008. It uses a new round function that increases diffusion and increases performance on some architectures.

A hash array mapped trie (HAMT) is an implementation of an associative array that combines the characteristics of a hash table and an array mapped trie. It is a refined version of the more general notion of a hash tree.

<span class="mw-page-title-main">Computation of cyclic redundancy checks</span> Overview of the computation of cyclic redundancy checks

Computation of a cyclic redundancy check is derived from the mathematics of polynomial division, modulo two. In practice, it resembles long division of the binary message string, with a fixed number of zeroes appended, by the "generator polynomial" string except that exclusive or operations replace subtractions. Division of this type is efficiently realised in hardware by a modified shift register, and in software by a series of equivalent algorithms, starting with simple code close to the mathematics and becoming faster through byte-wise parallelism and space–time tradeoffs.

SHA-3 is the latest member of the Secure Hash Algorithm family of standards, released by NIST on August 5, 2015. Although part of the same series of standards, SHA-3 is internally different from the MD5-like structure of SHA-1 and SHA-2.

bcrypt is a password-hashing function designed by Niels Provos and David Mazières, based on the Blowfish cipher and presented at USENIX in 1999. Besides incorporating a salt to protect against rainbow table attacks, bcrypt is an adaptive function: over time, the iteration count can be increased to make it slower, so it remains resistant to brute-force search attacks even with increasing computation power.

BLAKE is a cryptographic hash function based on Daniel J. Bernstein's ChaCha stream cipher, but a permuted copy of the input block, XORed with round constants, is added before each ChaCha round. Like SHA-2, there are two variants differing in the word size. ChaCha operates on a 4×4 array of words. BLAKE repeatedly combines an 8-word hash value with 16 message words, truncating the ChaCha result to obtain the next hash value. BLAKE-256 and BLAKE-224 use 32-bit words and produce digest sizes of 256 bits and 224 bits, respectively, while BLAKE-512 and BLAKE-384 use 64-bit words and produce digest sizes of 512 bits and 384 bits, respectively.

A concurrent hash-trie or Ctrie is a concurrent thread-safe lock-free implementation of a hash array mapped trie. It is used to implement the concurrent map abstraction. It has particularly scalable concurrent insert and remove operations and is memory-efficient. It is the first known concurrent data-structure that supports O(1), atomic, lock-free snapshots.

<span class="mw-page-title-main">Google Authenticator</span> Two-step verification app

Google Authenticator is a software-based authenticator by Google. It implements multi-factor authentication services using the time-based one-time password and HMAC-based one-time password, for authenticating users of software applications.

In cryptography, scrypt is a password-based key derivation function created by Colin Percival in March 2009, originally for the Tarsnap online backup service. The algorithm was specifically designed to make it costly to perform large-scale custom hardware attacks by requiring large amounts of memory. In 2016, the scrypt algorithm was published by IETF as RFC 7914. A simplified version of scrypt is used as a proof-of-work scheme by a number of cryptocurrencies, first implemented by an anonymous programmer called ArtForz in Tenebrix and followed by Fairbrix and Litecoin soon after.

Argon2 is a key derivation function that was selected as the winner of the 2015 Password Hashing Competition. It was designed by Alex Biryukov, Daniel Dinu, and Dmitry Khovratovich from the University of Luxembourg. The reference implementation of Argon2 is released under a Creative Commons CC0 license or the Apache License 2.0, and provides three related versions:

References

  1. 1 2 "Hadoop in Java". Hbase.apache.org. 24 July 2011. Archived from the original on 12 January 2012. Retrieved 13 January 2012.
  2. Chouza et al.
  3. "Couceiro et al" (PDF) (in Portuguese). p. 14. Retrieved 13 January 2012.
  4. Tanjent (tanjent) wrote,3 March 2008 13:31:00. "MurmurHash first announcement". Tanjent.livejournal.com. Retrieved 13 January 2012.{{cite web}}: CS1 maint: numeric names: authors list (link)
  5. "MurmurHash2-160". Simonhf.wordpress.com. 25 September 2010. Retrieved 13 January 2012.
  6. "MurmurHash1" . Retrieved 12 January 2019.
  7. "MurmurHash2 on Github".
  8. "MurmurHash2Flaw" . Retrieved 15 January 2019.
  9. "MurmurHash3 (see note on MurmurHash2_x86_64)" . Retrieved 15 January 2019.
  10. "MurmurHash2_160" . Retrieved 12 January 2019.
  11. "MurmurHash3 on Github".
  12. 1 2 Horvath, Adam (10 August 2012). "MurMurHash3, an ultra fast hash algorithm for C# / .NET".
  13. "pyfasthash in Python" . Retrieved 13 January 2012.
  14. "C implementation in qLibc by Seungyoung Kim".
  15. "murmur3 in Go".
  16. Landman, Davy. "Davy Landman in C#". Landman-code.blogspot.com. Retrieved 13 January 2012.
  17. "std.digest.murmurhash - D Programming Language". dlang.org. Retrieved 5 November 2016.
  18. "Toru Maesaka in Perl". metacpan.org. Retrieved 13 January 2012.
  19. Yuki Kurihara (16 October 2014). "Digest::MurmurHash". GitHub.com. Retrieved 18 March 2015.
  20. "stusmall/murmur3". GitHub. Retrieved 29 November 2015.
  21. "PHP userland implementation of MurmurHash3". github.com. Retrieved 18 December 2017.
  22. "PHP 8.1 with MurmurHash3 support".
  23. "tarballs_are_good / murmurhash3" . Retrieved 7 February 2015.
  24. "Haskell". Hackage.haskell.org. Retrieved 13 January 2012.
  25. "Elm". package.elm-lang.org. Retrieved 12 June 2019.
  26. "Murmur3.java in Clojure source code on Github". clojure.org. Retrieved 11 March 2014.
  27. "Scala standard library implementation". 26 September 2014.
  28. Murmur3, part of Guava
  29. "Murmur3A and Murmur3F Java classes on Github". greenrobot. Retrieved 5 November 2014.
  30. "bipthelin/murmerl3". GitHub. Retrieved 21 October 2015.
  31. Daisuke T (7 February 2019). "MurmurHash-Swift". GitHub.com. Retrieved 10 February 2019.
  32. GitHub - Xor-el/HashLib4Pascal: Hashing for Modern Object Pascal
  33. "goncalossilva/kotlinx-murmurhash". GitHub.com. 10 December 2021. Retrieved 14 December 2021.
  34. raycmorgan (owner). "Javascript implementation by Ray Morgan". Gist.github.com. Retrieved 13 January 2012.
  35. INRIA. "OCaml Source". GitHub.com.
  36. "nginx" . Retrieved 13 January 2012.
  37. "Rubinius" . Retrieved 29 February 2012.
  38. "libMemcached". libmemcached.org. Retrieved 21 October 2015.
  39. "switch from MD5 to murmur".
  40. "maatkit". 24 March 2009. Retrieved 13 January 2012.
  41. "Kyoto Cabinet specification". Fallabs.com. 4 March 2011. Retrieved 13 January 2012.
  42. "Partitioners". apache.org. 15 November 2013. Retrieved 19 December 2013.
  43. "Introduction to Apache Cassandra™ + What's New in 4.0 by Patrick McFadin. DataStax Presents". YouTube. 10 April 2019.
  44. "Solr MurmurHash2 Javadoc".
  45. "hash.cc in vowpalwabbit source code".
  46. "Elasticsearch 2.0 - CRUD and routing changes".
  47. "Guava Hashing.java".
  48. "Kafka DefaultPartitioner.java".
  49. Virtual Data Optimizer source code
  50. "Breaking Murmur: Hash-flooding DoS Reloaded".