Pearson hashing

Last updated May 05, 2024

Pearson hashing is a non-cryptographic hash function designed for fast execution on processors with 8-bit registers. Given an input consisting of any number of bytes, it produces as output a single byte that is strongly dependent on every byte of the input. Its implementation requires only a few instructions, plus a 256-byte lookup table containing a permutation of the values 0 through 255.^[1]

This hash function is a CBC-MAC that uses an 8-bit substitution cipher implemented via the substitution table. An 8-bit cipher has negligible cryptographic security, so the Pearson hash function is not cryptographically strong, but it is useful for implementing hash tables or as a data integrity check code, for which purposes it offers these benefits:

It is extremely simple.
It executes quickly on resource-limited processors.
There is no simple class of inputs for which collisions (identical outputs) are especially likely.
Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.
Two input strings differing by exactly one character never collide.^[2] E.g., applying the algorithm on the strings ABC and AEC will never produce the same value.

One of its drawbacks when compared with other hashing algorithms designed for 8-bit processors is the suggested 256 byte lookup table, which can be prohibitively large for a small microcontroller with a program memory size on the order of hundreds of bytes. A workaround to this is to use a simple permutation function instead of a table stored in program memory. However, using a too simple function, such as T[i] = 255-i, partly defeats the usability as a hash function as anagrams will result in the same hash value; using a too complex function, on the other hand, will affect speed negatively. Using a function rather than a table also allows extending the block size. Such functions naturally have to be bijective, like their table variants.

The algorithm can be described by the following pseudocode, which computes the hash of message C using the permutation table T:

algorithm pearson hashing is     h := 0      for each c in C loop         h := T[ h xor c ]     end loopreturn h

The hash variable (h) may be initialized differently, e.g. to the length of the data (C) modulo 256.

Example implementations

C#, 8-bit

publicclassPearsonHashing{publicbyteHash(stringinput){constbyte[]T={/* Permutation of 0-255 */};bytehash=0;byte[]bytes=Encoding.UTF8.GetBytes(input);foreach(varbinbytes){hash=T[(byte)(hash^b)];}returnhash;}}

Related Research Articles

The Advanced Encryption Standard (AES), also known by its original name Rijndael, is a specification for the encryption of electronic data established by the U.S. National Institute of Standards and Technology (NIST) in 2001.

In cryptography, a block cipher is a deterministic algorithm that operates on fixed-length groups of bits, called blocks. Block ciphers are the elementary building blocks of many cryptographic protocols. They are ubiquitous in the storage and exchange of data, where such data is secured and authenticated via encryption.

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

<span class="mw-page-title-main">HMAC</span> Computer communications hash algorithm

In cryptography, an HMAC is a specific type of message authentication code (MAC) involving a cryptographic hash function and a secret cryptographic key. As with any MAC, it may be used to simultaneously verify both the data integrity and authenticity of a message. An HMAC is a type of keyed hash function that can also be used in a key derivation scheme or a key stretching scheme.

In cryptography, RC4 is a stream cipher. While it is remarkable for its simplicity and speed in software, multiple vulnerabilities have been discovered in RC4, rendering it insecure. It is especially vulnerable when the beginning of the output keystream is not discarded, or when nonrandom or related keys are used. Particularly problematic uses of RC4 have led to very insecure protocols such as WEP.

In computer science, a lookup table (LUT) is an array that replaces runtime computation with a simpler array indexing operation, in a process termed as direct addressing. The savings in processing time can be significant, because retrieving a value from memory is often faster than carrying out an "expensive" computation or input/output operation. The tables may be precalculated and stored in static program storage, calculated as part of a program's initialization phase (memoization), or even stored in hardware in application-specific platforms. Lookup tables are also used extensively to validate input values by matching against a list of valid items in an array and, in some programming languages, may include pointer functions to process the matching input. FPGAs also make extensive use of reconfigurable, hardware-implemented, lookup tables to provide programmable hardware functionality. LUTs differ from hash tables in a way that, to retrieve a value $with key, a hash table would store the value in the slot where is a hash function i.e. is used to compute the slot, while in the case of LUT, the value is stored in slot, thus directly addressable.$

A cryptographic hash function (CHF) is a hash algorithm that has special properties desirable for a cryptographic application:

In cryptography, Tiger is a cryptographic hash function designed by Ross Anderson and Eli Biham in 1995 for efficiency on 64-bit platforms. The size of a Tiger hash value is 192 bits. Truncated versions can be used for compatibility with protocols assuming a particular hash size. Unlike the SHA-2 family, no distinguishing initialization values are defined; they are simply prefixes of the full Tiger/192 hash value.

In cryptography, confusion and diffusion are two properties of the operation of a secure cipher identified by Claude Shannon in his 1945 classified report A Mathematical Theory of Cryptography. These properties, when present, work together to thwart the application of statistics and other methods of cryptanalysis.

Fowler–Noll–Vo is a non-cryptographic hash function created by Glenn Fowler, Landon Curt Noll, and Kiem-Phong Vo.

SHA-2 is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001. They are built using the Merkle–Damgård construction, from a one-way compression function itself built using the Davies–Meyer structure from a specialized block cipher.

In cryptography, a one-way compression function is a function that transforms two fixed-length inputs into a fixed-length output. The transformation is "one-way", meaning that it is difficult given a particular output to compute inputs which compress to that output. One-way compression functions are not related to conventional data compression algorithms, which instead can be inverted exactly or approximately to the original data.

In cryptography, Galois/Counter Mode (GCM) is a mode of operation for symmetric-key cryptographic block ciphers which is widely adopted for its performance. GCM throughput rates for state-of-the-art, high-speed communication channels can be achieved with inexpensive hardware resources.

SHA-3 is the latest member of the Secure Hash Algorithm family of standards, released by NIST on August 5, 2015. Although part of the same series of standards, SHA-3 is internally different from the MD5-like structure of SHA-1 and SHA-2.

The following tables compare general and technical information for a number of cryptographic hash functions. See the individual functions' articles for further information. This article is not all-inclusive or necessarily up-to-date. An overview of hash function security/cryptanalysis can be found at hash function security summary.

bcrypt is a password-hashing function designed by Niels Provos and David Mazières, based on the Blowfish cipher and presented at USENIX in 1999. Besides incorporating a salt to protect against rainbow table attacks, bcrypt is an adaptive function: over time, the iteration count can be increased to make it slower, so it remains resistant to brute-force search attacks even with increasing computation power.

BLAKE is a cryptographic hash function based on Daniel J. Bernstein's ChaCha stream cipher, but a permuted copy of the input block, XORed with round constants, is added before each ChaCha round. Like SHA-2, there are two variants differing in the word size. ChaCha operates on a 4×4 array of words. BLAKE repeatedly combines an 8-word hash value with 16 message words, truncating the ChaCha result to obtain the next hash value. BLAKE-256 and BLAKE-224 use 32-bit words and produce digest sizes of 256 bits and 224 bits, respectively, while BLAKE-512 and BLAKE-384 use 64-bit words and produce digest sizes of 512 bits and 384 bits, respectively.

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.

Kupyna is a cryptographic hash function defined in the Ukrainian national standard DSTU 7564:2014. It was created to replace an obsolete GOST hash function defined in the old standard GOST 34.11-95, similar to Streebog hash function standardized in Russia.

Lyra2 is a password hashing scheme (PHS) that can also work as a key derivation function (KDF). It received a special recognition during the Password Hashing Competition in July 2015, which was won by Argon2. It is also used in proof-of-work algorithms such as Lyra2REv2, adopted by Vertcoin, MonaCoin, among other cryptocurrencies. Lyra2 was designed by Marcos A. Simplicio Jr., Leonardo C. Almeida, Ewerton R. Andrade, Paulo C. F. dos Santos, and Paulo S. L. M. Barreto from Escola Politécnica da Universidade de São Paulo. It is an improvement over Lyra, previously proposed by the same authors. Lyra2 preserves the security, efficiency and flexibility of its predecessor, including: (1) the ability to configure the desired amount of memory, processing time and parallelism to be used by the algorithm; and (2) the capacity of providing a high memory usage with a processing time similar to that obtained with scrypt. In addition, it brings the following improvements when compared to its predecessor:

References

↑ Pearson, Peter K. (June 1990), "Fast Hashing of Variable-Length Text Strings" (PDF), Communications of the ACM , 33 (6): 677, doi:10.1145/78973.78978, archived from the original (PDF) on 2012-07-04, retrieved 2013-07-13
↑ Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied Mathematics , 160 (4–5): 604–617, arXiv: 1008.1715 , doi:10.1016/j.dam.2011.11.009

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[acmref-1] Pearson, Peter K. (June 1990), "Fast Hashing of Variable-Length Text Strings" (PDF), Communications of the ACM , 33 (6): 677, doi:10.1145/78973.78978, archived from the original (PDF) on 2012-07-04, retrieved 2013-07-13

[univ-2] Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied Mathematics , 160 (4–5): 604–617, arXiv: 1008.1715 , doi:10.1016/j.dam.2011.11.009

[1]

[2]

Pearson hashing

Contents

Example implementations

C#, 8-bit

See also

Related Research Articles

References