Fuzzy hashing

Last updated

Fuzzy hashing, also known as similarity hashing, [1] is a technique for detecting data that is similar, but not exactly the same, as other data. This is in contrast to cryptographic hash functions, which are designed to have significantly different hashes for even minor differences. Fuzzy hashing has been used to identify malware [2] [3] and has potential for other applications, like data loss prevention and detecting multiple versions of code. [4] [5]

Contents

Background

A hash function is a mathematical algorithm which maps arbitrary-sized data to a fixed size output. Many solutions use cryptographic hash functions like SHA-256 to detect duplicates or check for known files within large collection of files. [4] However, cryptographic hash functions cannot be used for determining if a file is similar to a known file, because one of the requirements of a cryptographic hash function is that a small change to the input should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect) [6]

Fuzzy hashing exists to solve this problem of detecting data that is similar, but not exactly the same, as other data. Fuzzy hashing algorithms specifically use algorithms in which two similar inputs will generate two similar hash values. This property is the exact opposite of the avalanche effect desired in cryptographic hash functions.

Fuzzy hashing can also be used to detect when one object is contained within another. [1]

Approaches for fuzzy hashing

There are a few approaches used for building fuzzy hash algorithms: [7] [5]

Notable fuzzy hashing tools and algorithms

See also

Related Research Articles

<span class="mw-page-title-main">Checksum</span> Data used to detect errors in other data

A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4, and was specified in 1992 as RFC 1321.

<span class="mw-page-title-main">Cryptographic hash function</span> Hash function that is suitable for use in cryptography

A cryptographic hash function (CHF) is a hash algorithm that has special properties desirable for a cryptographic application:

<span class="mw-page-title-main">Key derivation function</span> Function that derives secret keys from a secret value

In cryptography, a key derivation function (KDF) is a cryptographic algorithm that derives one or more secret keys from a secret value such as a master key, a password, or a passphrase using a pseudorandom function. KDFs can be used to stretch keys into longer keys or to obtain keys of a required format, such as converting a group element that is the result of a Diffie–Hellman key exchange into a symmetric key for use with AES. Keyed cryptographic hash functions are popular examples of pseudorandom functions used for key derivation.

In cryptography, a message authentication code (MAC), sometimes known as an authentication tag, is a short piece of information used for authenticating and integrity-checking a message. In other words, to confirm that the message came from the stated sender and has not been changed. The MAC value allows verifiers to detect any changes to the message content.

Internet security is a branch of computer security. It encompasses the Internet, browser security, web site security, and network security as it applies to other applications or operating systems as a whole. Its objective is to establish rules and measures to use against attacks over the Internet. The Internet is an inherently insecure channel for information exchange, with high risk of intrusion or fraud, such as phishing, online viruses, trojans, ransomware and worms.

In cryptography, a collision attack on a cryptographic hash tries to find two inputs producing the same hash value, i.e. a hash collision. This is in contrast to a preimage attack where a specific target hash value is specified.

VoIP spam or SPIT is unsolicited, automatically dialed telephone calls, typically using voice over Internet Protocol (VoIP) technology.

Fuzzy clustering is a form of clustering in which each data point can belong to more than one cluster.

Cryptovirology refers to the study of cryptography use in malware, such as ransomware and asymmetric backdoors. Traditionally, cryptography and its applications are defensive in nature, and provide privacy, authentication, and security to users. Cryptovirology employs a twist on cryptography, showing that it can also be used offensively. It can be used to mount extortion based attacks that cause loss of access to information, loss of confidentiality, and information leakage, tasks which cryptography typically prevents.

<span class="mw-page-title-main">Cryptographic nonce</span> Concept in cryptography

In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication. It is often a random or pseudo-random number issued in an authentication protocol to ensure that old communications cannot be reused in replay attacks. They can also be useful as initialization vectors and in cryptographic hash functions.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

<span class="mw-page-title-main">Fingerprint (computing)</span> Digital identifier derived from the data by an algorithm

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes. This fingerprint may be used for data deduplication purposes. This is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting.

<span class="mw-page-title-main">Cryptography</span> Practice and study of secure communication techniques

Cryptography, or cryptology, is the practice and study of techniques for secure communication in the presence of adversarial behavior. More generally, cryptography is about constructing and analyzing protocols that prevent third parties or the public from reading private messages. Modern cryptography exists at the intersection of the disciplines of mathematics, computer science, information security, electrical engineering, digital signal processing, physics, and others. Core concepts related to information security are also central to cryptography. Practical applications of cryptography include electronic commerce, chip-based payment cards, digital currencies, computer passwords, and military communications.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

<span class="mw-page-title-main">Sponge function</span> Theory of cryptography

In cryptography, a sponge function or sponge construction is any of a class of algorithms with finite internal state that take an input bit stream of any length and produce an output bit stream of any desired length. Sponge functions have both theoretical and practical uses. They can be used to model or implement many cryptographic primitives, including cryptographic hashes, message authentication codes, mask generation functions, stream ciphers, pseudo-random number generators, and authenticated encryption.

Nilsimsa is an anti-spam focused locality-sensitive hashing algorithm originally proposed the cmeclax remailer operator in 2001 and then reviewed by Ernesto Damiani et al. in their 2004 paper titled, "An Open Digest-based Technique for Spam Detection". The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. In comparison with cryptographic hash functions such as SHA-1 or MD5, making a small modification to a document does not substantially change the resulting hash of the document. The paper suggests that the Nilsimsa satisfies three requirements:

  1. The digest identifying each message should not vary significantly (sic) for changes that can be produced automatically.
  2. The encoding must be robust against intentional attacks.
  3. The encoding should support an extremely low risk of false positives.

Perceptual hashing is the use of a fingerprinting algorithm that produces a snippet, hash, or fingerprint of various forms of multimedia. A perceptual hash is a type of locality-sensitive hash, which is analogous if features of the multimedia are similar. This is in contrast to cryptographic hashing, which relies on the avalanche effect of a small change in input value creating a drastic change in output value. Perceptual hash functions are widely used in finding cases of online copyright infringement as well as in digital forensics because of the ability to have a correlation between hashes so similar data can be found.

References

  1. 1 2 Breitinger, Frank (May 2014). "NIST Special Publication 800-168" (PDF). NIST Publications. doi:10.6028/NIST.SP.800-168 . Retrieved January 11, 2023.
  2. Pagani, Fabio; Dell'Amico, Matteo; Balzarotti, Davide (2018-03-13). "Beyond Precision and Recall" (PDF). Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. New York, NY, USA: ACM. pp. 354–365. doi:10.1145/3176258.3176306. ISBN   9781450356329 . Retrieved December 12, 2022.
  3. Sarantinos, Nikolaos; Benzaïd, Chafika; Arabiat, Omar (2016). "Forensic Malware Analysis: The Value of Fuzzy Hashing Algorithms in Identifying Similarities". 2016 IEEE Trustcom/BigDataSE/ISPA (PDF). pp. 1782–1787. doi:10.1109/TrustCom.2016.0274. ISBN   978-1-5090-3205-1. S2CID   32568938. 10.1109/TrustCom.2016.0274.
  4. 1 2 3 Kornblum, Jesse (2006). "Identifying almost identical files using context triggered piecewise hashing". Digital Investigation. 3, Supplement (September 2006): 91–97. doi: 10.1016/j.diin.2006.06.015 . Retrieved June 30, 2022.
  5. 1 2 Oliver, Jonathan; Cheng, Chun; Chen, Yanggui (2013). "TLSH -- A Locality Sensitive Hash" (PDF). 2013 Fourth Cybercrime and Trustworthy Computing Workshop. IEEE. pp. 7–13. doi:10.1109/ctc.2013.9. ISBN   978-1-4799-3076-0 . Retrieved December 12, 2022.
  6. Oliver, Jonathan; Hagen, Josiah (2021). "Designing the Elements of a Fuzzy Hashing Scheme" (PDF). 2021 IEEE 19th International Conference on Embedded and Ubiquitous Computing (EUC). IEEE. pp. 1–6. doi:10.1109/euc53437.2021.00028. ISBN   978-1-6654-0036-7. Archived from the original (PDF) on 14 April 2021. Retrieved 14 April 2021.
  7. "Open Source Similarity Digests DFRWS August 2016" (PDF). tlsh.org. Retrieved December 11, 2022.
  8. "spamsum README". samba.org. Retrieved December 11, 2022.
  9. "spamsum.c". samba.org. Retrieved December 11, 2022.
  10. Roussev, Vassil (2010). "Data Fingerprinting with Similarity Digests". Advances in Digital Forensics VI. IFIP Advances in Information and Communication Technology. Vol. 337. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 207–226. doi:10.1007/978-3-642-15506-2_15. ISBN   978-3-642-15505-5. ISSN   1868-4238.
  11. "Fast Clustering of High Dimensional Data Clustering the Malware Bazaar Dataset" (PDF). tlsh.org. Retrieved December 11, 2022.