Nilsimsa Hash

Last updated

Nilsimsa is an anti-spam focused locality-sensitive hashing algorithm originally proposed the cmeclax remailer operator in 2001 [1] and then reviewed by Ernesto Damiani et al. in their 2004 paper titled, "An Open Digest-based Technique for Spam Detection". [2] The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. In comparison with cryptographic hash functions such as SHA-1 or MD5, making a small modification to a document does not substantially change the resulting hash of the document. The paper suggests that the Nilsimsa satisfies three requirements:

  1. The digest identifying each message should not vary significantly (sic) for changes that can be produced automatically.
  2. The encoding must be robust against intentional attacks.
  3. The encoding should support an extremely low risk of false positives.

Subsequent testing on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash. [3]

Nilsimsa similarity matching was taken in consideration by Jesse Kornblum when developing the fuzzy hashing in 2006, [4] that used the algorithms of spamsum by Andrew Tridgell (2002). [5]

Several implementations of Nilsimsa exist as open-source software. [6] [7] [8] [9] [10]

Related Research Articles

<span class="mw-page-title-main">Checksum</span> Data used to detect errors in other data

A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter-storage addressing.

In cryptography, SHA-1 is a hash function which takes an input and produces a 160-bit (20-byte) hash value known as a message digest – typically rendered as 40 hexadecimal digits. It was designed by the United States National Security Agency, and is a U.S. Federal Information Processing Standard. The algorithm has been cryptographically broken but is still widely used.

An anonymous remailer is a server that receives messages with embedded instructions on where to send them next, and that forwards them without revealing where they originally came from. There are cypherpunk anonymous remailers, mixmaster anonymous remailers, and nym servers, among others, which differ in how they work, in the policies they adopt, and in the type of attack on the anonymity of e-mail they can resist. Remailing as discussed in this article applies to e-mails intended for particular recipients, not the general public. Anonymity in the latter case is more easily addressed by using any of several methods of anonymous publication.

<span class="mw-page-title-main">Distributed hash table</span> Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

<span class="mw-page-title-main">Cryptographic hash function</span> Hash function that is suitable for use in cryptography

A cryptographic hash function (CHF) is a hash algorithm that has special properties desirable for a cryptographic application:

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.

<span class="mw-page-title-main">MD4</span> Cryptographic hash function

The MD4 Message-Digest Algorithm is a cryptographic hash function developed by Ronald Rivest in 1990. The digest length is 128 bits. The algorithm has influenced later designs, such as the MD5, SHA-1 and RIPEMD algorithms. The initialism "MD" stands for "Message Digest".

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.

IP traceback is any method for reliably determining the origin of a packet on the Internet. The IP protocol does not provide for the authentication of the source IP address of an IP packet, enabling the source address to be falsified in a strategy called IP address spoofing, and creating potential internet security and stability problems.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

<span class="mw-page-title-main">Fingerprint (computing)</span> Digital identifier derived from the data by an algorithm

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes. This fingerprint may be used for data deduplication purposes. This is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

crypt is a POSIX C library function. It is typically used to compute the hash of user account passwords. The function outputs a text string which also encodes the salt, and identifies the hash algorithm used. This output string forms a password record, which is usually stored in a text file.

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC system.

Fuzzy hashing, also known as similarity hashing, is a technique for detecting data that is similar, but not exactly the same, as other data. This is in contrast to cryptographic hash functions, which are designed to have significantly different hashes for even minor differences. Fuzzy hashing has been used to identify malware and has potential for other applications, like data loss prevention and detecting multiple versions of code.

References

  1. cmeclax remailer operator (10 February 2002). "Nilsimsa v.0.2.4". Archived from the original on 7 July 2005. Retrieved 23 February 2014.
  2. Damiani; et al. (2004). "An Open Digest-based Technique for Spam Detection" (PDF). Retrieved 2013-09-01.
  3. Oliver; et al. (2013). "TLSH - A Locality Sensitive Hash". 4th Cybercrime and Trustworthy Computing Workshop. Retrieved 2015-06-04.
  4. Jesse Kornblum (15 May 2008). "The Fuzzy Hashing Patent". LiveJournal. Archived from the original on 7 May 2016. Retrieved 23 February 2014.
  5. Jesse Kornblum (2006). "Identifying almost identical files using context triggered piecewise hashing" (PDF). DFRWS. Retrieved 23 February 2014.
  6. "py-nilsimsa - Python port of Nilsimsa locality-sensitive hash". github.com. Retrieved 2016-11-08.
  7. "Nilsimsa". Nilsimsa.rubyforge.org. Archived from the original on 2013-06-15. Retrieved 2013-09-01.
  8. "Digest::Nilsimsa". metacpan.org. Retrieved 2013-09-01.
  9. "golang nilsimsa - implements nilsimsa fuzzy hash by cmeclax". hersensch.im. Retrieved 2018-04-25.
  10. "node-nilsimsa - Node.JS port of Nilsimsa locality-sensitive hash". github.com. Retrieved 2023-09-09.