Trigram

Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for statistical analysis of texts and in cryptography for the control and use of ciphers and codes. See https://www3.nd.edu/~busiforc/handouts/cryptography/Letter%20Frequencies.html for an analysis of letter frequencies in the English language.

Frequency

Context is very important: different analysis rankings and percentages are easily derived from different sample sizes, different authors, different document types (poetry, science fiction, technical documentation), and different writing levels (stories for children versus adults, military orders, recipes).

Typical cryptanalytic frequency analysis finds that the 16 most common character-level trigrams in English are: [1] [2]

Rank [1]    Trigram    Frequency [3] (different source)
1           the        1.81%
2           and        0.73%
3           tha        0.33%
4           ent        0.42%
5           ing        0.72%
6           ion        0.42%
7           tio        0.31%
8           for        0.34%
9           nde
10          has
11          nce
12          edt
13          tis
14          oft        0.22%
15          sth        0.21%
16          men

Because encrypted messages sent by telegraph often omit punctuation and spaces, cryptographic frequency analysis of such messages includes trigrams that straddle word boundaries. This causes trigrams such as "edt" to occur frequently, even though such a trigram may never appear within any single word of those messages. [4]
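
The following is a minimal sketch of this kind of count in Python (standard library only; the sample string and the helper name trigram_frequencies are illustrative, not taken from the sources above). Spaces and punctuation are stripped first, so trigrams that straddle word boundaries are counted as well:

    from collections import Counter

    def trigram_frequencies(text):
        """Count character-level trigrams, ignoring case, spaces and punctuation,
        so that boundary-straddling trigrams such as "edt" are included."""
        stream = "".join(c for c in text.lower() if c.isalpha())
        return Counter(stream[i:i + 3] for i in range(len(stream) - 2))

    # Toy sample; a large English corpus would give figures closer to the table above.
    counts = trigram_frequencies("the quick red fox jumps over the lazy brown dog")
    total = sum(counts.values())
    for trigram, count in counts.most_common(5):
        print(trigram, f"{count / total:.2%}")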

Examples

The sentence "the quick red fox jumps over the lazy brown dog" has the following word-level trigrams:

  the quick red
  quick red fox
  red fox jumps
  fox jumps over
  jumps over the
  over the lazy
  the lazy brown
  lazy brown dog

And the word-level trigram "the quick red" has the following character-level trigrams (where an underscore "_" marks a space):

the, he_, e_q, _qu, qui, uic, ick, ck_, k_r, _re, red
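
Both lists can be reproduced with a short Python sketch (standard library only; the function names word_trigrams and char_trigrams are illustrative):

    def word_trigrams(sentence):
        """Return the word-level trigrams of a sentence as strings of three words."""
        words = sentence.split()
        return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

    def char_trigrams(text):
        """Return the character-level trigrams, marking each space with an underscore."""
        stream = text.replace(" ", "_")
        return [stream[i:i + 3] for i in range(len(stream) - 2)]

    print(word_trigrams("the quick red fox jumps over the lazy brown dog"))
    # ['the quick red', 'quick red fox', ..., 'lazy brown dog']
    print(char_trigrams("the quick red"))
    # ['the', 'he_', 'e_q', '_qu', 'qui', 'uic', 'ick', 'ck_', 'k_r', '_re', 'red']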

Related Research Articles

<span class="mw-page-title-main">Cipher</span> Algorithm for encrypting and decrypting information

In cryptography, a cipher is an algorithm for performing encryption or decryption—a series of well-defined steps that can be followed as a procedure. An alternative, less common term is encipherment. To encipher or encode is to convert information into cipher or code. In common parlance, "cipher" is synonymous with "code", as they are both a set of steps that encrypt a message; however, the concepts are distinct in cryptography, especially classical cryptography.

The ellipsis (...) is a series of dots that indicates an intentional omission of a word, sentence, or whole section from a text without altering its original meaning. The plural is ellipses. The term originates from the Ancient Greek ἔλλειψις (élleipsis), meaning 'leave out'.

<span class="mw-page-title-main">HMAC</span> Computer communications hash algorithm

In cryptography, an HMAC is a specific type of message authentication code (MAC) involving a cryptographic hash function and a secret cryptographic key. As with any MAC, it may be used to simultaneously verify both the data integrity and authenticity of a message. An HMAC is a type of keyed hash function that can also be used in a key derivation scheme or a key stretching scheme.
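
As a hedged illustration (using Python's standard hmac and hashlib modules; the key and message below are made up for the example), computing and verifying an HMAC-SHA256 tag looks like this:

    import hashlib
    import hmac

    key = b"example-secret-key"        # illustrative key, not from any standard
    message = b"the quick red fox"

    # The tag depends on both the message and the secret key.
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()

    # Verification recomputes the tag and compares in constant time.
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    print(hmac.compare_digest(tag, expected))  # True if integrity and authenticity hold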

In cryptography, a substitution cipher is a method of encrypting in which units of plaintext are replaced with ciphertext in a defined manner, with the help of a key; the "units" may be single letters, pairs of letters, triplets of letters, mixtures of the above, and so forth. The receiver deciphers the text by performing the inverse substitution process to extract the original message.
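
A minimal sketch of a monoalphabetic letter substitution in Python (the 26-letter cipher alphabet below is an arbitrary illustrative key, not one drawn from the article):

    import string

    plain_alphabet = string.ascii_lowercase
    cipher_alphabet = "qwertyuiopasdfghjklzxcvbnm"   # illustrative key: a permutation of a-z

    encrypt_table = str.maketrans(plain_alphabet, cipher_alphabet)
    decrypt_table = str.maketrans(cipher_alphabet, plain_alphabet)

    ciphertext = "the quick red fox".translate(encrypt_table)
    print(ciphertext)                           # each letter substituted; spaces pass through
    print(ciphertext.translate(decrypt_table))  # the inverse substitution recovers the plaintext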

<span class="mw-page-title-main">ROT13</span> Simple encryption method

ROT13 is a simple letter substitution cipher that replaces a letter with the 13th letter after it in the Latin alphabet. ROT13 is a special case of the Caesar cipher, which was developed in ancient Rome.
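
For instance, Python ships a built-in 'rot_13' text codec; a small hedged sketch rather than anything specific to the article:

    import codecs

    encoded = codecs.encode("the quick red fox", "rot_13")
    print(encoded)                           # 'gur dhvpx erq sbk'
    print(codecs.decode(encoded, "rot_13"))  # applying ROT13 twice restores the original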

<span class="mw-page-title-main">Caesar cipher</span> Simple and widely known encryption technique

In cryptography, a Caesar cipher, also known as Caesar's cipher, the shift cipher, Caesar's code, or Caesar shift, is one of the simplest and most widely known encryption techniques. It is a type of substitution cipher in which each letter in the plaintext is replaced by a letter some fixed number of positions down the alphabet. For example, with a left shift of 3, D would be replaced by A, E would become B, and so on. The method is named after Julius Caesar, who used it in his private correspondence.
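
A small sketch of the shift in Python (the function name caesar is illustrative; shift=3 reproduces the left shift of 3 described above):

    import string

    def caesar(text, shift):
        """Replace each letter with the one 'shift' positions earlier in the alphabet."""
        alphabet = string.ascii_lowercase
        shifted = "".join(alphabet[(i - shift) % 26] for i in range(26))
        table = str.maketrans(alphabet + alphabet.upper(), shifted + shifted.upper())
        return text.translate(table)

    print(caesar("DEF", 3))   # 'ABC': with a left shift of 3, D becomes A
    print(caesar("ABC", -3))  # 'DEF': the opposite shift decrypts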

A lipogram is a kind of constrained writing or word game consisting of writing paragraphs or longer works in which a particular letter or group of letters is avoided. Extended Ancient Greek texts avoiding the letter sigma are the earliest examples of lipograms.

Lexical tokenization is the conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In the case of a natural language, those categories include nouns, verbs, adjectives, and punctuation; in the case of a programming language, they include identifiers, operators, grouping symbols, and data types. Lexical tokenization is not the same process as the probabilistic tokenization used in a large language model's data preprocessing, which encodes text into numerical tokens using byte pair encoding.
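
A hedged sketch of a lexer for a toy expression language in Python (the token categories and regular expressions are illustrative only, not those of any particular tool):

    import re

    # Illustrative token categories for a toy expression language.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("IDENTIFIER", r"[A-Za-z_]\w*"),
        ("OPERATOR", r"[+\-*/=]"),
        ("GROUPING", r"[()]"),
        ("SKIP", r"\s+"),
    ]
    TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(source):
        """Yield (category, text) pairs for each lexical token in the source string."""
        for match in TOKEN_RE.finditer(source):
            if match.lastgroup != "SKIP":
                yield match.lastgroup, match.group()

    print(list(tokenize("count = (count + 3)")))
    # [('IDENTIFIER', 'count'), ('OPERATOR', '='), ('GROUPING', '('), ...]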

A passphrase is a sequence of words or other text used to control access to a computer system, program or data. It is similar to a password in usage, but a passphrase is generally longer for added security. Passphrases are often used to control both access to, and the operation of, cryptographic programs and systems, especially those that derive an encryption key from a passphrase. The origin of the term is by analogy with password. The modern concept of passphrases is believed to have been invented by Sigmund N. Porter in 1982.

<span class="mw-page-title-main">Frequency analysis</span> Study of the frequency of letters or groups of letters in a ciphertext

In cryptanalysis, frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers.

<span class="mw-page-title-main">Letter case</span> Uppercase or lowercase

Letter case is the distinction between the letters that are in larger uppercase or capitals and smaller lowercase in the written representation of certain languages. The writing systems that distinguish between upper- and lowercase have two parallel sets of letters: each letter in the majuscule set has a counterpart in the minuscule set. Some counterpart letters have the same shape, and differ only in size, but for others the shapes are different. The two case variants are alternative representations of the same letter: they have the same name and pronunciation and are typically treated identically when sorting in alphabetical order.

HAVAL is a cryptographic hash function. Unlike MD5, but like most modern cryptographic hash functions, HAVAL can produce hashes of different lengths – 128 bits, 160 bits, 192 bits, 224 bits, and 256 bits. HAVAL also allows users to specify the number of rounds to be used to generate the hash. HAVAL was broken in 2004.

<span class="mw-page-title-main">Filler text</span> Text generated to fill space or provide unremarkable and/or standardised text

Filler text is text that shares some characteristics of a real written text, but is random or otherwise generated. It may be used to display a sample of fonts, generate text for testing, or to spoof an e-mail spam filter. The process of using filler text is sometimes called greeking, although the text itself may be nonsense, or largely Latin, as in Lorem ipsum.

Letter frequency is the number of times letters of the alphabet appear on average in written language. Letter frequency analysis dates back to the Arab mathematician Al-Kindi, who formally developed the method to break ciphers. Letter frequency analysis gained importance in Europe with the development of movable type in 1450 AD, since printers had to estimate the amount of type required for each letterform. Linguists use letter frequency analysis as a rudimentary technique for language identification, where it is particularly effective as an indication of whether an unknown writing system is alphabetic, syllabic, or ideographic.

<span class="mw-page-title-main">RadioGatún</span> Cryptographic hash primitive

RadioGatún is a cryptographic hash primitive created by Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. It was first publicly presented at the NIST Second Cryptographic Hash Workshop, held in Santa Barbara, California, on August 24–25, 2006, as part of the NIST hash function competition. The same team that developed RadioGatún went on to make considerable revisions to this cryptographic primitive, leading to the Keccak SHA-3 algorithm.

JH is a cryptographic hash function submitted to the NIST hash function competition by Hongjun Wu. Though chosen as one of the five finalists of the competition, in 2012 JH ultimately lost to NIST hash candidate Keccak. JH has a 1024-bit state, and works on 512-bit input blocks. Processing an input block consists of three steps:

  1. XOR the input block into the left half of the state.
  2. Apply a 42-round unkeyed permutation (encryption function) to the state. This consists of 42 repetitions of:
    1. Break the input into 256 4-bit blocks, and map each through one of two 4-bit S-boxes, the choice being made by a 256-bit round-dependent key schedule. Equivalently, combine each input block with a key bit, and map the result through a 5→4 bit S-box.
    2. Mix adjacent 4-bit blocks using a maximum distance separable code over GF(2^4).
    3. Permute 4-bit blocks so that they will be adjacent to different blocks in following rounds.
  3. XOR the input block into the right half of the state.

The Jenkins hash functions are a collection of (non-cryptographic) hash functions for multi-byte keys designed by Bob Jenkins. The first one was formally published in 1997.

<span class="mw-page-title-main">The quick brown fox jumps over the lazy dog</span> Sentence containing all letters of the English alphabet

"The quick brown fox jumps over the lazy dog" is an English-language pangram – a sentence that contains all the letters of the alphabet. The phrase is commonly used for touch-typing practice, testing typewriters and computer keyboards, displaying examples of fonts, and other applications involving text where the use of all letters in the alphabet is desired.

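The pangram property is easy to check programmatically; a minimal Python sketch (nothing article-specific):

    import string

    sentence = "The quick brown fox jumps over the lazy dog"
    is_pangram = set(string.ascii_lowercase) <= set(sentence.lower())
    print(is_pangram)  # True: every letter a-z appears at least once
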
Streebog is a cryptographic hash function defined in the Russian national standard GOST R 34.11-2012 Information Technology – Cryptographic Information Security – Hash Function. It was created to replace an obsolete GOST hash function defined in the old standard GOST R 34.11-94, and as an asymmetric reply to the SHA-3 competition by the US National Institute of Standards and Technology. The function is also described in RFC 6986 and is one of the hash functions in ISO/IEC 10118-3:2018.

Kupyna is a cryptographic hash function defined in the Ukrainian national standard DSTU 7564:2014. It was created to replace an obsolete GOST hash function defined in the old standard GOST 34.11-95, similar to the Streebog hash function standardized in Russia.

References

  1. Lewand, Robert (2000). Cryptological Mathematics. The Mathematical Association of America. p. 37. ISBN 978-0-88385-719-9.
  2. Linton, Tom (2001). "Relative Frequencies of Letters in General English Plain text". Central College. Cryptography (Spring ed.). Archived from the original on January 22, 2007.
  3. "English Letter Frequencies". Practical Cryptography.
  4. "Voice Search SEO". Fuelonline.