Index of coincidence

Last updated August 03, 2024

In cryptography, coincidence counting is the technique (invented by William F. Friedman ^[1]) of putting two texts side-by-side and counting the number of times that identical letters appear in the same position in both texts. This count, either as a ratio of the total or normalized by dividing by the expected count for a random source model, is known as the index of coincidence, or IC for short.

Because letters in a natural language are not distributed evenly, the IC is higher for such texts than it would be for uniformly random text strings. What makes the IC especially useful is the fact that its value does not change if both texts are scrambled by the same single-alphabet substitution cipher, allowing a cryptanalyst to quickly detect that form of encryption.

Calculation

The index of coincidence provides a measure of how likely it is to draw two matching letters by randomly selecting two letters from a given text. The chance of drawing a given letter in the text is (number of times that letter appears / length of the text). The chance of drawing that same letter again (without replacement) is (appearances − 1 / text length − 1). The product of these two values gives you the chance of drawing that letter twice in a row. One can find this product for each letter that appears in the text, then sum these products to get a chance of drawing two of a kind. This probability can then be normalized by multiplying it by some coefficient, typically 26 in English.

\mathbf {IC} =c\times \left({\left({{\frac {n_{\mathrm {a} }}{N}}\times {\frac {n_{\mathrm {a} }-1}{N-1}}}\right)+\left({{\frac {n_{\mathrm {b} }}{N}}\times {\frac {n_{\mathrm {b} }-1}{N-1}}}\right)+\cdots +\left({{\frac {n_{\mathrm {z} }}{N}}\times {\frac {n_{\mathrm {z} }-1}{N-1}}}\right)}\right)

where c is the normalizing coefficient (26 for English), n_a is the number of times the letter "a" appears in the text, and N is the length of the text.

We can express the index of coincidence IC for a given letter-frequency distribution as a summation:

\mathbf {IC} ={\frac {\displaystyle \sum _{i=1}^{c}n_{i}(n_{i}-1)}{N(N-1)/c}}

where N is the length of the text and n₁ through n_c are the frequencies (as integers) of the c letters of the alphabet (c = 26 for monocase English). The sum of the n_i is necessarily N.

The products $n (n - 1)$ count the number of combinations of n elements taken two at a time. (Actually this counts each pair twice; the extra factors of 2 occur in both numerator and denominator of the formula and thus cancel out.) Each of the n_i occurrences of the i -th letter matches each of the remaining $n i - 1$ occurrences of the same letter. There are a total of $N (N - 1)$ letter pairs in the entire text, and 1/c is the probability of a match for each pair, assuming a uniform random distribution of the characters (the "null model"; see below). Thus, this formula gives the ratio of the total number of coincidences observed to the total number of coincidences that one would expect from the null model.^[2]

The expected average value for the IC can be computed from the relative letter frequencies $f i$ of the source language:

\mathbf {IC} _{\mathrm {expected} }={\frac {\displaystyle \sum _{i=1}^{c}{f_{i}}^{2}}{1/c}}.

If all $c$ letters of an alphabet were equally probable, the expected index would be 1.0. The actual monographic IC for telegraphic English text is around 1.73, reflecting the unevenness of natural-language letter distributions.

Sometimes values are reported without the normalizing denominator, for example $0.067 = 1.73/26$ for English; such values may be called κ_p ("kappa-plaintext") rather than IC, with κ_r ("kappa-random") used to denote the denominator $1/ c$ (which is the expected coincidence rate for a uniform distribution of the same alphabet, $0.0385=1/26$ for English). English plaintext will generally fall somewhere in the range of 1.5 to 2.0 (normalized calculation).^[3]

Application

The index of coincidence is useful both in the analysis of natural-language plaintext and in the analysis of ciphertext (cryptanalysis). Even when only ciphertext is available for testing and plaintext letter identities are disguised, coincidences in ciphertext can be caused by coincidences in the underlying plaintext. This technique is used to cryptanalyze the Vigenère cipher, for example. For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system.

Coincidence counting can help determine when two texts are written in the same language using the same alphabet. (This technique has been used to examine the purported Bible code). The causal coincidence count for such texts will be distinctly higher than the accidental coincidence count for texts in different languages, or texts using different alphabets, or gibberish texts.^{[ citation needed ]}

To see why, imagine an "alphabet" of only the two letters A and B. Suppose that in our "language", the letter A is used 75% of the time, and the letter B is used 25% of the time. If two texts in this language are laid side by side, then the following pairs can be expected:

Pair	Probability
AA	56.25%
BB	6.25%
AB	18.75%
BA	18.75%

Overall, the probability of a "coincidence" is 62.5% (56.25% for AA + 6.25% for BB).

Now consider the case when both messages are encrypted using the simple monoalphabetic substitution cipher which replaces A with B and vice versa:

Pair	Probability
AA	6.25%
BB	56.25%
AB	18.75%
BA	18.75%

The overall probability of a coincidence in this situation is 62.5% (6.25% for AA + 56.25% for BB), exactly the same as for the unencrypted "plaintext" case. In effect, the new alphabet produced by the substitution is just a uniform renaming of the original character identities, which does not affect whether they match.

Now suppose that only one message (say, the second) is encrypted using the same substitution cipher (A,B)→(B,A). The following pairs can now be expected:

Pair	Probability
AA	18.75%
BB	18.75%
AB	56.25%
BA	6.25%

Now the probability of a coincidence is only 37.5% (18.75% for AA + 18.75% for BB). This is noticeably lower than the probability when same-language, same-alphabet texts were used. Evidently, coincidences are more likely when the most frequent letters in each text are the same.

The same principle applies to real languages like English, because certain letters, like E, occur much more frequently than other letters—a fact which is used in frequency analysis of substitution ciphers. Coincidences involving the letter E, for example, are relatively likely. So when any two English texts are compared, the coincidence count will be higher than when an English text and a foreign-language text are used.

This effect can be subtle. For example, similar languages will have a higher coincidence count than dissimilar languages. Also, it is not hard to generate random text with a frequency distribution similar to real text, artificially raising the coincidence count. Nevertheless, this technique can be used effectively to identify when two texts are likely to contain meaningful information in the same language using the same alphabet, to discover periods for repeating keys, and to uncover many other kinds of nonrandom phenomena within or among ciphertexts.

Expected values for various languages^[4] are:

Language	Index of Coincidence
English	1.73
French	2.02
German	2.05
Italian	1.94
Portuguese	1.94
Russian	1.76
Spanish	1.94

Generalization

The above description is only an introduction to use of the index of coincidence, which is related to the general concept of correlation. Various forms of Index of Coincidence have been devised; the "delta" I.C. (given by the formula above) in effect measures the autocorrelation of a single distribution, whereas a "kappa" I.C. is used when matching two text strings.^[5] Although in some applications constant factors such as $c$ and $N$ can be ignored, in more general situations there is considerable value in truly indexing each I.C. against the value to be expected for the null hypothesis (usually: no match and a uniform random symbol distribution), so that in every situation the expected value for no correlation is 1.0. Thus, any form of I.C. can be expressed as the ratio of the number of coincidences actually observed to the number of coincidences expected (according to the null model), using the particular test setup.

From the foregoing, it is easy to see that the formula for kappa I.C. is

\mathbf {IC} ={\frac {\displaystyle \sum _{j=1}^{N}[a_{j}=b_{j}]}{N/c}},

where $N$ is the common aligned length of the two texts A and B, and the bracketed term is defined as 1 if the $j$ -th letter of text A matches the $j$ -th letter of text B, otherwise 0.

A related concept, the "bulge" of a distribution, measures the discrepancy between the observed I.C. and the null value of 1.0. The number of cipher alphabets used in a polyalphabetic cipher may be estimated by dividing the expected bulge of the delta I.C. for a single alphabet by the observed bulge for the message, although in many cases (such as when a repeating key was used) better techniques are available.

Example

As a practical illustration of the use of I.C., suppose that we have intercepted the following ciphertext message:

QPWKA LVRXC QZIKG RBPFA EOMFL  JMSDZ VDHXC XJYEB IMTRQ WNMEA IZRVK CVKVL XNEIC FZPZC ZZHKM  LVZVZ IZRRQ WDKEC HOSNY XXLSP MYKVQ XJTDC IOMEE XDQVS RXLRL  KZHOV

(The grouping into five characters is just a telegraphic convention and has nothing to do with actual word lengths.) Suspecting this to be an English plaintext encrypted using a Vigenère cipher with normal A–Z components and a short repeating keyword, we can consider the ciphertext "stacked" into some number of columns, for example seven:

QPWKALV RXCQZIK GRBPFAE OMFLJMS DZVDHXC XJYEBIM TRQWN…

If the key size happens to have been the same as the assumed number of columns, then all the letters within a single column will have been enciphered using the same key letter, in effect a simple Caesar cipher applied to a random selection of English plaintext characters. The corresponding set of ciphertext letters should have a roughness of frequency distribution similar to that of English, although the letter identities have been permuted (shifted by a constant amount corresponding to the key letter). Therefore, if we compute the aggregate delta I.C. for all columns ("delta bar"), it should be around 1.73. On the other hand, if we have incorrectly guessed the key size (number of columns), the aggregate delta I.C. should be around 1.00. So we compute the delta I.C. for assumed key sizes from one to ten:

Size	Delta-bar I.C.
1	1.12
2	1.19
3	1.05
4	1.17
5	1.82
6	0.99
7	1.00
8	1.05
9	1.16
10	2.07

We see that the key size is most likely five. If the actual size is five, we would expect a width of ten to also report a high I.C., since each of its columns also corresponds to a simple Caesar encipherment, and we confirm this. So we should stack the ciphertext into five columns:

QPWKA LVRXC QZIKG RBPFA EOMFL JMSDZ VDH…

We can now try to determine the most likely key letter for each column considered separately, by performing trial Caesar decryption of the entire column for each of the 26 possibilities A–Z for the key letter, and choosing the key letter that produces the highest correlation between the decrypted column letter frequencies and the relative letter frequencies for normal English text. That correlation, which we don't need to worry about normalizing, can be readily computed as

\mathbf {\chi } =\sum _{i=1}^{c}n_{i}f_{i}

where $n_{i}$ are the observed column letter frequencies and $f_{i}$ are the relative letter frequencies for English. When we try this, the best-fit key letters are reported to be "EVERY," which we recognize as an actual word, and using that for Vigenère decryption produces the plaintext:

MUSTC HANGE MEETI NGLOC ATION FROMB RIDGE TOUND ERPAS  SSINC EENEM YAGEN TSARE BELIE VEDTO HAVEB EENAS SIGNE  DTOWA TCHBR IDGES TOPME ETING TIMEU NCHAN GEDXX

from which one obtains:

MUST CHANGE MEETING LOCATION FROM BRIDGE TO UNDERPASS SINCE ENEMY AGENTS ARE BELIEVED TO HAVE BEEN ASSIGNED TO WATCH BRIDGE STOP  MEETING TIME UNCHANGED  XX

after word divisions have been restored at the obvious positions. "XX" are evidently "null" characters used to pad out the final group for transmission.

This entire procedure could easily be packaged into an automated algorithm for breaking such ciphers. Due to normal statistical fluctuation, such an algorithm will occasionally make wrong choices, especially when analyzing short ciphertext messages.

Related Research Articles

In cryptography, a substitution cipher is a method of encrypting in which units of plaintext are replaced with the ciphertext, in a defined manner, with the help of a key; the "units" may be single letters, pairs of letters, triplets of letters, mixtures of the above, and so forth. The receiver deciphers the text by performing the inverse substitution process to extract the original message.

In cryptography, a transposition cipher is a method of encryption which scrambles the positions of characters (transposition) without changing the characters themselves. Transposition ciphers reorder units of plaintext according to a regular system to produce a ciphertext which is a permutation of the plaintext. They differ from substitution ciphers, which do not change the position of units of plaintext but instead change the units themselves. Despite the difference between transposition and substitution operations, they are often combined, as in historical ciphers like the ADFGVX cipher or complex high-quality encryption methods like the modern Advanced Encryption Standard (AES).

In cryptography, a Caesar cipher, also known as Caesar's cipher, the shift cipher, Caesar's code, or Caesar shift, is one of the simplest and most widely known encryption techniques. It is a type of substitution cipher in which each letter in the plaintext is replaced by a letter some fixed number of positions down the alphabet. For example, with a left shift of 3, D would be replaced by A, E would become B, and so on. The method is named after Julius Caesar, who used it in his private correspondence.

<span class="mw-page-title-main">Vigenère cipher</span> Simple type of polyalphabetic encryption system

The Vigenère cipher is a method of encrypting alphabetic text where each letter of the plaintext is encoded with a different Caesar cipher, whose increment is determined by the corresponding letter of another text, the key.

In cryptography, unicity distance is the length of an original ciphertext needed to break the cipher by reducing the number of possible spurious keys to zero in a brute force attack. That is, after trying every possible key, there should be just one decipherment that makes sense, i.e. expected amount of ciphertext needed to determine the key completely, assuming the underlying message has redundancy.

An autokey cipher is a cipher that incorporates the message into the key. The key is generated from the message in some automated fashion, sometimes by selecting certain letters from the text or, more commonly, by adding a short primer key to the front of the message.

In cryptography, the tabula recta is a square table of alphabets, each row of which is made by shifting the previous one to the left. The term was invented by the German author and monk Johannes Trithemius in 1508, and used in his Trithemius cipher.

In cryptanalysis, frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers.

In classical cryptography, the running key cipher is a type of polyalphabetic substitution cipher in which a text, typically from a book, is used to provide a very long keystream. The earliest description of such a cipher was given in 1892 by French mathematician Arthur Joseph Hermann. Usually, the book to be used would be agreed ahead of time, while the passage to be used would be chosen randomly for each message and secretly indicated somewhere in the message.

The affine cipher is a type of monoalphabetic substitution cipher, where each letter in an alphabet is mapped to its numeric equivalent, encrypted using a simple mathematical function, and converted back to a letter. The formula used means that each letter encrypts to one other letter, and back again, meaning the cipher is essentially a standard substitution cipher with a rule governing which letter goes to which. As such, it has the weaknesses of all substitution ciphers. Each letter is enciphered with the function $(ax + b) mod 26$ , where $b$ is the magnitude of the shift.

In cryptography, the ADFGVX cipher was a manually applied field cipher used by the Imperial German Army during World War I. It was used to transmit messages secretly using wireless telegraphy. ADFGVX was in fact an extension of an earlier cipher called ADFGX which was first used on 1 March 1918 on the German Western Front. ADFGVX was applied from 1 June 1918 on both the Western Front and Eastern Front.

In cryptography, a classical cipher is a type of cipher that was used historically but for the most part, has fallen into disuse. In contrast to modern cryptographic algorithms, most classical ciphers can be practically computed and solved by hand. However, they are also usually very simple to break with modern technology. The term includes the simple systems used since Greek and Roman times, the elaborate Renaissance ciphers, World War II cryptography such as the Enigma machine and beyond.

In cryptanalysis, Kasiski examination is a method of attacking polyalphabetic substitution ciphers, such as the Vigenère cipher. It was first published by Friedrich Kasiski in 1863, but seems to have been independently discovered by Charles Babbage as early as 1846.

The trifid cipher is a classical cipher invented by Félix Delastelle and described in 1902. Extending the principles of Delastelle's earlier bifid cipher, it combines the techniques of fractionation and transposition to achieve a certain amount of confusion and diffusion: each letter of the ciphertext depends on three letters of the plaintext and up to three letters of the key.

The rail fence cipher is a classical type of transposition cipher. It derives its name from the manner in which encryption is performed, in analogy to a fence built with horizontal rails.

The four-square cipher is a manual symmetric encryption technique. It was invented by the French cryptographer Felix Delastelle.

The Two-square cipher, also called double Playfair, is a manual symmetric encryption technique. It was developed to ease the cumbersome nature of the large encryption/decryption matrix used in the four-square cipher while still being slightly stronger than the single-square Playfair cipher.

The Beaufort cipher, invented by some Giovanni Sestri in early 18th century but widely attributed to Sir Francis Beaufort, is a substitution cipher similar to the Vigenère cipher, with a slightly modified enciphering mechanism and tableau. Its most famous application was in a rotor-based cipher machine, the Hagelin M-209. The Beaufort cipher is based on the Beaufort square which is essentially the same as a Vigenère square but in reverse order starting with the letter "Z" in the first row, where the first row and the last column serve the same purpose.

The Alberti Cipher, created in 1467 by Italian architect Leon Battista Alberti, was one of the first polyalphabetic ciphers. In the opening pages of his treatise De componendis cifris he explained how his conversation with the papal secretary Leonardo Dati about a recently developed movable type printing press led to the development of his cipher wheel.

Turingery or Turing's method was a manual codebreaking method devised in July 1942 by the mathematician and cryptanalyst Alan Turing at the British Government Code and Cypher School at Bletchley Park during World War II. It was for use in cryptanalysis of the Lorenz cipher produced by the SZ40 and SZ42 teleprinter rotor stream cipher machines, one of the Germans' Geheimschreiber machines. The British codenamed non-Morse traffic "Fish", and that from this machine "Tunny".

References

↑
Friedman, William F. (1922). "The index of coincidence and its applications in cryptography." Department of Ciphers. Publ 22. Geneva, Illinois, USA: Riverbank Laboratories. OCLC 55786052. The original application ignored normalization.
- 2nd ed. (1935). Washington, D.C.: Signals Office, War Department.
- 3rd ed. (1996). Laguna Hills, Calif.: Aegean Park Press. ISBN 0894121383.
↑ Mountjoy, Marjorie (1963). "The Bar Statistics". NSA Technical Journal. VII (2, 4). Published in two parts.
↑ Kontou, Eleni (18 May 2020). "Index of Coincidence". University of Leicester Open Journals– via CORE.
↑ Friedman, W.F. and Callimahos, L.D. (1985) [1956]. Military Cryptanalytics, Part I – Volume 2. Reprinted by Aegean Park Press. ISBN 0-89412-074-3.{{cite book}}: CS1 maint: multiple names: authors list (link)
↑ Kahn, David (1996) [1967]. The Codebreakers - The Story of Secret Writing. New York: Macmillan. ISBN 0-684-83130-9.