cryptography Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adve ...

, coincidence counting is the technique (invented by

William F. Friedman William Frederick Friedman (September 24, 1891 – November 12, 1969) was a US Army cryptographer who ran the research division of the Army's Signal Intelligence Service (SIS) in the 1930s, and parts of its follow-on services into the 1950s. In ...

) of putting two texts side-by-side and counting the number of times that identical letters appear in the same position in both texts. This count, either as a ratio of the total or normalized by dividing by the expected count for a random source model, is known as the index of coincidence, or IC for short. Because letters in a natural language are not distributed evenly, the IC is higher for such texts than it would be for uniformly random text strings. What makes the IC especially useful is the fact that its value does not change if both texts are scrambled by the same single-alphabet

substitution cipher In cryptography, a substitution cipher is a method of encrypting in which units of plaintext are replaced with the ciphertext, in a defined manner, with the help of a key; the "units" may be single letters (the most common), pairs of letters, tri ...

, allowing a cryptanalyst to quickly detect that form of encryption.

Calculation

The index of coincidence provides a measure of how likely it is to draw two matching letters by randomly selecting two letters from a given text. The chance of drawing a given letter in the text is (number of times that letter appears / length of the text). The chance of drawing that same letter again (without replacement) is (appearances − 1 / text length − 1). The product of these two values gives you the chance of drawing that letter twice in a row. One can find this product for each letter that appears in the text, then sum these products to get a chance of drawing two of a kind. This probability can then be normalized by multiplying it by some coefficient, typically 26 in English. :

\mathbf = c \times \left(\right)

where ''c'' is the normalizing coefficient (26 for English), ''n''_a is the number of times the letter "a" appears in the text, and ''N'' is the length of the text. We can express the index of coincidence IC for a given letter-frequency distribution as a summation: :

\mathbf = \frac

where ''N'' is the length of the text and ''n''₁ through ''n_c'' are the

frequencies Frequency is the number of occurrences of a repeating event per unit of time. It is also occasionally referred to as ''temporal frequency'' for clarity, and is distinct from ''angular frequency''. Frequency is measured in hertz (Hz) which is eq ...

(as integers) of the ''c'' letters of the alphabet (''c'' = 26 for monocase

English English usually refers to: * English language * English people English may also refer to: Peoples, culture, and language * ''English'', an adjective for something of, from, or related to England ** English national ide ...

). The sum of the ''n_i'' is necessarily ''N''. The products count the number of combinations of ''n'' elements taken two at a time. (Actually this counts each pair twice; the extra factors of 2 occur in both numerator and denominator of the formula and thus cancel out.) Each of the ''n_i'' occurrences of the ''i'' -th letter matches each of the remaining occurrences of the same letter. There are a total of letter pairs in the entire text, and 1/''c'' is the probability of a match for each pair, assuming a uniform

random In common usage, randomness is the apparent or actual lack of pattern or predictability in events. A random sequence of events, symbols or steps often has no order and does not follow an intelligible pattern or combination. Individual rand ...

distribution of the characters (the "null model"; see below). Thus, this formula gives the ratio of the total number of coincidences observed to the total number of coincidences that one would expect from the null model. The expected average value for the IC can be computed from the relative letter frequencies of the source language: :

\mathbf_ = \frac.

If all letters of an alphabet were equally probable, the expected index would be 1.0. The actual monographic IC for

telegraph Telegraphy is the long-distance transmission of messages where the sender uses symbolic codes, known to the recipient, rather than a physical exchange of an object bearing the message. Thus flag semaphore is a method of telegraphy, whereas p ...

ic English text is around 1.73, reflecting the unevenness of natural-language letter distributions. Sometimes values are reported without the normalizing denominator, for example for English; such values may be called ''κ''_p ("kappa-plaintext") rather than IC, with ''κ''_r ("kappa-random") used to denote the denominator (which is the expected coincidence rate for a uniform distribution of the same alphabet, for English). English plaintext will generally fall somewhere in the range of 1.5 to 2.0 (normalized calculation).

Application

The index of coincidence is useful both in the analysis of natural-language

plaintext In cryptography, plaintext usually means unencrypted information pending input into cryptographic algorithms, usually encryption algorithms. This usually refers to data that is transmitted or stored unencrypted. Overview With the advent of comp ...

and in the analysis of ciphertext (

cryptanalysis Cryptanalysis (from the Greek ''kryptós'', "hidden", and ''analýein'', "to analyze") refers to the process of analyzing information systems in order to understand hidden aspects of the systems. Cryptanalysis is used to breach cryptographic sec ...

). Even when only ciphertext is available for testing and plaintext letter identities are disguised, coincidences in ciphertext can be caused by coincidences in the underlying plaintext. This technique is used to cryptanalyze the

Vigenère cipher The Vigenère cipher () is a method of encrypting alphabetic text by using a series of interwoven Caesar ciphers, based on the letters of a keyword. It employs a form of polyalphabetic substitution. First described by Giovan Battista Bellas ...

, for example. For a repeating-key

polyalphabetic cipher A polyalphabetic cipher substitution, using multiple substitution alphabets. The Vigenère cipher is probably the best-known example of a polyalphabetic cipher, though it is a simplified special case. The Enigma machine is more complex but is st ...

arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system. Coincidence counting can help determine when two texts are written in the same language using the same

alphabet An alphabet is a standardized set of basic written graphemes (called letter (alphabet), letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character ...

. (This technique has been used to examine the purported

Bible code The Bible code ( he, הצופן התנ"כי, ), also known as the Torah code, is a purported set of encoded words within a Hebrew text of the Torah that, according to proponents, has predicted significant historical events. The statistical like ...

). The ''causal'' coincidence count for such texts will be distinctly higher than the ''accidental'' coincidence count for texts in different languages, or texts using different alphabets, or gibberish texts. To see why, imagine an "alphabet" of only the two letters A and B. Suppose that in our "language", the letter A is used 75% of the time, and the letter B is used 25% of the time. If two texts in this language are laid side by side, then the following pairs can be expected: Overall, the probability of a "coincidence" is 62.5% (56.25% for AA + 6.25% for BB). Now consider the case when ''both'' messages are encrypted using the simple monoalphabetic

which replaces A with B and vice versa: The overall probability of a coincidence in this situation is 62.5% (6.25% for AA + 56.25% for BB), exactly the same as for the unencrypted "plaintext" case. In effect, the new alphabet produced by the substitution is just a uniform renaming of the original character identities, which does not affect whether they match. Now suppose that only ''one'' message (say, the second) is encrypted using the same substitution cipher (A,B)→(B,A). The following pairs can now be expected: Now the probability of a coincidence is only 37.5% (18.75% for AA + 18.75% for BB). This is noticeably lower than the probability when same-language, same-alphabet texts were used. Evidently, coincidences are more likely when the most frequent letters in each text are the same. The same principle applies to real languages like English, because certain letters, like E, occur much more frequently than other letters—a fact which is used in

frequency analysis In cryptanalysis, frequency analysis (also known as counting letters) is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers. Frequency analysis is based on th ...

s. Coincidences involving the letter E, for example, are relatively likely. So when any two English texts are compared, the coincidence count will be higher than when an English text and a foreign-language text are used. It can easily be imagined that this effect can be subtle. For example, similar languages will have a higher coincidence count than dissimilar languages. Also, it is not hard to generate random text with a frequency distribution similar to real text, artificially raising the coincidence count. Nevertheless, this technique can be used effectively to identify when two texts are likely to contain meaningful information in the same language using the same alphabet, to discover periods for repeating keys, and to uncover many other kinds of nonrandom phenomena within or among ciphertexts. Expected values for various languages are:

Generalization

The above description is only an introduction to use of the index of coincidence, which is related to the general concept of correlation. Various forms of Index of Coincidence have been devised; the "delta" I.C. (given by the formula above) in effect measures the

autocorrelation Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations of a random variable as ...

of a single distribution, whereas a "kappa" I.C. is used when matching two text strings. Although in some applications constant factors such as

c

and

N

can be ignored, in more general situations there is considerable value in truly ''indexing'' each I.C. against the value to be expected for the

null hypothesis In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is d ...

(usually: no match and a uniform random symbol distribution), so that in every situation the

expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a ...

for no correlation is 1.0. Thus, any form of I.C. can be expressed as the ratio of the number of coincidences actually observed to the number of coincidences expected (according to the null model), using the particular test setup. From the foregoing, it is easy to see that the formula for kappa I.C. is :

\mathbf = \frac,

where

N

is the common aligned length of the two texts ''A'' and ''B'', and the bracketed term is defined as 1 if the

j

-th letter of text ''A'' matches the

j

-th letter of text ''B'', otherwise 0. A related concept, the "bulge" of a distribution, measures the discrepancy between the observed I.C. and the null value of 1.0. The number of cipher alphabets used in a

may be estimated by dividing the expected bulge of the delta I.C. for a single alphabet by the observed bulge for the message, although in many cases (such as when a repeating key was used) better techniques are available.

Example

As a practical illustration of the use of I.C., suppose that we have intercepted the following ciphertext message:

QPWKA LVRXC QZIKG RBPFA EOMFL  JMSDZ VDHXC XJYEB IMTRQ WNMEA
IZRVK CVKVL XNEIC FZPZC ZZHKM  LVZVZ IZRRQ WDKEC HOSNY XXLSP
MYKVQ XJTDC IOMEE XDQVS RXLRL  KZHOV

(The grouping into five characters is just a

telegraphic Telegraphy is the long-distance transmission of messages where the sender uses symbolic codes, known to the recipient, rather than a physical exchange of an object bearing the message. Thus flag semaphore is a method of telegraphy, whereas p ...

convention and has nothing to do with actual word lengths.) Suspecting this to be an English plaintext encrypted using a

with normal A–Z components and a short repeating keyword, we can consider the ciphertext "stacked" into some number of columns, for example seven:

QPWKALV
RXCQZIK
GRBPFAE
OMFLJMS
DZVDHXC
XJYEBIM
TRQWN…

If the key size happens to have been the same as the assumed number of columns, then all the letters within a single column will have been enciphered using the same key letter, in effect a simple Caesar cipher applied to a random selection of English plaintext characters. The corresponding set of ciphertext letters should have a roughness of frequency distribution similar to that of English, although the letter identities have been permuted (shifted by a constant amount corresponding to the key letter). Therefore, if we compute the aggregate delta I.C. for all columns ("delta bar"), it should be around 1.73. On the other hand, if we have incorrectly guessed the key size (number of columns), the aggregate delta I.C. should be around 1.00. So we compute the delta I.C. for assumed key sizes from one to ten: We see that the key size is most likely five. If the actual size is five, we would expect a width of ten to also report a high I.C., since each of its columns also corresponds to a simple Caesar encipherment, and we confirm this. So we should stack the ciphertext into five columns:

QPWKA
LVRXC
QZIKG
RBPFA
EOMFL
JMSDZ
VDH…

We can now try to determine the most likely key letter for each column considered separately, by performing trial Caesar decryption of the entire column for each of the 26 possibilities A–Z for the key letter, and choosing the key letter that produces the highest correlation between the decrypted column letter frequencies and the relative

letter frequencies Letter frequency is the number of times letters of the alphabet appear on average in written language. Letter frequency analysis dates back to the Arab mathematician Al-Kindi (c. 801–873 AD), who formally developed the method to brea ...

for normal English text. That correlation, which we don't need to worry about normalizing, can be readily computed as :

\mathbf = \sum_^n_i f_i

where

n_i

are the observed column letter frequencies and

f_i

are the relative letter frequencies for English. When we try this, the best-fit key letters are reported to be "EVERY," which we recognize as an actual word, and using that for Vigenère decryption produces the plaintext:

MUSTC HANGE MEETI NGLOC ATION FROMB RIDGE TOUND ERPAS 
SSINC EENEM YAGEN TSARE BELIE VEDTO HAVEB EENAS SIGNE 
DTOWA TCHBR IDGES TOPME ETING TIMEU NCHAN GEDXX

from which one obtains:

MUST CHANGE MEETING LOCATION FROM BRIDGE TO UNDERPASS
SINCE ENEMY AGENTS ARE BELIEVED TO HAVE BEEN ASSIGNED
TO WATCH BRIDGE STOP  MEETING TIME UNCHANGED  XX

after word divisions have been restored at the obvious positions. "XX" are evidently "null" characters used to pad out the final group for transmission. This entire procedure could easily be packaged into an automated algorithm for breaking such ciphers. Due to normal statistical fluctuation, such an algorithm will occasionally make wrong choices, especially when analyzing short ciphertext messages.

Calculation

Application

Generalization

Example

References

See also