Good–Turing frequency estimation

Good–Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colors of the balls (finite but unknown in number). After drawing R_\text{red} red balls, R_\text{black} black balls and R_\text{green} green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen color.


Historical background

Good–Turing frequency estimation was developed by Alan Turing and his assistant I. J. Good as part of their methods used at Bletchley Park for cracking German ciphers for the Enigma machine during World War II. Turing at first modelled the frequencies as a multinomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator's accuracy. The discovery was recognized as significant when published by Good in 1953, but the calculations were difficult, so it was not used as widely as it might have been. The method even gained some literary fame due to the Robert Harris novel ''Enigma''. In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T to create and implement a simplified and easier-to-use variant of the Good–Turing method described below. Various heuristic justifications and a simple combinatorial derivation have been provided.


The method

The Good–Turing estimator is largely independent of the distribution of species frequencies.


Notation

* Assume that X distinct species have been observed, enumerated x = 1, \dots, X.
* Then the frequency vector, \bar{R}, has elements R_x that give the number of individuals that have been observed for species x.
* The frequency of frequencies vector, (N_r)_{r=1,2,\dots}, shows how many times the frequency r occurs in the vector \bar{R} (i.e. among the elements R_x):
:N_r = \bigl|\{ x : R_x = r \}\bigr| ~.
For example, N_1 is the number of species for which only one individual was observed. Note that the total number of objects observed, N, can be found from
:N = \sum_{r=1}^\infty r N_r = \sum_{x=1}^X R_x ~.
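
As a concrete illustration, here is a minimal Python sketch (the sample data and variable names are our own, not part of the method) that builds the frequency vector and the frequency-of-frequencies counts from a list of observations:

```python
from collections import Counter

# Hypothetical sample: each element is the species (color) of one observed ball.
observations = ["red", "black", "red", "green", "red", "black"]

# R[x] = number of individuals observed for species x (the frequency vector).
R = Counter(observations)

# Nr[r] = number of species observed exactly r times (frequency of frequencies).
Nr = Counter(R.values())

N = sum(R.values())  # total number of objects observed

print(R)   # Counter({'red': 3, 'black': 2, 'green': 1})
print(Nr)  # Counter({3: 1, 2: 1, 1: 1})  ->  N_1 = N_2 = N_3 = 1
print(N)   # 6, which equals the sum over r of r * N_r
```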


Calculation

The first step in the calculation is to estimate the probability that a future observed individual (or the next observed individual) is a member of a thus far unseen species. This estimate is:
:p_0 = \frac{N_1}{N} ~.
The next step is to estimate the probability that the next observed individual is from a species which has been seen r times. For a ''single'' species this estimate is:
:p_r = \frac{(r+1) \, S(N_{r+1})}{N \, S(N_r)} ~.
Here, the notation S(\cdot) means the ''smoothed'', or ''adjusted'' value of the frequency shown in parentheses. An overview of how to perform this smoothing follows in the next section (see also empirical Bayes method). To estimate the probability that the next observed individual is from any species from this group (i.e., the group of species seen r times) one can use the following formula:
:\frac{(r+1) \, S(N_{r+1})}{N} ~.
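
As an illustrative sketch (not the full method), the following Python function computes the unsmoothed estimates, using the raw counts N_r in place of the smoothed values S(N_r); the smoothing step that makes this usable in practice is described in the next section:

```python
from collections import Counter

def good_turing_unsmoothed(observations):
    """Unsmoothed Good-Turing estimates from a list of observed species.

    Returns (p0, p): p0 estimates the probability that the next individual
    belongs to an unseen species; p[r] estimates the probability that it is
    one *particular* species already seen r times.  Raw counts N_r stand in
    for the smoothed S(N_r), so p[r] is left undefined when N_{r+1} = 0.
    """
    R = Counter(observations)   # per-species counts
    Nr = Counter(R.values())    # frequency of frequencies
    N = sum(R.values())         # total number of observations

    p0 = Nr[1] / N              # p_0 = N_1 / N
    p = {}
    for r in sorted(Nr):
        if Nr[r + 1] > 0:       # without smoothing, gaps in N_r break the formula
            p[r] = (r + 1) * Nr[r + 1] / (N * Nr[r])
    return p0, p

p0, p = good_turing_unsmoothed(["red", "black", "red", "green", "red", "black"])
print(p0)  # 0.1666... = N_1 / N, chance the next ball has an unseen color
print(p)   # {1: 0.333..., 2: 0.5}  (per-species probabilities for r = 1, 2)
```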


Smoothing

For smoothing the erratic values in N_r for large r, we would like to make a plot of \log N_r versus \log r, but this is problematic because for large r many N_r will be zero. Instead a revised quantity, \log Z_r, is plotted versus \log r, where Z_r is defined as
:Z_r = \frac{N_r}{\tfrac{1}{2}(t - q)} ~,
and where q, r, and t are three consecutive subscripts with non-zero counts N_q, N_r, N_t. For the special case when r is 1, take q to be 0. In the opposite special case, when r = r_\text{last} is the index of the ''last'' non-zero count, replace the divisor \tfrac{1}{2}(t - q) with r_\text{last} - q, so
:Z_{r_\text{last}} = \frac{N_{r_\text{last}}}{r_\text{last} - q} ~.
A simple linear regression is then fitted to the log–log plot. For small values of r it is reasonable to set S(N_r) = N_r; that is, no smoothing is performed. For large values of r, values of S(N_r) are read off the regression line. An automatic procedure (not described here) can be used to specify at what point the switch from no smoothing to linear smoothing should take place. Code for the method is available in the public domain.
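
The Python sketch below follows the recipe above. It is a simplified assumption of our own, not the full Gale–Sampson procedure: it reads every smoothed value off the regression line (omitting the automatic switch rule) and assumes at least two distinct non-zero counts:

```python
import math

def smooth_counts(Nr):
    """Fit log Z_r = a + b * log r by least squares and return S, where
    S(r) is the smoothed value of N_r read off the regression line.
    Nr: dict mapping r -> N_r, containing only the non-zero counts.
    """
    rs = sorted(Nr)
    Z = {}
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0        # previous non-zero index (0 for the first)
        if i + 1 < len(rs):
            t = rs[i + 1]                    # next non-zero index
            Z[r] = Nr[r] / (0.5 * (t - q))   # Z_r = N_r / ((t - q)/2)
        else:
            Z[r] = Nr[r] / (r - q)           # special case: last non-zero count
    # Simple linear regression of log Z_r on log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(Z[r]) for r in rs]
    n = len(rs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    return lambda r: math.exp(intercept + slope * math.log(r))

# Hypothetical counts: N_6 = 0, so r = 5 and r = 7 are neighbouring indices.
S = smooth_counts({1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 7: 4})
print(S(6))  # an estimate for the missing N_6, read off the fitted line
```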


Derivation

Many different derivations of the above formula for p_r have been given. One of the simplest ways to motivate the formula is by assuming the next item will behave similarly to the previous item. The overall idea of the estimator is that currently we are seeing never-seen items at a certain frequency, seen-once items at a certain frequency, seen-twice items at a certain frequency, and so on. Our goal is to estimate just how likely each of these categories is, for the ''next'' item we will see. Put another way, we want to know the current rate at which seen-twice items are becoming seen-thrice items, and so on. Since we don't assume anything about the underlying probability distribution, it does sound a bit mysterious at first.

But it is extremely easy to calculate these probabilities ''empirically'' for the ''previous'' item we saw, even assuming we don't remember exactly which item that was: take all the items we have seen so far (including multiplicities); the last item we saw was a random one of these, all equally likely. Specifically, the chance that we saw an item for the (r+1)-th time is simply the chance that it was one of the items that we have now seen r+1 times, namely \frac{(r+1) \, N_{r+1}}{N}. In other words, our chance of seeing an item that had been seen r times before was \frac{(r+1) \, N_{r+1}}{N}.

So now we simply assume that this chance will be about the same for the next item we see. This immediately gives us the formula above for p_0, by setting r = 0. And for r > 0, to get the probability that ''a particular one'' of the N_r items is going to be the next one seen, we need to divide this probability (of seeing ''some'' item that has been seen r times) among the N_r possibilities for which particular item that could be. This gives us the formula
:p_r = \frac{(r+1) \, N_{r+1}}{N \, N_r} ~.
Of course, your actual data will probably be a bit noisy, so you will want to smooth the values first to get a better estimate of how quickly the category counts are growing, and this gives the formula as shown above. This approach is in the same spirit as deriving the standard Bernoulli estimator by simply asking what the two probabilities were for the previous coin flip (after scrambling the trials seen so far), given only the current result counts, while assuming nothing about the underlying distribution.
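
The key combinatorial step, that a uniformly random item from the sample belongs to a species seen exactly r+1 times with probability (r+1) N_{r+1} / N, can be checked directly. In the Python sketch below, the Zipf-like population and the sample size are arbitrary choices made for illustration:

```python
from collections import Counter
import random

random.seed(0)
# Hypothetical population: species i is drawn with weight proportional to 1/i.
species = list(range(1, 201))
weights = [1.0 / i for i in species]
sample = random.choices(species, weights=weights, k=1000)

R = Counter(sample)       # per-species counts
Nr = Counter(R.values())  # frequency of frequencies
N = len(sample)

# Exactly (r+1) * N_{r+1} of the N items belong to species seen r+1 times,
# so the fraction of such items always matches the formula:
for r in range(5):
    fraction = sum(1 for s in sample if R[s] == r + 1) / N
    assert fraction == (r + 1) * Nr[r + 1] / N
    print(r, fraction)    # r = 0 prints N_1 / N, i.e. the estimate p_0
```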


See also

* Ewens sampling formula
* Pseudocount



Bibliography

* David A. McAllester, Robert Schapire (2000). "On the Convergence Rate of Good–Turing Estimators". ''Proceedings of the Thirteenth Annual Conference on Computational Learning Theory'', pp. 1–6.
* David A. McAllester, Luis Ortiz (2003). "Concentration Inequalities for the Missing Mass and for Histogram Rule Error". ''Journal of Machine Learning Research'' 4, pp. 895–911.