Katz back-off
Katz back-off is a generative ''n''-gram language model that estimates the conditional probability of a word given its history in the ''n''-gram. It accomplishes this estimation by ''backing off'' through progressively shorter history models under certain conditions. By doing so, the model with the most reliable information about a given history is used to provide better results. The model was introduced in 1987 by Slava M. Katz. Before that, ''n''-gram language models were constructed by training individual models for different ''n''-gram orders using maximum likelihood estimation and then interpolating them together.


The method

The equation for Katz's back-off model is:

: P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \begin{cases} d_{w_{i-n+1} \cdots w_i} \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \text{if } C(w_{i-n+1} \cdots w_i) > k \\ \alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise} \end{cases}

where

: ''C''(''x'') = number of times ''x'' appears in training
: ''w''''i'' = ''i''th word in the given context

Essentially, this means that if the ''n''-gram has been seen more than ''k'' times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that ''n''-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (''n'' − 1)-gram.

The more difficult part is determining the values for ''k'', ''d'' and ''α''.

''k'' is the least important of the parameters. It is usually chosen to be 0. However, empirical testing may find better values for ''k''.

''d'' is typically the amount of discounting found by Good–Turing estimation. In other words, if Good–Turing estimation gives the discounted count C^* for an observed count C, then

: d = \frac{C^*}{C}

To compute \alpha, it is useful to first define a quantity \beta, which is the left-over probability mass for the (''n'' − 1)-gram:

: \beta_{w_{i-n+1} \cdots w_{i-1}} = 1 - \sum_{\{w_i : C(w_{i-n+1} \cdots w_i) > k\}} d_{w_{i-n+1} \cdots w_i} \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})}

Then the back-off weight, \alpha, is computed as follows:

: \alpha_{w_{i-n+1} \cdots w_{i-1}} = \dfrac{\beta_{w_{i-n+1} \cdots w_{i-1}}}{\sum_{\{w_i : C(w_{i-n+1} \cdots w_i) \leq k\}} P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1})}

The above formula applies only if there is data for the (''n'' − 1)-gram. If not, the algorithm skips ''n'' − 1 entirely and uses the Katz estimate for ''n'' − 2, and so on, until an ''n''-gram with data is found.


Discussion

This model generally works well in practice, but fails in some circumstances. For example, suppose that the bigram "a b" and the unigram "c" are very common, but the trigram "a b c" is never seen. Since "a b" and "c" are very common, it may be significant (that is, not due to chance) that "a b c" is never seen. Perhaps it is not allowed by the rules of the grammar. Instead of assigning a more appropriate value of 0, the method will back off to the bigram and estimate ''P''(''c'' | ''b''), which may be too high. (Manning and Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press, 1999, ISBN 978-0-262-13360-9.)
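The failure mode can be made concrete with toy counts; every number below, including the back-off weight, is invented purely for illustration:

```python
# Toy counts: suppose "a b" and "c" are common but "a b c" was never seen,
# perhaps because the grammar forbids it.
count_bc = 40    # count of bigram "b c"
count_b = 50     # count of unigram "b" -> P_ML(c | b) = 0.8 is high
alpha_ab = 0.2   # assumed back-off weight for the history "a b"

# Katz back-off estimates P(c | a b) = alpha * P(c | b) instead of ~0.
p_backoff = alpha_ab * (count_bc / count_b)  # ~0.16, far from the arguably correct 0
```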

