Kneser–Ney Smoothing

Kneser–Ney smoothing, also known as Kneser-Essen-Ney smoothing, is a method primarily used to calculate the probability distribution of ''n''-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting, which subtracts a fixed value from the lower-order terms of the probability to omit ''n''-grams with lower frequencies. This approach has been considered equally effective for both higher- and lower-order ''n''-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney. A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying on only the unigram frequency to predict the frequencies of ''n''-grams leads to skewed results; however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to the possible words preceding it.

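This intuition can be made concrete with a small, purely illustrative Python sketch (the toy corpus and variable names below are assumptions for illustration, not taken from the original paper). It contrasts a word's raw unigram count with the number of distinct words that precede it, which is the quantity the Kneser–Ney lower-order term is built on:

from collections import Counter

# Hypothetical toy corpus: "francisco" and "blue" occur equally often,
# but "francisco" is only ever preceded by "san".
tokens = ("she lives in san francisco . he works in san francisco . "
          "they moved to san francisco . the sky is blue . "
          "water looks blue . my car was blue .").split()

distinct_bigrams = set(zip(tokens, tokens[1:]))
raw_count = Counter(tokens)
preceder_count = Counter(w2 for _, w2 in distinct_bigrams)  # distinct left neighbours of w2

for w in ("francisco", "blue"):
    print(w, raw_count[w], preceder_count[w])
# francisco 3 1   <- frequent, but a poor guess as a continuation in new contexts
# blue 3 3        <- equally frequent, and follows several different words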

Method

Let c(w, w') be the number of occurrences of the word w followed by the word w' in the corpus. The equation for bigram probabilities is as follows:

p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} p_{KN}(w_i)

where the unigram probability p_{KN}(w_i) depends on how likely it is to see the word w_i in an unfamiliar context, which is estimated as the number of distinct words it appears after, divided by the number of distinct pairs of consecutive words in the corpus:

p_{KN}(w_i) = \frac{\left|\{ w' : 0 < c(w', w_i) \}\right|}{\left|\{ (w', w'') : 0 < c(w', w'') \}\right|}

Note that p_{KN} is a proper distribution, as the values defined in the above way are non-negative and sum to one.

The parameter \delta is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The value of the normalizing constant \lambda_{w_{i-1}} is calculated to make the sum of conditional probabilities p_{KN}(w_i \mid w_{i-1}) over all w_i equal to one. Observe that (provided \delta < 1) for each w_i which occurs at least once in the context of w_{i-1} in the corpus we discount the probability by exactly the same constant amount \delta / \left( \sum_{w'} c(w_{i-1}, w') \right), so the total discount depends linearly on the number of unique words w_i that can occur after w_{i-1}. This total discount is a budget that can be spread over all p_{KN}(w_i \mid w_{i-1}) proportionally to p_{KN}(w_i). As the values of p_{KN}(w_i) sum to one, we can simply define \lambda_{w_{i-1}} to be equal to this total discount:

\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} \left|\{ w' : 0 < c(w_{i-1}, w') \}\right|

This equation can be extended to n-grams. Let w_{i-n+1}^{i-1} be the n-1 words before w_i:

p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \delta \frac{\left|\{ w' : 0 < c(w_{i-n+1}^{i-1}, w') \}\right|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} p_{KN}(w_i \mid w_{i-n+2}^{i-1})

This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The addition of the term for lower-order n-grams adds more weight to the overall probability when the count for the higher-order n-grams is zero. Similarly, the weight of the lower-order model decreases when the count of the n-gram is non-zero.
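The bigram form of the estimator above can be sketched directly from these counts. The following Python code is a minimal, hedged illustration: the discount value of 0.75, the toy corpus, and all function names are assumptions made here for demonstration, not part of the original formulation.

from collections import Counter

def kneser_ney_bigram_model(tokens, delta=0.75):
    # c(w, w') for every adjacent pair in the corpus
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    context_totals = Counter(w for w, _ in bigrams)        # sum_{w'} c(w_{i-1}, w')
    distinct = set(bigrams)
    continuation = Counter(w2 for _, w2 in distinct)       # |{w' : 0 < c(w', w_i)}|
    followers = Counter(w1 for w1, _ in distinct)          # |{w' : 0 < c(w_{i-1}, w')}|
    num_distinct_bigrams = len(distinct)                   # |{(w', w'') : 0 < c(w', w'')}|

    def p_continuation(word):
        # Lower-order (unigram) term p_KN(w_i)
        return continuation[word] / num_distinct_bigrams

    def p_kn(word, prev):
        # p_KN(w_i | w_{i-1}): absolute discounting plus interpolation.
        # Assumes `prev` was observed at least once in the corpus.
        total = context_totals[prev]
        discounted = max(bigram_counts[(prev, word)] - delta, 0) / total
        lam = (delta / total) * followers[prev]            # normalizing constant lambda_{w_{i-1}}
        return discounted + lam * p_continuation(word)

    return p_kn

p_kn = kneser_ney_bigram_model("i live in san francisco and i work in san jose".split())
print(p_kn("francisco", "san"))   # seen bigram: discounted count plus interpolated mass
print(p_kn("jose", "san"))

The discount delta moves probability mass from observed bigrams onto the continuation distribution, so unseen but plausible continuations still receive non-zero probability.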


Modified Kneser–Ney smoothing

Modifications of this method also exist. Chen and Goodman's 1998 paper lists and benchmarks several such modifications. Computational efficiency and scaling to multi-core systems are the focus of Chen and Goodman's 1998 modification. This approach was once used for Google Translate under a MapReduce implementation. KenLM is a performant open-source implementation.

