Kneser–Ney Smoothing

Kneser–Ney smoothing, also known as Kneser-Essen-Ney smoothing, is a method primarily used to calculate the probability distribution of ''n''-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting, which subtracts a fixed value from the lower-order terms of the probability to omit ''n''-grams with lower frequencies. This approach has been considered equally effective for both higher- and lower-order ''n''-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney. A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying on only the unigram frequency to predict the frequencies of ''n''-grams leads to skewed results; however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to the possible words preceding it.

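This intuition can be made concrete with a small, purely illustrative Python sketch (the toy corpus and variable names below are assumptions for illustration, not taken from the original paper). It contrasts a word's raw unigram count with the number of distinct words that precede it, which is the quantity the Kneser–Ney lower-order term is built on:

from collections import Counter

# Hypothetical toy corpus: "francisco" and "blue" occur equally often,
# but "francisco" is only ever preceded by "san".
tokens = ("she lives in san francisco . he works in san francisco . "
          "they moved to san francisco . the sky is blue . "
          "water looks blue . my car was blue .").split()

distinct_bigrams = set(zip(tokens, tokens[1:]))
raw_count = Counter(tokens)
preceder_count = Counter(w2 for _, w2 in distinct_bigrams)  # distinct left neighbours of w2

for w in ("francisco", "blue"):
    print(w, raw_count[w], preceder_count[w])
# francisco 3 1   <- frequent, but a poor guess as a continuation in new contexts
# blue 3 3        <- equally frequent, and follows several different words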

Method

Let c(w, w') be the number of occurrences of the word w followed by the word w' in the corpus. The equation for bigram probabilities is as follows:

p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} p_{KN}(w_i)

where the unigram probability p_{KN}(w_i) depends on how likely it is to see the word w_i in an unfamiliar context, which is estimated as the number of distinct words it appears after, divided by the number of distinct pairs of consecutive words in the corpus:

p_{KN}(w_i) = \frac{\left|\{ w' : 0 < c(w', w_i) \}\right|}{\left|\{ (w', w'') : 0 < c(w', w'') \}\right|}

Note that p_{KN} is a proper distribution, as the values defined in the above way are non-negative and sum to one.

The parameter \delta is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The value of the normalizing constant \lambda_{w_{i-1}} is calculated to make the sum of conditional probabilities p_{KN}(w_i \mid w_{i-1}) over all w_i equal to one. Observe that (provided \delta < 1) for each w_i which occurs at least once in the context of w_{i-1} in the corpus we discount the probability by exactly the same constant amount \delta / \left( \sum_{w'} c(w_{i-1}, w') \right), so the total discount depends linearly on the number of unique words w_i that can occur after w_{i-1}. This total discount is a budget that can be spread over all p_{KN}(w_i \mid w_{i-1}) proportionally to p_{KN}(w_i). As the values of p_{KN}(w_i) sum to one, we can simply define \lambda_{w_{i-1}} to be equal to this total discount:

\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} \left|\{ w' : 0 < c(w_{i-1}, w') \}\right|

This equation can be extended to n-grams. Let w_{i-n+1}^{i-1} be the n-1 words before w_i:

p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \delta \frac{\left|\{ w' : 0 < c(w_{i-n+1}^{i-1}, w') \}\right|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} p_{KN}(w_i \mid w_{i-n+2}^{i-1})

This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The addition of the term for lower-order n-grams adds more weight to the overall probability when the count for the higher-order n-grams is zero. Similarly, the weight of the lower-order model decreases when the count of the n-gram is non-zero.
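The bigram form of the estimator above can be sketched directly from these counts. The following Python code is a minimal, hedged illustration: the discount value of 0.75, the toy corpus, and all function names are assumptions made here for demonstration, not part of the original formulation.

from collections import Counter

def kneser_ney_bigram_model(tokens, delta=0.75):
    # c(w, w') for every adjacent pair in the corpus
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    context_totals = Counter(w for w, _ in bigrams)        # sum_{w'} c(w_{i-1}, w')
    distinct = set(bigrams)
    continuation = Counter(w2 for _, w2 in distinct)       # |{w' : 0 < c(w', w_i)}|
    followers = Counter(w1 for w1, _ in distinct)          # |{w' : 0 < c(w_{i-1}, w')}|
    num_distinct_bigrams = len(distinct)                   # |{(w', w'') : 0 < c(w', w'')}|

    def p_continuation(word):
        # Lower-order (unigram) term p_KN(w_i)
        return continuation[word] / num_distinct_bigrams

    def p_kn(word, prev):
        # p_KN(w_i | w_{i-1}): absolute discounting plus interpolation.
        # Assumes `prev` was observed at least once in the corpus.
        total = context_totals[prev]
        discounted = max(bigram_counts[(prev, word)] - delta, 0) / total
        lam = (delta / total) * followers[prev]            # normalizing constant lambda_{w_{i-1}}
        return discounted + lam * p_continuation(word)

    return p_kn

p_kn = kneser_ney_bigram_model("i live in san francisco and i work in san jose".split())
print(p_kn("francisco", "san"))   # seen bigram: discounted count plus interpolated mass
print(p_kn("jose", "san"))

The discount delta moves probability mass from observed bigrams onto the continuation distribution, so unseen but plausible continuations still receive non-zero probability.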


Modified Kneser–Ney smoothing

Modifications of this method also exist. Chen and Goodman's 1998 paper lists and benchmarks several such modifications. Computational efficiency and scaling to multi-core systems are the focus of Chen and Goodman's 1998 modification. This approach was once used for Google Translate under a MapReduce implementation. KenLM is a performant open-source implementation.

