''n''-gram
In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called a Q-gram) is a contiguous sequence of ''n'' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The ''n''-grams typically are collected from a text or speech corpus. When the items are words, ''n''-grams may also be called ''shingles''.

Using Latin numerical prefixes, an ''n''-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on. In computational biology, a polymer or oligomer of a known size is called a ''k''-mer instead of an ''n''-gram, with specific names using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc.


Applications

An ''n''-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (''n'' − 1)-order Markov model. ''n''-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two benefits of ''n''-gram models (and algorithms that use them) are simplicity and scalability – with larger ''n'', a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.


Examples

Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences. Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google ''n''-gram corpus.

3-grams
* ceramics collectables collectibles (55)
* ceramics collectables fine (130)
* ceramics collected by (52)
* ceramics collectible pottery (50)
* ceramics collectibles cooking (45)

4-grams
* serve as the incoming (92)
* serve as the incubator (99)
* serve as the independent (794)
* serve as the index (223)
* serve as the indication (72)
* serve as the indicator (120)
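The extraction itself is mechanical. As a rough illustration (a sketch, not part of the original corpus documentation), the following Python snippet generates word-level and character-level ''n''-grams like the ones above:

<syntaxhighlight lang="python">
def ngrams(items, n):
    """Return all contiguous n-grams over a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

text = "the dog smelled like a skunk"

# Word-level trigrams ("shingles"):
print(ngrams(text.split(), 3))
# [('the', 'dog', 'smelled'), ('dog', 'smelled', 'like'), ...]

# Character-level 3-grams, counting the space character as a gram:
print(["".join(g) for g in ngrams("good morning", 3)])
# ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', ...]
</syntaxhighlight>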


''n''-gram models

An ''n''-gram model models sequences, notably natural languages, using the statistical properties of ''n''-grams. This idea can be traced to an experiment in Claude Shannon's work on information theory. Shannon posed the question: given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size ''n'': ''a'' = 0.4, ''b'' = 0.00001, ''c'' = 0, ...; where the probabilities of all possible "next letters" sum to 1.0. More concisely, an ''n''-gram model predicts x_i based on x_{i-(n-1)}, \dots, x_{i-1}. In probability terms, this is P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1}).

When used for language modeling, independence assumptions are made so that each word depends only on the last ''n'' − 1 words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of estimating the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together. Note that in a simple ''n''-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.), can be described as following a categorical distribution (often imprecisely called a "multinomial distribution"). In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or ''n''-grams; see smoothing techniques.


Applications and considerations

''n''-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an ''n''-gram distribution. For parsing, words are modeled such that each ''n''-gram is composed of ''n'' words. For language identification, sequences of characters/graphemes (''e.g.'', letters of the alphabet) are modeled for different languages.

For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth, counting the space character as a gram (sometimes the beginning and end of a text are modeled explicitly, adding "__g", "_go", "ng_", and "g__"). For sequences of words, the trigrams (shingles) that can be generated from "the dog smelled like a skunk" are "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #".

Practitioners more interested in multiple-word terms might preprocess strings to remove spaces. Many simply collapse whitespace to a single space while preserving paragraph marks, because the whitespace is frequently either an element of writing style or introduces layout or presentation not required by the prediction and deduction methodology. Punctuation is also commonly reduced or removed by preprocessing and is frequently used to trigger functionality.

''n''-grams can also be used for sequences of words or almost any type of data. For example, they have been used for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from. They have also been very successful as the first pass in genetic sequence search and in the identification of the species from which short sequences of DNA originated.

''n''-gram models are often criticized because they lack any explicit representation of long-range dependency. This is because the only explicit dependency range is (''n'' − 1) tokens for an ''n''-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), an ''n''-gram model cannot in principle distinguish unbounded dependencies from noise (since long-range correlations drop exponentially with distance for any Markov model). For this reason, ''n''-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.

Another criticism that has been made is that Markov models of language, including ''n''-gram models, do not explicitly capture the performance/competence distinction. This is because ''n''-gram models are not designed to model linguistic knowledge as such, and make no claims to being (even potentially) complete models of linguistic knowledge; instead, they are used in practical applications.

In practice, ''n''-gram models have been shown to be extremely effective in modeling language data, which is a core component in modern statistical language applications. Most modern applications that rely on ''n''-gram based models, such as machine translation applications, do not rely exclusively on such models; instead, they typically also incorporate Bayesian inference. Modern statistical models are typically made up of two parts: a prior distribution describing the inherent likelihood of a possible result, and a likelihood function used to assess the compatibility of a possible result with observed data. When a language model is used, it is used as part of the prior distribution (e.g., to gauge the inherent "goodness" of a possible translation), and even then it is often not the only component in this distribution. Handcrafted features of various sorts are also used, for example variables that represent the position of a word in a sentence or the general topic of discourse. In addition, features based on the structure of the potential result, such as syntactic considerations, are often used. Such features are also used as part of the likelihood function, which makes use of the observed data. Conventional linguistic theory can be incorporated in these features (although in practice, it is rare that features specific to generative or other particular theories of grammar are incorporated, as computational linguists tend to be "agnostic" towards individual theories of grammar).
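For instance, the language-identification use mentioned above can be sketched by comparing character-trigram profiles; everything below (the sample texts, the profile size, the overlap score) is an assumption for illustration, not a production method:

<syntaxhighlight lang="python">
from collections import Counter

def profile(text, n=3, top=50):
    """Top character n-grams of a text, as a set (a crude language profile)."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {g for g, _ in grams.most_common(top)}

samples = {  # assumed toy training texts; real systems use large corpora
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
profiles = {lang: profile(text) for lang, text in samples.items()}

def identify(text):
    """Pick the language whose trigram profile overlaps the text's most."""
    p = profile(text)
    return max(profiles, key=lambda lang: len(p & profiles[lang]))

print(identify("the dog sat over there"))   # 'english' on this toy data
</syntaxhighlight>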


Out-of-vocabulary words

An issue when using ''n''-gram language models is out-of-vocabulary (OOV) words. They are encountered in computational linguistics and natural language processing when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the ''n''-grams in the corpus that contain an out-of-vocabulary word are ignored. The ''n''-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed. Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g., ''<unk>'') into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special token before ''n''-gram counts are accumulated. With this option, it is possible to estimate the transition probabilities of ''n''-grams involving out-of-vocabulary words.
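A minimal sketch of this replacement step, assuming ''<unk>'' as the special token and a toy fixed vocabulary:

<syntaxhighlight lang="python">
from collections import Counter

UNK = "<unk>"  # assumed spelling of the special token
vocabulary = {"the", "dog", "smelled", "like", "a"}   # assumed fixed vocabulary

tokens = "the dog smelled like a skunk".split()
mapped = [t if t in vocabulary else UNK for t in tokens]
print(mapped)           # ['the', 'dog', 'smelled', 'like', 'a', '<unk>']

# Bigram counts now include transitions involving the OOV token:
print(Counter(zip(mapped, mapped[1:])))
</syntaxhighlight>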


''n''-grams for approximate matching

''n''-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of ''n''-grams, it can be embedded in a vector space, thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into single-character 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. For example, both the strings "abc" and "bca" give rise to exactly the same 2-gram "bc" (although "abc" is clearly not the same string as "bca"). However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. Other metrics have also been applied to vectors of ''n''-grams with varying, sometimes better, results. For example, z-scores have been used to compare documents by examining how many standard deviations each ''n''-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector). In the event of small counts, the g-score (also known as g-test) may give better results for comparing alternative models. It is also possible to take a more principled approach to the statistics of ''n''-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference. ''n''-gram-based searching can also be used for plagiarism detection.
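A minimal sketch of the vector-space embedding described above, using character 3-grams and cosine similarity (the strings compared are illustrative assumptions):

<syntaxhighlight lang="python">
import math
from collections import Counter

def trigram_vector(s):
    """Bag of character 3-grams of a string, as a sparse count vector."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

a = trigram_vector("the quick brown fox")
b = trigram_vector("the quick brown foxes")
c = trigram_vector("lorem ipsum dolor sit")

print(cosine(a, b))   # close to 1.0: the strings share most trigrams
print(cosine(a, c))   # close to 0.0: almost no trigrams in common
</syntaxhighlight>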


Other applications

''n''-grams find use in several areas of computer science, computational linguistics, and applied mathematics. They have been used to:
* design kernels that allow machine learning algorithms such as support vector machines to learn from string data
* find likely candidates for the correct spelling of a misspelled word (a rough sketch of this idea follows the list)
* improve compression in compression algorithms where a small area of data requires ''n''-grams of greater length
* assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), intelligent character recognition (ICR), machine translation and similar applications
* improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
* improve retrieval performance in genetic sequence analysis as in the BLAST family of programs
* identify the language a text is in or the species a small sequence of DNA was taken from
* predict letters or words at random in order to create text, as in the dissociated press algorithm
* aid in cryptanalysis
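As one illustration, the spelling-candidate idea can be sketched by ranking dictionary words by character-bigram overlap; the dictionary and the Jaccard scoring below are assumptions for illustration, not a standard spell-checker design:

<syntaxhighlight lang="python">
def bigrams(word):
    """Set of character bigrams, with '#' marking the word boundaries."""
    w = f"#{word}#"
    return {w[i:i + 2] for i in range(len(w) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

dictionary = ["ceramics", "collectables", "collectibles", "pottery", "cooking"]

def candidates(misspelled, k=3):
    """The k dictionary words sharing the most bigrams with the input."""
    target = bigrams(misspelled)
    return sorted(dictionary, key=lambda w: -jaccard(target, bigrams(w)))[:k]

print(candidates("colectables"))   # 'collectables' should rank first
</syntaxhighlight>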


Space required for an ''n''-gram

Consider an ''n''-gram where the units are characters and a text with ''t'' characters, where n, t \in \mathbb{N}. There are t - n + 1 total strings, where each string requires n units of space. Thus, the total space required for this ''n''-gram is (t - n + 1) \cdot n, which expands to -n^2 + (t+1)n. For example, a 10-character text contains 10 − 3 + 1 = 8 character trigrams, occupying 8 · 3 = 24 units of space.


Bias-versus-variance trade-off

To choose a value for ''n'' in an ''n''-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness: larger ''n'' captures more context but yields sparser, less stable counts. This means that a trigram model (i.e., triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.


Smoothing techniques

There are problems of balancing weight between ''infrequent grams'' (for example, if a proper name appeared in the training data) and ''frequent grams''. Also, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.

In practice it is necessary to ''smooth'' the probability distributions by also assigning non-zero probabilities to unseen words or ''n''-grams. The reason is that models derived directly from the ''n''-gram frequency counts have severe problems when confronted with any ''n''-grams that have not explicitly been seen before – the zero-frequency problem. Various smoothing methods are used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen ''n''-grams; see rule of succession) to more sophisticated models, such as Good–Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the ''n''-grams and using Bayesian inference to compute the resulting posterior ''n''-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations. Common methods include:

* Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram)
* Good–Turing discounting
* Witten–Bell discounting
* Lidstone's smoothing
* Katz's back-off model (trigram)
* Kneser–Ney smoothing
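A minimal sketch of the simplest of these, add-one (Laplace) smoothing for a bigram model, on an assumed toy corpus:

<syntaxhighlight lang="python">
from collections import Counter

# Add-one (Laplace) smoothing: every possible bigram gets a pseudocount
# of 1, so unseen n-grams receive a small non-zero probability.

corpus = "the dog smelled like a skunk".split()
vocab = set(corpus)
V = len(vocab)

bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])

def smoothed_prob(prev, word):
    """P(word | prev) with a pseudocount of 1 for every vocabulary word."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + V)

print(smoothed_prob("the", "dog"))     # seen bigram: (1 + 1) / (1 + 6)
print(smoothed_prob("dog", "skunk"))   # unseen bigram: small but non-zero
</syntaxhighlight>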


Skip-gram

In the field of computational linguistics, in particular language modeling, skip-grams are a generalization of ''n''-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are ''skipped'' over. They provide one way of overcoming the data sparsity problem found with conventional ''n''-gram analysis. In the area of computer security, skip-grams have proven more robust to attack than ''n''-grams.

Formally, an ''n''-gram is a consecutive subsequence of length ''n'' of some sequence of tokens w_1, \dots, w_m. A ''k''-skip-''n''-gram is a length-''n'' subsequence where the components occur at distance at most ''k'' from each other. For example, in the input text:

:''the rain in Spain falls mainly on the plain''

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

:''the in'', ''rain Spain'', ''in falls'', ''Spain mainly'', ''falls on'', ''mainly the'', and ''on plain''.
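A minimal sketch of ''k''-skip-''n''-gram extraction, following one common reading of the definition above (windowing details vary across authors):

<syntaxhighlight lang="python">
from itertools import combinations

def skipgrams(tokens, n, k):
    """All n-grams allowing up to k total skipped tokens after the anchor."""
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i + 1:i + n + k]          # next n - 1 + k tokens
        for rest in combinations(window, n - 1):  # keep their relative order
            grams.add((tokens[i],) + rest)
    return grams

sentence = "the rain in Spain falls mainly on the plain".split()
for g in sorted(skipgrams(sentence, 2, 1)):
    print(" ".join(g))
# prints every ordinary bigram plus 'the in', 'rain Spain', 'in falls', ...
</syntaxhighlight>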


Syntactic ''n''-grams

Syntactic ''n''-grams are ''n''-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text. For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic ''n''-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial. Syntactic ''n''-grams are intended to reflect syntactic structure more faithfully than linear ''n''-grams, and have many of the same applications, especially as features in a vector space model. For certain tasks, syntactic ''n''-grams give better results than standard ''n''-grams, for example, for authorship attribution. Another type of syntactic ''n''-grams are part-of-speech ''n''-grams, defined as fixed-length contiguous overlapping subsequences that are extracted from part-of-speech sequences of text. Part-of-speech ''n''-grams have several applications, most commonly in information retrieval.
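A minimal sketch of extracting syntactic bigrams from a dependency tree; the parse below is hand-specified for the example sentence (an assumed analysis, since no parser is run here):

<syntaxhighlight lang="python">
# "economic news has little effect on financial markets"
words = ["economic", "news", "has", "little", "effect", "on",
         "financial", "markets"]
heads = [1, 2, None, 4, 2, 4, 7, 5]   # head index of each word; None = root

def syntactic_bigrams(words, heads):
    """Head-dependent pairs following tree arcs rather than linear order."""
    return [(words[h], words[d])
            for d, h in enumerate(heads) if h is not None]

for head, dep in syntactic_bigrams(words, heads):
    print(f"{head}-{dep}")
# news-economic, has-news, effect-little, has-effect, effect-on,
# markets-financial, on-markets
</syntaxhighlight>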


See also

* Collocation
* Hidden Markov model
* ''n''-tuple
* String kernel
* MinHash
* Feature extraction
* Longest common substring problem




Further reading

* Christopher D. Manning, Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press, 1999.
* Frederick J. Damerau, ''Markov Models and Linguistic Theory'', Mouton, The Hague, 1971.


External links


* Google's Google Books ''n''-gram viewer (September 2006)
* Microsoft's web ''n''-grams service
* STATOPERATOR N-grams Project: weighted ''n''-gram viewer for every domain in the Alexa Top 1M
* 1,000,000 most frequent 2-, 3-, 4- and 5-grams from the 425-million-word Corpus of Contemporary American English
* Peachnote's music ngram viewer
* Stochastic Language Models (''n''-Gram) Specification (W3C)
* Michael Collins's notes on ''n''-gram language models
* OpenRefine: Clustering In Depth