HOME

TheInfoList



OR:

Collocation extraction is the task of using a computer to extract
collocation In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words th ...
s automatically from a
corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...
. The traditional method of performing collocation extraction is to find a formula based on the statistical quantities of those words to calculate a score associated to every word pairs. Proposed formulas are
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
,
t-test A ''t''-test is any statistical hypothesis testing, statistical hypothesis test in which the test statistic follows a Student's t-distribution, Student's ''t''-distribution under the null hypothesis. It is most commonly applied when the test stati ...
, z test,
chi-squared test A chi-squared test (also chi-square or test) is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variable ...
and
likelihood ratio The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood functi ...
. Within the area of
corpus linguistics Corpus linguistics is the study of language, study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feas ...
, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance. 'Crystal clear', 'middle management', 'nuclear family', and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a
compound noun A compound is a word composed of more than one free morpheme. The English language, like many others, uses compounds frequently. English compounds may be classified in several ways, such as the word classes or the semantic relationship of their ...
, for example 'riding boots' or 'motor cyclist'.


See also

*
Collocational restriction Collocational restriction is a linguistic term used in morphology. The term refers to the fact that in certain two-word phrases the meaning of an individual word is restricted to that particular phrase (cf. idiom). For instance: the adjective ''dry ...
*
Collostructional analysis Collostructional analysis is a family of methods developed by (in alphabetical order) Stefan Th. Gries (University of California, Santa Barbara) and Anatol Stefanowitsch (Free University of Berlin). Collostructional analysis aims at measuring the ...
*
Compound noun, adjective and verb In linguistics, a compound is a lexeme (less precisely, a word or sign) that consists of more than one stem. Compounding, composition or nominal composition is the process of word formation that creates compound lexemes. Compounding occurs whe ...
*
Phrasal verb In the traditional grammar of Modern English, a phrasal verb typically constitutes a single semantic unit composed of a verb followed by a particle (examples: ''turn down'', ''run into'' or ''sit up''), sometimes combined with a preposition (ex ...
*
Siamese twins (English language) In linguistics and stylistics, an irreversible binomial, frozen binomial, binomial freeze, binomial expression, binomial pair, or nonreversible word pair is a pair or group of words used together in fixed order as an idiomatic expression or co ...
*
Terminology extraction Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a give ...
*
n-gram In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called Q-gram) is a contiguous sequence of ''n'' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or b ...
analysis


External links


What is collocation


References

Tasks of natural language processing Computational linguistics Corpus linguistics {{comp-ling-stub