In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.
An example of a phraseological collocation is the expression ''strong tea''. While the same meaning could be conveyed by the roughly equivalent ''powerful tea'', this adjective does not modify ''tea'' frequently enough for English speakers to become accustomed to its co-occurrence and regard it as idiomatic or unmarked. (By way of counterexample, ''powerful'' is idiomatically preferred to ''strong'' when modifying a ''computer'' or a ''car''.)
There are about six main types of collocations: adjective + noun, noun + noun (such as collective nouns), verb + noun, adverb + adjective, verb + prepositional phrase (phrasal verbs), and verb + adverb.
Collocation extraction is a computational technique that finds collocations in a document or corpus, using various techniques from computational linguistics that resemble data mining.
Expanded definition
Collocations are partly or fully fixed expressions that become established through repeated context-dependent use. Such terms as ''crystal clear'', ''middle management'', ''nuclear family'', and ''cosmetic surgery'' are examples of collocated pairs of words.
Collocations can be in a syntactic relation (such as verb–object: ''make'' and ''decision''), a lexical relation (such as antonymy), or they can be in no linguistically defined relation. Knowledge of collocations is vital for the competent use of a language: a grammatically correct sentence will stand out as awkward if collocational preferences are violated. This makes collocation an interesting area for language teaching.
Corpus linguists specify a key word in context (KWIC) and identify the words immediately surrounding it. This gives an idea of the way words are used.
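A KWIC listing can be sketched in a few lines of Python. The function name `kwic`, the `width` parameter, and the sample sentence are illustrative assumptions, not part of any standard tool, and the regular-expression tokenizer is deliberately crude.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, node, right context) for each hit of `keyword`.

    Hypothetical helper: tokenizes on word characters and lowercases
    everything, which is far simpler than a real concordancer.
    """
    tokens = re.findall(r"\w+", text.lower())
    node = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]      # up to `width` words before
            right = tokens[i + 1:i + 1 + width]     # up to `width` words after
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# Made-up sample text, used only to show the output format.
sample = "She made a strong tea. He likes strong tea with milk, not weak tea."
for left, node, right in kwic(sample, "tea", width=2):
    print(f"{left:>15} [{node}] {right}")
```

Aligning the node word in a column, as the `print` call does, is what lets a reader scan the left and right contexts for recurring neighbours such as ''strong''.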
The processing of collocations involves a number of parameters, the most important of which is the ''measure of association'', which evaluates whether the co-occurrence is purely by chance or statistically significant. Due to the non-random nature of language, most collocations are classed as significant, and the association scores are simply used to rank the results. Commonly used measures of association include mutual information, t scores, and log-likelihood.
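As a rough illustration of ranking by an association measure, the sketch below scores adjacent word pairs by pointwise mutual information, log2 of P(w1 w2) / (P(w1) P(w2)), with probabilities estimated from raw counts. The function name `pmi_scores`, the `min_count` cutoff, and the toy token list are assumptions made for this example, and real corpus tools typically add smoothing and larger frequency thresholds.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper: probabilities are plain relative frequencies,
    and pairs seen fewer than `min_count` times are skipped because PMI
    is unreliable for rare events.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = max(n - 1, 1)  # number of adjacent pairs in the sequence
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n
        p_w2 = unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest score first: the scores serve only to rank candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up toy "corpus", far too small for meaningful statistics.
tokens = ("strong tea and strong coffee but strong tea again "
          "and powerful computer and powerful computer").split()
ranked = pmi_scores(tokens)
for pair, score in ranked:
    print(pair, round(score, 2))
```

In this toy input ''powerful computer'' ranks above ''strong tea'' because ''strong'' also appears with other neighbours, which lowers the pair's score relative to chance.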
Rather than select a single definition, Gledhill proposes that collocation involves at least three different perspectives: co-occurrence, a statistical view, which sees collocation as the recurrent appearance in a text of a node and its collocates; construction, which sees collocation either as a correlation between a lexeme and a lexical-grammatical pattern, or as a relation between a base and its collocative partners; and expression, a pragmatic view of collocation as a conventional unit of expression, regardless of form. These different perspectives contrast with the usual way of presenting collocation in phraseological studies. Traditionally speaking, collocation is explained in terms of all three perspectives at once, in a continuum:
:Free combination ↔ bound collocation ↔ frozen idiom
In dictionaries
In 1933, Harold Palmer's ''Second Interim Report on English Collocations'' highlighted the importance of collocation as a key to producing natural-sounding language, for anyone learning a foreign language. Thus from the 1940s onwards, information about recurrent word combinations became a standard feature of monolingual learner's dictionaries. As these dictionaries became "less word-centred and more phrase-centred", more attention was paid to collocation. This trend was supported, from the beginning of the 21st century, by the availability of large text corpora and intelligent corpus-querying software, making it possible to provide a more systematic account of collocation in dictionaries. Using these tools, dictionaries such as the ''Macmillan English Dictionary'' and the ''Longman Dictionary of Contemporary English'' included boxes or panels with lists of frequent collocations.
There are also a number of specialized dictionaries devoted to describing the frequent collocations in a language. These include (for Spanish) ''Redes: Diccionario combinatorio del español contemporáneo'' (2004), (for French) ''Le Robert: Dictionnaire des combinaisons de mots'' (2007), and (for English) the ''LTP Dictionary of Selected Collocations'' (1997) and the ''Macmillan Collocations Dictionary'' (2010).
Statistically significant collocation
Student's t-test can be used to determine whether the occurrence of a collocation in a corpus is statistically significant.
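For a bigram <math>w_1 w_2</math>, let <math>P(w_1) = \frac{\#w_1}{N}</math> be the unconditional probability of occurrence of <math>w_1</math> in a corpus with size <math>N</math>, and let <math>P(w_2) = \frac{\#w_2}{N}</math> be the unconditional probability of occurrence of <math>w_2</math> in the corpus. The t-score for the bigram <math>w_1 w_2</math> is calculated as:
:<math>t = \frac{\bar{x} - \mu}{\sqrt{\frac{s^2}{N}}}</math>
where <math>\bar{x} = \frac{\#w_1 w_2}{N}</math> is the sample mean of the occurrence of <math>w_1 w_2</math>, <math>\#w_1 w_2</math> is the number of occurrences of <math>w_1 w_2</math>, <math>\mu = P(w_1) P(w_2)</math> is the probability of <math>w_1 w_2</math> under the null-hypothesis that <math>w_1</math> and <math>w_2</math> appear independently in the text, and <math>s^2 = \bar{x}(1 - \bar{x}) \approx \bar{x}</math> is the sample variance. With a large <math>N</math>, the t-test is equivalent to a z-test.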
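The calculation can be sketched as follows, assuming the usual formulation in which the t-score is the difference between the observed bigram probability and the probability expected under independence, divided by the square root of the sample variance over the corpus size. The function name `t_score`, its argument names, and the counts in the example are hypothetical and chosen only for illustration.

```python
import math

def t_score(count_pair, count_w1, count_w2, corpus_size):
    """t-score of a bigram from raw counts (a sketch, not a library API).

    count_pair  : occurrences of the bigram w1 w2
    count_w1/w2 : occurrences of each word on its own
    corpus_size : number of tokens N in the corpus
    """
    x_bar = count_pair / corpus_size                            # observed bigram probability
    mu = (count_w1 / corpus_size) * (count_w2 / corpus_size)    # expected under independence
    s2 = x_bar * (1 - x_bar)                                    # Bernoulli sample variance
    if s2 == 0:
        raise ValueError("t-score is undefined when the sample variance is zero")
    return (x_bar - mu) / math.sqrt(s2 / corpus_size)

# Hypothetical counts: a pair seen 30 times in a corpus of one million tokens.
t = t_score(count_pair=30, count_w1=1000, count_w2=500, corpus_size=1_000_000)
print(round(t, 2))
```

A larger positive value indicates that the pair occurs more often than independence would predict; a value near zero indicates that the observed frequency matches the chance expectation.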
See also
* English collocations
* Agreement (linguistics)
* Cliché
* Collocational restriction
* Collostructional analysis
* Compound noun, adjective and verb
* Government (linguistics)
* Irreversible binomial
* Isocolon
* Lexical item
* N-gram
* Phrasal verb
* Phraseology
* Phraseme
* Sketch Engine
* Statistically improbable phrase
* Word sketch
External links
* Ozdic Collocation Dictionary
* A Small System Storing Spanish Collocations (Igor A. Bolshakov & Sabino Miranda-Jiménez)
* Morphological characterization of collocations and semantic relationships in Spanish (Sabino Miranda-Jiménez & Igor A. Bolshakov)
* Example of collocations for the word "Surgery"