Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language. In total, the texts in the Oxford English Corpus contain more than 2 billion words. The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard's Parliamentary Debates, blogs, chat logs, and emails.
Another English corpus that has been used to study word frequency is the Brown Corpus, which was compiled by researchers at Brown University in the 1960s. The researchers published their analysis of the Brown Corpus in 1967. Their findings were similar, but not identical, to the findings of the OEC analysis.
According to ''The Reading Teacher's Book of Lists'', the first 25 words in the OEC make up about one-third of all printed material in English, and the first 100 words make up about half of all written English. [The First 100 Most Commonly Used English Words] According to a study cited by Robert McCrum in ''The Story of English'', all of the first hundred of the most common words in English are of either Old English or Old Norse origin, except for "just" (ultimately from Latin "iustus"), "people" (ultimately from Latin "populus"), "use" (ultimately from Latin "usare"), and "because" (in part from Latin "causa").
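The coverage figures quoted above (the share of all running text accounted for by the top 25 or top 100 words) can be measured on any text. The following is a minimal sketch, not the method used for the OEC: the function name `top_n_coverage`, the regular-expression tokeniser, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_n_coverage(text: str, n: int) -> float:
    """Return the fraction of all word tokens covered by the n most frequent words."""
    # Simplifying assumption: a token is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the counts of the n most frequent distinct words.
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


# Hypothetical toy text; a real study would use a large corpus.
sample = "the cat and the dog and the bird saw the cat"
print(top_n_coverage(sample, 1))  # share of tokens that are the single most common word
print(top_n_coverage(sample, 3))  # share covered by the three most common words
```

On a large, varied corpus the same calculation with `n=25` and `n=100` would be expected to approach the proportions reported above, although the exact values depend on how the text is tokenised.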
Some lists of common words distinguish between word forms, while others rank all forms of a word as a single lexeme (the form of the word as it would appear in a dictionary). For example, the lexeme ''be'' (as in ''to be'') comprises all its conjugations (''am'', ''are'', ''is'', ''was'', ''were'', etc.), and contractions of those conjugations. [Benjamin Zimmer. June 22, 2006. Time after time after time... Language Log. Retrieved June 22, 2006.] The top 100 lemmas listed below account for 50% of all the words in the Oxford English Corpus.
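The difference between counting word forms and counting lexemes can be illustrated with a small sketch. The mapping `FORM_TO_LEMMA` below is a hypothetical, hand-written table covering only a few forms of ''be'' and ''have''; it is not the lemmatisation used for the OEC, which is not described here.

```python
from collections import Counter

# Hypothetical form-to-lemma table for illustration only. The contraction "'s"
# is deliberately left out because it can also stand for "has" or a possessive.
FORM_TO_LEMMA = {
    "am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
    "been": "be", "being": "be", "'m": "be", "'re": "be",
    "has": "have", "had": "have", "having": "have",
}


def count_forms(tokens):
    """Count each distinct word form separately."""
    return Counter(token.lower() for token in tokens)


def count_lemmas(tokens):
    """Count tokens after mapping each known form to its lemma."""
    return Counter(FORM_TO_LEMMA.get(token.lower(), token.lower()) for token in tokens)


tokens = ["I", "am", "here", "you", "are", "there", "it", "was", "good"]
print(count_forms(tokens)["am"])   # forms are counted one by one
print(count_lemmas(tokens)["be"])  # "am", "are" and "was" are pooled under "be"
```

Pooling forms in this way is why a lemma such as ''be'' can rank higher in a lemma-based list than any one of its forms does in a form-based list.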
100 most common words
A list of 100 words that occur most frequently in written English is given below, based on an analysis of the Oxford English Corpus (a collection of texts in the English language, comprising over 2 billion words). A part of speech is provided for most of the words, but part-of-speech categories vary between analyses, and not all possibilities are listed. For example, "I" may be a pronoun or a Roman numeral; "to" may be a preposition or an infinitive marker; "time" may be a noun or a verb. Also, a single spelling can represent more than one root word. For example, "singer" may be a form of either "sing" or "singe". Different corpora may treat such differences differently.
The number of distinct senses that are listed in Wiktionary is shown in the polysemy column. For example, "out" can refer to an escape, a removal from play in baseball, or any of 36 other concepts. On average, each word in the list has 15.38 senses. The sense count does not include the use of terms in phrasal verbs such as "put out" (as in "inconvenienced") and other multiword expressions such as the interjection "get out!", where the word "out" does not have an individual meaning. As an example, "out" occurs in at least 560 phrasal verbs and appears in nearly 1700 multiword expressions.
The table also includes frequencies from other corpora. As well as usage differences, lemmatisation may differ from corpus to corpus – for example splitting the prepositional use of "to" from the use as a particle. Also, the Corpus of Contemporary American English (COCA) list includes dispersion as well as frequency to calculate rank.
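To show how dispersion can change a ranking, the sketch below multiplies each word's total frequency by Juilland's D, a commonly used dispersion measure. This is an illustration only: the text above does not say which dispersion measure or formula COCA uses, and the function names and counts here are hypothetical.

```python
import math


def juilland_d(part_counts):
    """Juilland's D for one word, given its counts in equally sized corpus parts."""
    n = len(part_counts)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_counts) / n
    if mean == 0:
        return 0.0
    # Population variance of the per-part counts.
    variance = sum((c - mean) ** 2 for c in part_counts) / n
    coefficient_of_variation = math.sqrt(variance) / mean
    return 1 - coefficient_of_variation / math.sqrt(n - 1)


def rank_by_adjusted_frequency(counts_by_word):
    """Rank words by total frequency multiplied by dispersion, highest first."""
    scores = {
        word: sum(parts) * juilland_d(parts)
        for word, parts in counts_by_word.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical counts of three words in four equally sized corpus parts.
counts = {
    "the": [50, 48, 52, 50],   # frequent and evenly spread
    "goal": [0, 0, 0, 120],    # frequent but confined to one part
    "time": [20, 22, 18, 20],  # less frequent but evenly spread
}
print(rank_by_adjusted_frequency(counts))
```

In this toy data "goal" has a higher raw frequency than "time", but because all of its occurrences fall in a single part its dispersion is close to zero, so it ranks below "time" once dispersion is taken into account.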
Parts of speech
The following is a very similar list, also from the OEC, subdivided by part of speech. The list labeled "Others" includes pronouns, possessives, articles, modal verbs, adverbs, and conjunctions.
See also
* Basic English
* Frequency analysis, the study of the frequency of letters or groups of letters
* Letter frequencies
* Oxford English Corpus
* Swadesh list, a compilation of basic concepts for the purpose of historical-comparative linguistics
* Zipf's law, a theory stating that the frequency of any word is inversely proportional to its rank in a frequency table
Word lists
* Dolch Word List, a list of frequently used English words
* General Service List
* New General Service List
* Word lists by frequency
External links
* Frequency lists for English on Wiktionary (Wiktionary:Frequency lists#English)