High-frequency Word

Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language. In total, the texts in the Oxford English Corpus contain more than 2 billion words. The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard's Parliamentary Debates, blogs, chat logs, and emails. Another English corpus that has been used to study word frequency is the Brown Corpus, which was compiled by researchers at Brown University in the 1960s. The researchers published their analysis of the Brown Corpus in 1967. Their findings were similar, but not identical, to the findings of the OEC analysis. According to ''The Reading Teacher's Book of Lists'', the first 25 words in the OEC make up about one-third of all printed material in English, and the first 100 words make up about half of all written English. (Source: "The First 100 Most Commonly Used English Words".)
According to a study cited by Robert McCrum in ''The Story of English'', all of the first hundred of the most common words in English are of Old English origin, except for "people", ultimately from Latin "populus", and "because", in part from Latin "causa". Some lists of common words distinguish between word forms, while others rank all forms of a word as a single lexeme (the form of the word as it would appear in a dictionary). For example, the lexeme ''be'' (as in ''to be'') comprises all its conjugations (''is'', ''was'', ''am'', ''are'', ''were'', etc.), and contractions of those conjugations.
(Benjamin Zimmer, "Time after time after time...", ''Language Log'', June 22, 2006. Retrieved June 22, 2006.)
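The difference between counting word forms and counting lexemes can be illustrated with a minimal Python sketch. The `LEMMA_OF` table, the tokenizer, and the function names below are hypothetical and written only for illustration; a real analysis would rely on a full lemmatiser, and contractions are deliberately left out here.

```python
import re
from collections import Counter

# Hypothetical, hand-written mapping from a few word forms to their lexeme.
# It covers only some conjugations of "be"; contractions are not handled.
LEMMA_OF = {
    "is": "be", "was": "be", "am": "be", "are": "be",
    "were": "be", "been": "be", "being": "be",
}

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_word_forms(text):
    """Count every distinct spelling separately."""
    return Counter(tokenize(text))

def count_lexemes(text, lemma_of=LEMMA_OF):
    """Count tokens after mapping each known form to its lexeme."""
    return Counter(lemma_of.get(tok, tok) for tok in tokenize(text))

sample = "I am here, you are there, and she is where he was."
forms = count_word_forms(sample)
lexemes = count_lexemes(sample)
print(forms["is"], forms["be"], lexemes["be"])
```

In the sample sentence each form of ''be'' occurs once, so a word-form list ranks them separately, while a lexeme list pools them into a single, higher-ranked entry.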
These top 100 lemmas listed below account for 50% of all the words in the Oxford English Corpus.
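A coverage figure of this kind is the share of all running words (tokens) that belong to the k most frequent entries. The following is a minimal sketch, assuming the counts are already available as a `Counter`; the function name `coverage_of_top` and the toy counts are hypothetical and are not taken from the OEC.

```python
from collections import Counter

def coverage_of_top(counts, k):
    """Return the fraction of all tokens covered by the k most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = counts.most_common(k)  # list of (word, count), most frequent first
    return sum(count for _, count in top) / total

# Toy counts for illustration only (not real corpus data).
toy_counts = Counter({"the": 5, "of": 3, "cat": 1, "sat": 1})
print(coverage_of_top(toy_counts, 2))
```

Applied to a real corpus, the same calculation with k = 100 would give the proportion of the text accounted for by the top 100 entries.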


100 most common words

A list of 100 words that occur most frequently in written English is given below, based on an analysis of the Oxford English Corpus (a collection of texts in the English language, comprising over 2 billion words). A part of speech is provided for most of the words, but part-of-speech categories vary between analyses, and not all possibilities are listed. For example, "I" may be a pronoun or a Roman numeral; "to" may be a preposition or an infinitive marker; "time" may be a noun or a verb. Also, a single spelling can represent more than one root word. For example, "singer" may be a form of either "sing" or "singe". Different corpora may treat such differences differently.

The number of distinct senses that are listed in Wiktionary is shown in the polysemy column. For example, "out" can refer to an escape, a removal from play in baseball, or any of 36 other concepts. On average, each word in the list has 15.38 senses. The sense count does not include the use of terms in phrasal verbs such as "put out" (as in "inconvenienced") and other multiword expressions such as the interjection "get out!", where the word "out" does not have an individual meaning. As an example, "out" occurs in at least 560 phrasal verbs and appears in nearly 1700 multiword expressions.

The table also includes frequencies from other corpora. Note that, as well as usage differences, lemmatisation may differ from corpus to corpus, for example by splitting the prepositional use of "to" from its use as a particle. Also, the Corpus of Contemporary American English (COCA) list includes dispersion as well as frequency to calculate rank.
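To illustrate how dispersion can change a ranking, here is a minimal Python sketch that multiplies raw frequency by Juilland's D, one widely used dispersion measure. The text above does not say which measure or formula COCA uses, so the choice of Juilland's D, the multiplication, the function names, and the toy counts are all assumptions made for illustration. The sketch also assumes the corpus is divided into equal-sized parts.

```python
import math

def juilland_d(part_counts):
    """Juilland's D for one word, given its count in each equal-sized corpus part.

    D is 1.0 when the word is spread evenly and 0.0 when it occurs in one part only.
    """
    n = len(part_counts)
    if n < 2:
        raise ValueError("at least two corpus parts are needed")
    mean = sum(part_counts) / n
    if mean == 0:
        return 0.0
    # Population standard deviation divided by the mean (coefficient of variation).
    variance = sum((c - mean) ** 2 for c in part_counts) / n
    cv = math.sqrt(variance) / mean
    return 1 - cv / math.sqrt(n - 1)

def rank_by_adjusted_frequency(counts_by_word):
    """Rank words by total frequency times dispersion (an assumed scoring rule)."""
    scores = {w: sum(c) * juilland_d(c) for w, c in counts_by_word.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))

# Toy counts per corpus part, for illustration only.
toy = {"even": [10, 10, 10, 10], "bursty": [50, 0, 0, 0]}
print(rank_by_adjusted_frequency(toy))
```

In the toy data, "bursty" has the higher raw frequency but occurs in only one part, so the dispersion-adjusted score places "even" first.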


Parts of speech

The following is a very similar list, subdivided by part of speech. The list labeled "Others" includes pronouns, possessives, articles, modal verbs, adverbs, and conjunctions.


See also

* Basic English
* Frequency analysis, the study of the frequency of letters or groups of letters
* Letter frequencies
* Oxford English Corpus
* Swadesh list, a compilation of basic concepts for the purpose of historical-comparative linguistics
* Zipf's law, a theory stating that the frequency of any word is inversely proportional to its rank in a frequency table


Word lists

* Dolch Word List, a list of frequently used English words
* General Service List
* Word lists by frequency


External links

* Wiktionary: Frequency lists (English)