A word list (or ''lexicon'') is a list of a language's lexicon (generally sorted by frequency of occurrence, either by levels or as a ranked list) within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, with gigantic analyses still done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated the research field.

In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.
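The computational definition above (word types, their number of occurrences, and a rank derived from list position) can be sketched in a few lines. This is only an illustrative sketch: the function name `frequency_list`, the toy sentence, and the alphabetical tie-breaking rule are assumptions made for the example, not part of any published list-building method.

```python
from collections import Counter

def frequency_list(tokens):
    """Return (rank, word, count) rows sorted by descending count."""
    counts = Counter(tokens)
    # Sort by count (descending), then alphabetically so ties are deterministic.
    # The tie-breaking rule is an assumption of this sketch.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # The rank is simply the 1-based position in the sorted list.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

# Toy corpus, already split into word tokens (hypothetical data).
tokens = "the cat saw the dog and the dog saw the cat".split()
rows = frequency_list(tokens)
for rank, word, count in rows:
    print(rank, word, count)
```

Real corpora need a tokenization step before counting, which is where the definition of "word" starts to matter.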

Methodology

Factors

Nation noted the considerable help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:
* corpus representativeness
* word frequency and range
* treatment of word families
* treatment of idioms and fixed expressions
* range of information
* various other criteria

Corpora

Traditional written corpus

Most currently available studies are based on written text corpora, which are more readily available and easier to process.

SUBTLEX movement

However, it has been proposed to tap into the large number of subtitles available online in order to analyse large amounts of speech. A long critical evaluation of the traditional textual analysis approach supported a move toward the analysis of speech and of film subtitles available online. A handful of follow-up studies have since provided valuable frequency counts for various languages. Indeed, the SUBTLEX movement completed in five years full studies for French, American English, Dutch, Chinese, Spanish, Greek, Vietnamese, Brazilian Portuguese, European Portuguese, Albanian, Polish and Catalan (2019). SUBTLEX-IT (2015) provides raw data only.

Lexical unit

In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated by spaces or punctuation. Exceptions can arise, however, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under the representation of its base word. Thus, ''possible, impossible, possibility'' are words of the same word family, represented by the base word ''*possib*''. For statistical purposes, the counts of all these words are summed under the base word form ''*possib*'', allowing a concept and its forms to be ranked together.

Other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given chain of several characters can be interpreted either as a phrase of single-character words or as a multi-character word.
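The two decisions described above for Latin scripts, namely what counts as one token and how word-family members are summed under a base word, can be illustrated with a small sketch. Everything named here is hypothetical: the `WORD_FAMILIES` mapping is hand-written for the example (a real project would rely on a curated word-family resource), and the regular expression is just one possible tokenization rule.

```python
import re
from collections import Counter

# Hypothetical mapping from surface forms to a base word.
WORD_FAMILIES = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}

# Runs of letters, optionally joined by internal straight apostrophes, so that
# "can't" and "aujourd'hui" each stay one token. Typographic apostrophes and
# multi-word idioms are not handled by this simple rule.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def family_counts(text):
    """Count tokens, summing word-family members under their base word."""
    counts = Counter()
    for token in tokenize(text):
        counts[WORD_FAMILIES.get(token, token)] += 1
    return counts

sample = "It can't be impossible: every possibility is possible aujourd'hui."
print(tokenize(sample))
print(family_counts(sample)["possib"])
```

A script without word separators, such as Chinese, would need a dedicated segmentation step in place of `tokenize`.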

Statistics

It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.

German linguists define the ''Häufigkeitsklasse'' (frequency class) ''N'' of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. For example, the misspelled word ''outragious'', with a ratio of 76/3789654, belongs in class 16.

N=\left\lfloor0.5-\log_2\left(\frac{\text{frequency of this item}}{\text{frequency of most common item}}\right)\right\rfloor

where \lfloor\ldots\rfloor is the floor function.

Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
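The frequency class defined above can be checked numerically with a short sketch. The function name `frequency_class` and its argument names are assumptions made for illustration; the counts 76 and 3789654 are the ones quoted in the text.

```python
import math

def frequency_class(item_count, top_count):
    """Frequency class N = floor(0.5 - log2(item_count / top_count))."""
    if item_count <= 0 or top_count <= 0:
        raise ValueError("counts must be positive")
    return math.floor(0.5 - math.log2(item_count / top_count))

# The most frequent item is in class 0 by construction.
print(frequency_class(3789654, 3789654))  # 0
# An item half as frequent as the top item falls in class 1.
print(frequency_class(50, 100))  # 1
# The ''outragious'' example from the text.
print(frequency_class(76, 3789654))  # 16
```

Adding 0.5 before taking the floor amounts to rounding the negative logarithm to the nearest integer, so each class spans roughly a factor of two in frequency.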

Pedagogy

These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors. Paul Nation's modern language teaching summary encourages first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion".

Effects of word frequency

Word frequency is known to have various effects. Memorization is positively affected by higher word frequency, likely because the learner is subject to more exposures. Lexical access is positively influenced by high word frequency, a phenomenon called the word frequency effect. The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.

Languages

Below is a review of available resources.

English

Word counting dates back to Hellenistic times. Thorndike and Lorge, assisted by their colleagues, counted 18,000,000 running words to provide the first large-scale frequency list in 1944, before modern computers made such projects far easier.

Traditional lists

These all suffer from their age. In particular, words relating to technology do not appear in any of these lists: "blog", for example, which was first attested in 1999, ranked #7665 in frequency in the Corpus of Contemporary American English in 2014.

;The Teachers Word Book of 30,000 words (Thorndike and Lorge, 1944)
The TWB contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and language changes, have reduced its applicability.

;The General Service List (West, 1953)
The GSL contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analysed in the 1940s. The rate of occurrence (%) of the different meanings and parts of speech of each headword is provided. Various criteria, other than frequency and range, were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise. This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.

;The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)
A corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials, and in its tagging of words by frequency in each school grade and in each subject area.

;The Brown (Francis and Kucera, 1982), LOB and related corpora
These now contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists.

French

;Traditional datasets
A review of the traditional datasets has been made. An attempt was made in the 1950s–60s with the ''Français fondamental''. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules. It is claimed that 70 grammatical words constitute 50% of communicative sentences, while 3,680 words give about 95–98% coverage. A list of 3,000 frequent words is available.

The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet. Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".

More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open license CC-by-sa-4.0.

;Subtlex
Lexique3 is an ongoing study from which the Subtlex movement cited above originated. A completely new count was made based on online film subtitles.
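Coverage figures such as those quoted above for ''Français fondamental'' are usually understood as the share of running words in a corpus accounted for by the most frequent items. The sketch below shows one way such a share could be computed; it assumes that reading of "coverage", and the function name `coverage` and the toy French sentence are hypothetical. It does not reproduce how the cited figures were actually obtained.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of running words covered by the top_n most frequent word types."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    # Sum the counts of the top_n most frequent types.
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / sum(counts.values())

# Toy corpus of 11 running words (hypothetical data).
tokens = "le chat voit le chien et le chien voit le chat".split()
print(coverage(tokens, 1))  # share covered by the single most frequent word
print(coverage(tokens, 3))  # share covered by the three most frequent words
```

On a real corpus the curve rises steeply at first and then flattens, which is why a short list of grammatical words can account for a large share of running text.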

Spanish

There have been several studies of Spanish word frequency.

Chinese

Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency. American sinologist John DeFrancis mentioned its importance for learning and teaching Chinese as a foreign language in ''Why Johnny Can't Read Chinese''. As frequency toolkits, Da and the Taiwanese Ministry of Education provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan) TOP list of about 8,600 common traditional Chinese words, are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, a rich study of Chinese word and character frequencies was recently made.

Other

Lists of the most frequently used words in different languages, based on Wikipedia or combined corpora, are also available.

