The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).


Content

The Corpus of Contemporary American English (COCA) comprised one billion words in 485,202 texts as of November 2021. The corpus has grown steadily: it contained more than 385 million words in 2009, 400 million words in 2010, and 560 million words by March 2019.

According to the corpus website, the November 2021 corpus includes 24-25 million words for each year from 1990 to 2019. Within each year, the corpus is evenly divided between six registers/genres: TV/movies, spoken, fiction, magazine, newspaper, and academic (see the Texts and Registers page of the COCA website). In addition to these six registers, COCA (as of November 2021) also contains 125,496,215 words from blogs and 129,899,426 words from websites, which extends its coverage of contemporary English (see the Texts and Registers page of the COCA website).

The texts come from a variety of sources:
* Spoken: (85 million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs.
* Fiction: (81 million words) Short stories and plays, first chapters of books 1990–present, and movie scripts.
* Popular magazines: (86 million words) Nearly 100 different magazines, from a range of domains such as news, health, home and gardening, women's, financial, religion, and sports.
* Newspapers: (81 million words) Ten newspapers from across the US, with text from different sections of the newspapers, such as local news, opinion, sports, and the financial section.
* Academic journals: (81 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress Classification system.
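As a rough consistency check, the November 2021 size figures quoted in this section can be added up. The sketch below is only illustrative arithmetic on the numbers given in the text; the variable and function names are made up for this example and do not come from the COCA website.

```python
# Figures quoted in the text for the November 2021 corpus.
YEARS = range(1990, 2020)                   # 1990-2019 inclusive (30 years)
WORDS_PER_YEAR = (24_000_000, 25_000_000)   # "24-25 million words for each year"
REGISTERS = 6                               # TV/movies, spoken, fiction, magazine, newspaper, academic
BLOG_WORDS = 125_496_215
WEB_WORDS = 129_899_426


def estimate_total(words_per_year: int) -> int:
    """Yearly register text plus the blog and website portions."""
    return len(YEARS) * words_per_year + BLOG_WORDS + WEB_WORDS


# Lower and upper estimates of the total corpus size.
low, high = (estimate_total(w) for w in WORDS_PER_YEAR)

# Approximate words per register per year, assuming an even split.
per_register_low = WORDS_PER_YEAR[0] // REGISTERS

print(f"Estimated total: {low:,} to {high:,} words")
print(f"Per register per year (lower bound): {per_register_low:,} words")
```

Under these assumptions the estimate falls between roughly 975 million and 1.005 billion words, which agrees with the "one billion words" figure, with about 4 million words per register per year.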


Availability

The Corpus of Contemporary American English is free to search for registered users.


Queries

* The interface is the same as the BYU-BNC interface for the 100-million-word British National Corpus, the 100-million-word Time Magazine Corpus, and the 400-million-word Corpus of Historical American English (COHA), which covers the 1810s–2000s
* Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)
* The corpus is tagged by CLAWS, the same part-of-speech tagger that was used for the BNC and the Time corpus
* Chart listings (totals for all matching forms in each genre or year, 1990–present, as well as for subgenres) and table listings (frequency for each matching form in each genre or year)
* Full collocates searching (up to ten words left and right of node word)
* Re-sortable concordances, showing the most common words/strings to the left and right of the searched word
* Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns following 'break the' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common 2005–2010 than previously)
* One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. comparison of collocates of 'small', 'little', 'tiny', 'minuscule', or 'lilliputian', or 'Democrats' and 'Republicans', or 'men' and 'women', or 'rob' vs 'steal')
* Users can include semantic information from a 60,000-entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))
* Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)
* Note that the corpus is available only through the web interface, due to copyright restrictions.
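To make the idea of collocates searching concrete, the following minimal sketch counts the words that occur within a fixed window to the left and right of a node word. It runs on a made-up toy sentence, not on COCA data, and it is not the corpus's own implementation or query syntax; the function name `collocates` and its parameters are hypothetical.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count words within `left`/`right` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)                  # do not run past the start of the text
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Toy text (an assumption for illustration only).
toy_tokens = "she sat in the chair and he sat on the other chair by the door".split()

# Collocates of the node word 'chair', two words to either side.
chair_collocates = collocates(toy_tokens, "chair", left=2, right=2)
print(chair_collocates.most_common(3))
```

A real collocates search would also normalize case, lemmatize or filter by part of speech, and rank candidates by an association measure rather than raw frequency; those steps are omitted here to keep the sketch short.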


Related

The corpus of Global Web-based English (GloWbE; pronounced "globe") contains about 1.9 billion words of text from twenty different countries. This makes it about 100 times as large as other corpora like the International Corpus of English, and it allows for many types of searches that would not be possible otherwise. In addition to the online interface, full-text data from the corpus can be downloaded. Its distinguishing feature is that it allows comparisons between different varieties of English. GloWbE is related to the many other corpora of English.


See also

* American National Corpus
* British National Corpus
* Bank of English
* Brown Corpus


External links

* "The Linguistics Search Engine That Overturned the Federal Mask Mandate", article in ''The Verge''