Oxford English Corpus

The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the Oxford English Dictionary and by Oxford University Press's language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words. It includes language from the UK, the United States, Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa. The text is mainly collected from web pages; some printed texts, such as academic journals, have been collected to supplement particular subject areas. The sources are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of blogs, emails, and social media". This may be contrasted with similar databases that sample only a specific kind of writing.

The corpus is generally available only to researchers at Oxford University Press, but other researchers who can demonstrate a strong need may apply for access.

The digital version of the Oxford English Corpus is formatted in XML and usually analysed with Sketch Engine software. (The Oxford English Corpus. Retrieved February 4, 2014.)

By April 27, 2006, the corpus contained 1 billion words.

Each document in the OEC is accompanied by metadata, including:
* title
* author (if known; many websites make this difficult to determine reliably)
* author gender (if known)
* language type (e.g. British English, American English)
* source website
* year (and date, if known)
* date of collection
* domain and subdomain
* document statistics (number of tokens, sentences, etc.)
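
As a rough illustration of how such per-document metadata might be modelled, the sketch below maps the fields listed above onto a Python record and reads them from an XML header. The element name `document`, the attribute names, and the sample values are all assumptions made for this example; the actual OEC schema is not described here and may differ.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class DocumentMetadata:
    """One record per corpus document. Field names are illustrative, not the OEC schema."""
    title: str
    language_type: str                   # e.g. "British English"
    source_website: str
    year: int
    date_of_collection: str              # kept as a plain string in this sketch
    domain: str
    subdomain: str
    tokens: int                          # document statistics
    sentences: int
    date: Optional[str] = None           # full date, only if known
    author: Optional[str] = None         # often hard to determine for web text
    author_gender: Optional[str] = None  # only if known


def parse_metadata(xml_text: str) -> DocumentMetadata:
    """Read a hypothetical <document> header whose attributes mirror the fields above."""
    attrs = ET.fromstring(xml_text).attrib
    return DocumentMetadata(
        title=attrs["title"],
        language_type=attrs["language_type"],
        source_website=attrs["source_website"],
        year=int(attrs["year"]),
        date_of_collection=attrs["date_of_collection"],
        domain=attrs["domain"],
        subdomain=attrs["subdomain"],
        tokens=int(attrs["tokens"]),
        sentences=int(attrs["sentences"]),
        date=attrs.get("date"),                    # missing attributes become None
        author=attrs.get("author"),
        author_gender=attrs.get("author_gender"),
    )


# Invented sample header, for illustration only.
SAMPLE = (
    '<document title="Example post" language_type="British English" '
    'source_website="example.org" year="2005" date_of_collection="2006-01-15" '
    'domain="weblog" subdomain="personal" tokens="812" sentences="41"/>'
)

record = parse_metadata(SAMPLE)
print(record.language_type, record.tokens, record.author)
```

Optional fields default to `None`, which reflects the "if known" qualifiers in the list: a document with no identifiable author still yields a complete record.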


See also

* British National Corpus
* Corpus of Contemporary American English (COCA)
* American National Corpus
* Frequency analysis

