The American National Corpus (ANC) is a

text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical a ...

American English American English, sometimes called United States English or U.S. English, is the set of variety (linguistics), varieties of the English language native to the United States. English is the Languages of the United States, most widely spoken lan ...

containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the

British National Corpus The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention ...

. It is annotated for

part of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assi ...

and

lemma Lemma may refer to: Language and linguistics * Lemma (morphology), the canonical, dictionary or citation form of a word * Lemma (psycholinguistics), a mental abstraction of a word about to be uttered Science and mathematics * Lemma (botany), a ...

, shallow parse, and

named entities Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre ...

. The ANC is available from the

Linguistic Data Consortium The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research and developm ...

. A fifteen million word subset of the corpus, called the Open American National Corpus (OANC), is freely available with no restrictions on its use from the ANC Website. The corpus and its annotations are provided according to the specifications of

ISO/TC 37 ISO/TC 37 is a technical committee within the International Organization for Standardization (ISO) that prepares standards and other documents concerning methodology and principles for terminology and language resources. Title: Terminology an ...

SC4's Linguistic Annotation Framework. By using a freely provided transduction tool (ANC2Go), the corpus and user-chosen annotations are provided in multiple formats, including CoNLL IOB format, the XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the

's XAIRA search engine), a

UIMA UIMA ( ), short for Unstructured Information Management Architecture, is an OASIS standard for content analytics, originally developed at IBM. It provides a component software architecture for the development, discovery, composition, and deploy ...

-compliant format, and formats suitable for input to a wide variety of concordance software. Plugins to import the annotations into

General Architecture for Text Engineering General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many nat ...

(GATE) are also available. The ANC differs from other corpora of English because it is richly annotated, including different

annotations (Penn tags, CLAWS5 and CLAWS7 tags), shallow parse annotations, and annotations for several types of

. Additional annotations are added to all or parts of the corpus as they become available, often by contributions from other projects. Unlike on-line searchable corpora, which due to copyright restrictions allow access only to individual sentences, the entire ANC is available to enable research involving, for example, development of statistical language models and full-text linguistic annotation. ANC annotations are automatically produced and unvalidated. A 500,000 word subset called the Manually Annotated Sub-Corpus (MASC) is annotated for approximately 20 different kinds of linguistic annotations, all of which have been hand-validated or manually produced. These includ
Penn Treebank
syntactic annotation,

WordNet WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into '' synsets'' with short definition ...

sense annotation,

FrameNet FrameNet is a research and resource development project based at the International Computer Science Institute (ICSI) in Berkeley, California, which has produced an electronic resource based on a theory of meaning called frame semantics. The data ...

semantic frame annotations, among others. Like the OANC, MASC is freely available for any use, and can be downloaded from the ANC site or from the

. It is also distributed in part-of-speech tagged form with the

Natural Language Toolkit The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and E ...

. The ANC and its sub-corpora differ from similar corpora primarily in the range of linguistic annotations provided and the inclusion of modern genres that do not appear in resources like the

. Also, because the initial target use of the corpora was the development of statistical language models, the full data and all annotations are available, thus differing from the

Corpus of Contemporary American English The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU). Content The Corpus of C ...

(COCA) which is available only selectively through a web browser. Continued growth of the OANC and MASC relies on contributions of data and annotations from the computational linguistics and corpus linguistics communities.

References

* Ide, N. (2008)
The American National Corpus: Then, Now, and Tomorrow
In Michael Haugh, Kate Burridge, Jean Mulder and Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, Cascadilla Proceedings Project, Sommerville, MA. * Ide, N., Suderman, K. (2004)
The American National Corpus First Release
Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), Lisbon, 1681-84. *Ide, N., Baker, C., Fellbaum, C., Passonneau, R. (2010).
The Manually Annotated Sub-Corpus: A Community Resource For and By the People
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

External links

ANC Website

MASC website

ANC2Go
{{Corpus linguistics English corpora Online databases Applied linguistics Linguistic research Computational linguistics

See also

References

External links