The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.
History
Sidney Greenbaum's goal of compiling corpora for comparing the syntax of world English became the ICE project, which was realized by Professor Charles F. Meyer. Greenbaum envisaged international teams of researchers collecting comparable national varieties of English, both written and spoken. Varieties such as British English, American English, and Indian English would each be represented by a computer corpus, allowing researchers to compare their syntax. Once complete, the ICE corpora would support comprehensive linguistic analysis of the varieties of English that have emerged. Ongoing research for ICE is carried out by international teams in diverse regions.
The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.
Description
Each corpus contains one million words in 500 texts of approximately 2,000 words each, following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, the ''majority'' of texts are derived from spoken data.
With only one million words per corpus, ICE corpora are considered very small by modern standards. ICE corpora contain 60% (600,000 words) of orthographically transcribed ''spoken'' English. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription sets ICE apart from many other corpora, including those that contain, for example, parliamentary or legal paraphrases.
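The design figures quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the constant and function names are invented here, and the text counts are derived on the assumption that every text is exactly 2,000 words, which real samples only approximate.

```python
# Figures taken from the description above; all names are illustrative.
TEXTS_PER_CORPUS = 500
WORDS_PER_TEXT = 2000
SPOKEN_SHARE_PERCENT = 60  # share of each corpus that is transcribed speech


def corpus_breakdown():
    """Return word and text counts for the spoken and written parts of one corpus."""
    total_words = TEXTS_PER_CORPUS * WORDS_PER_TEXT
    spoken_words = total_words * SPOKEN_SHARE_PERCENT // 100
    written_words = total_words - spoken_words
    return {
        "total_words": total_words,
        "spoken_words": spoken_words,
        "written_words": written_words,
        # Assumes every text is exactly WORDS_PER_TEXT words long.
        "spoken_texts": spoken_words // WORDS_PER_TEXT,
        "written_texts": written_words // WORDS_PER_TEXT,
    }


if __name__ == "__main__":
    for name, value in corpus_breakdown().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the 60% spoken share corresponds to 300 of the 500 texts, leaving 200 written texts.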
The corpora consist entirely of data from 1990 or later. The subjects from whom the data was collected are all adults who were educated in English and were either born in, or moved at an early age to, the country to which their data is attributed.
There are speech and text samples from both men and women of many age groups, but the corpus website notes that "The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields."
The British Component of ICE, ICE-GB, is fully parsed with a detailed Quirk ''et al.'' phrase structure grammar, and the analyses have been thoroughly checked and completed. This analysis includes part-of-speech tagging and parsing of the entire corpus. The treebank can be thoroughly searched and explored with the ''ICE Corpus Utility Program'' or ''ICECUP'' software. More information is in the handbook.
Many corpora are currently available for download from the official ICE webpage, though some require a license. Others are not yet ready for publication.
Textual and Grammatical Annotation
Researchers and linguists follow specific guidelines when annotating data for the corpus, which can be found in the International Corpus of English Manuals and Documentation. The three levels of annotation are textual markup, word class tagging, and syntactic parsing.
Textual Markup
Original markup and layout, such as sentence and paragraph breaks, are preserved, with special markers indicating them as original. Spoken data is transcribed orthographically, with indicators for hesitations, false starts, and pauses.
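As a rough illustration of how such indicators might be handled in practice, the sketch below counts pauses and false starts in a transcribed utterance and then strips the markup. The marker inventory used here (`<,>` for a short pause, `<,,>` for a long pause, `<.>...</.>` around an incomplete word) is an assumption made for this example and should be checked against the ICE markup manual; the function name and the sample utterance are invented.

```python
import re

# Assumed marker inventory, for illustration only (not taken from the text above).
SHORT_PAUSE = "<,>"
LONG_PAUSE = "<,,>"
INCOMPLETE_WORD = re.compile(r"<\.>.*?</\.>")  # an abandoned word, e.g. a false start
ANY_TAG = re.compile(r"<[^>]*>")


def summarise_utterance(line):
    """Count pauses and false starts, then return the utterance without markup."""
    short_pauses = line.count(SHORT_PAUSE)
    long_pauses = line.count(LONG_PAUSE)
    false_starts = len(INCOMPLETE_WORD.findall(line))
    # Drop incomplete words together with their tags, then any remaining tags.
    text = INCOMPLETE_WORD.sub(" ", line)
    text = ANY_TAG.sub(" ", text)
    return {
        "short_pauses": short_pauses,
        "long_pauses": long_pauses,
        "false_starts": false_starts,
        "text": " ".join(text.split()),  # normalise whitespace
    }


if __name__ == "__main__":
    sample = "well <,> I <.>thi</.> think it was <,,> last year"  # invented utterance
    print(summarise_utterance(sample))
```

Keeping the counts separate from the cleaned text means the same pass can feed both a disfluency study and a plain-text word count.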
Word Class Tagging
Word classes, also called parts of speech, are grammatical categories for words based upon their function in a sentence.
British texts are automatically tagged for wordclass by the ICE tagger, developed at University College London, which uses a comprehensive grammar of the English language.
All other varieties are tagged automatically using the Penn Treebank and CLAWS tagsets. While the tags are not corrected manually, they are checked regularly for quality.
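Once texts carry word class tags, comparing varieties often comes down to counting tags and normalising the counts. The sketch below assumes a hypothetical `word_TAG` token format and a few Penn Treebank style tags; neither the format nor the sample sentence comes from the ICE distribution, and the function name is invented.

```python
from collections import Counter


def tag_frequencies(tagged_text, per=1000):
    """Return tag frequencies per `per` tagged tokens.

    Expects whitespace-separated tokens of the (assumed) form word_TAG.
    Tokens without an underscore are ignored.
    """
    tags = [token.rsplit("_", 1)[1] for token in tagged_text.split() if "_" in token]
    total = len(tags)
    if total == 0:
        return {}
    counts = Counter(tags)
    return {tag: n * per / total for tag, n in counts.items()}


if __name__ == "__main__":
    sample = "The_DT team_NN tagged_VBD the_DT corpus_NN"  # invented sentence
    for tag, freq in sorted(tag_frequencies(sample).items()):
        print(f"{tag}: {freq:.1f} per 1,000 tokens")
```

Normalising to a fixed base makes counts from samples of different sizes directly comparable, which matters when setting one national component against another.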
Syntactic Parsing
The sentences are parsed automatically and, if necessary, manually corrected with ICECUP, a syntax tree editor created specifically for the corpus.
Dependency parsing is also performed automatically, using the Pro3Gres dependency parser. The results are not manually verified.
Pragmatic Parsing
Ireland is currently the only participating country whose data includes pragmatic annotation.
Design of the Corpora
Below are the subsections of the ICE, with the number of corpora for each category and sub-category in parentheses.
Publications
There are a number of books published about the International Corpus of English, as well as books based in part on the corpora.
* ''English in the Caribbean: Variation, Style and Standards in Jamaica and Trinidad'' (2014) by Dagmar Deuber
* ''The Present Perfect in World Englishes: Charting Unity and Diversity'' (2014) by Valentin Werner
* ''Mapping Unity and Diversity Worldwide: Corpus-based Studies of New Englishes'' (2012) by Marianne Hundt and Ulrike Gut
* ''The Syntax of Spoken Indian English'' (2012) by Claudia Lange
* ''Oxford Modern English Grammar'' (2011) by Bas Aarts
* ''Adjunct Adverbials in English'' (2010) by Hilde Hasselgård
* ''ICAME Journal'' No 34 (2010)
* ''An Introduction to English Grammar'' (2009) by Sidney Greenbaum and Gerald Nelson
* ''Word-Formation in New Englishes: A corpus-based Analysis'' (2008) by Thomas Biermeier
* Special issue of ''World Englishes'' Volume 23 Number 2 (2004)
* ''Exploring Natural Language: Working with the British component of the International Corpus of English'' (2002) by Gerald Nelson, Sean Wallis, and Bas Aarts
* ''Comparing English Worldwide: The International Corpus of English'' (1996) by Sidney Greenbaum
* ''Oxford English Grammar'' (1996) by Sidney Greenbaum
Participants
The current participating countries are listed below (* = available):
* Australia
* Cameroon
* Canada*
* East Africa (Kenya, Malawi, Tanzania)*
* Fiji
* Ghana
* Great Britain* (parsed)
* Hong Kong*
* India*
* Ireland*
* Jamaica*
* Malta
* Malaysia
* New Zealand*
* Nigeria* (tagged)
* Pakistan
* The Philippines*
* Sierra Leone
* Singapore*
* South Africa
* Sri Lanka
* Trinidad and Tobago
* USA*
See also
* Corpus linguistics
* British National Corpus
* BYU Corpus of American English
External links
The International Corpus of English website