The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.
History
The Survey of English Usage was founded as the Survey of Spoken English at Durham University in 1959 by Randolph Quirk, moving with him to University College London in 1960. Many well-known linguists have spent time doing research at the Survey, including Bas Aarts, Valerie Adams, John Algeo, Dwight Bolinger, Noël Burton-Roberts, David Crystal, Derek Davy, Jan Firbas, Sidney Greenbaum, Liliane Haegeman, Robert Ilson, Ruth Kempson, Geoffrey Leech, Jan Rusiecki, Jan Svartvik, and Joe Taglicht. The current director is Bas Aarts.
The original Survey Corpus predated modern computing. It was recorded on reel-to-reel tapes, transcribed on paper, filed in filing cabinets, and indexed on paper cards. Transcriptions were marked up using a detailed prosodic and paralinguistic annotation scheme developed by Crystal and Quirk (1964). Sets of paper cards were manually annotated for grammatical structures and filed so that, for example, all noun phrases could be found in the noun phrase filing cabinet in the Survey. As a result, corpus searches required a visit to the Survey.
This corpus is now known more widely as the London-Lund Corpus (LLC), as it was the responsibility of co-workers in Lund, Sweden, to computerise the corpus. Thirty-four of the spoken texts were published in book form as Svartvik and Quirk (1980), and the corpus was used as the basis for the famous book ''A Comprehensive Grammar of the English Language'' (Quirk ''et al.'' 1985).
[Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). ''A Comprehensive Grammar of the English Language''. London: Longman.]
Current research
Constructing corpora
In 1988 Sidney Greenbaum proposed a new project, ICE, the International Corpus of English. ICE was to be an international project, carried out at research centres around the world, to compile corpora of varieties of English from countries where English was the first language or an official second language. ICE texts would contain spoken and written English in a balanced sample of one million words per component, so that these samples could be compared in a wide variety of ways. The ICE project continues around the world to the present day.
ICE-GB, the British Component of ICE, was compiled at the Survey. ICE-GB was annotated to a very detailed level, including constructing a full grammatical analysis (parse) for every sentence in the corpus. The first release of ICE-GB took place in 1998. ICE-GB was distributed with software for searching and exploring the parsed corpus, called ICECUP. Release 2 of ICE-GB is now available on CD.
As well as contrasting varieties of English, many researchers are interested in language development and change over time. A recent project at the Survey undertook the parsing of a large (400,000-word) selection of the spoken part of the LLC in a manner directly comparable with ICE-GB, forming a new 800,000-word diachronic corpus called the Diachronic Corpus of Present-Day Spoken English (DCPSE). DCPSE is now available on CD from the Survey.
These two corpora comprise the largest collection of parsed and corrected, orthographically transcribed spoken English language data in the world, with over one million words of spoken English in this form.
Exploring corpora
Parsed corpora are large databases containing detailed grammatical tree structures. One of the consequences of forming large collections of valuable linguistic data is a pressing need for methods and tools to help researchers and other users make the most of them. So in parallel with the parsing of natural language data, the Survey team have carried out research and development of software tools to help linguists use these corpora. The ICECUP research platform uses an intuitive grammatical query representation called Fuzzy Tree Fragments (FTFs) to search parsed corpora.
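The general idea of querying a parsed corpus with a partial tree can be illustrated with a small sketch. This is not ICECUP's implementation or its actual FTF formalism, neither of which is described here. It is a minimal Python illustration that assumes a toy tree of category labels, where a fragment matches a node if the labels agree and the fragment's children can be found among the node's children in the same left-to-right order, with other children allowed in between. The names `Node`, `Fragment`, `matches` and `search`, and the category labels, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A node in a toy parse tree: a category label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Fragment:
    """A partial tree pattern (hypothetical). label=None matches any category."""
    label: Optional[str] = None
    children: List["Fragment"] = field(default_factory=list)


def matches(fragment: Fragment, node: Node) -> bool:
    """Return True if `fragment` matches at `node`.

    Fragment children must match distinct children of the node in the same
    left-to-right order, but unrelated children may intervene.
    """
    if fragment.label is not None and fragment.label != node.label:
        return False
    position = 0
    for sub in fragment.children:
        # Scan forward for the next child that satisfies this sub-fragment.
        while position < len(node.children) and not matches(sub, node.children[position]):
            position += 1
        if position == len(node.children):
            return False
        position += 1
    return True


def search(fragment: Fragment, root: Node) -> Iterator[Node]:
    """Yield every node in the tree, in pre-order, where the fragment matches."""
    if matches(fragment, root):
        yield root
    for child in root.children:
        yield from search(fragment, child)


# Toy parse of "the cat sat on the mat" with illustrative labels.
tree = Node("CL", [
    Node("NP", [Node("DET"), Node("N")]),
    Node("VP", [
        Node("V"),
        Node("PP", [Node("PREP"), Node("NP", [Node("DET"), Node("N")])]),
    ]),
])

# Query: any noun phrase that contains a noun among its children.
np_with_noun = Fragment("NP", [Fragment("N")])
hits = list(search(np_with_noun, tree))
print(len(hits))  # prints 2
```

Leaving a fragment's label as `None`, or listing only some of a node's children, is what makes the query partial: the researcher specifies only the parts of the structure that matter and leaves the rest open.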
Linguistic research with corpora
As well as distributing corpora and tools to the corpus linguistics research community, the SEU carries out research into the English language. Recent projects include research on the English Noun Phrase, Subordination in Spoken and Written English, and the English Verb Phrase. The Survey also provides support for PhD students who carry out research into English language corpora.