Spoken English Corpus

The Spoken English Corpus (SEC) is a speech corpus: a collection of recordings of spoken British English compiled during 1984–87. The corpus manual is available from ICAME.


History

The Spoken English Corpus (SEC) project was supported jointly in 1984–85 by the Humanities Research Fund at Lancaster University and by IBM (UK) Ltd, and subsequently by IBM (UK) Ltd alone. The project was supported by Geoffrey Leech at Lancaster and Geoffrey Kaye at IBM. It was a collaboration, funded by IBM, between the Unit for Computer Research on the English Language (UCREL) at the University of Lancaster and the IBM Scientific Centre in Winchester.


Compilation

SEC comprises 53 recorded passages, mainly from the BBC, spoken in the accent usually referred to as Received Pronunciation, or RP. The collection covers categories such as commentary, news broadcast, lecture, dialogue, poetry and propaganda. The corpus contains 52,637 words, totalling 339 minutes of speech. Lita Taylor describes the compilation of the corpus in her 1996 article "The Compilation of the Spoken English Corpus."


Transcription

A system was devised for transcribing the intonation of the recorded material. Two transcribers, Gerry Knowles and Briony Williams, both supported by Lita Taylor, analysed the entire corpus. The transcription system is explained by Williams. Brian Pickering conducted an experiment to assess the degree of agreement between the two transcribers on a section of the corpus, containing around 1000 tone-units, that both had transcribed. Good agreement was found.

An important attribute of a modern corpus is that it is computer-readable: a corpus tends to reside on a hard disk rather than on a bookshelf. In presenting the corpus in book form, the authors took into account the needs both of established corpus linguists and of those not yet familiar with corpora. Anyone who has the corpus on disk can make hard copies of most of the files, but without a special font to print the prosodic symbols, the prosodic texts will be either unprintable or unreadable. For this reason the prosodic version was chosen for publication. The printed transcription was produced in its present form by Peter Alderson, who later took over as Speech Research Manager at IBM. The volume, entitled "A Corpus of Formal British English Speech: The Lancaster/IBM Spoken English Corpus", was first published by Longman in 1996 and later by Routledge in 2013. The book is currently available from online bookstores including Routledge and Book Depository, or in electronic format from Google Play Books.
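The article does not say which measure was used in the agreement experiment. As an illustration only, the sketch below computes two common measures for a pair of transcribers who labelled the same tone-units: raw percentage agreement and Cohen's kappa, which corrects for agreement expected by chance. The tone labels and the sample data are hypothetical and are not taken from the SEC.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of items on which the two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a = Counter(a)
    counts_b = Counter(b)
    # Chance agreement: probability that both annotators independently
    # pick the same label, given each annotator's own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tone labels for eight tone-units (not real SEC data).
transcriber_1 = ["fall", "rise", "fall", "level", "fall-rise", "fall", "rise", "fall"]
transcriber_2 = ["fall", "rise", "fall", "fall", "fall-rise", "fall", "level", "fall"]

print(observed_agreement(transcriber_1, transcriber_2))      # 0.75
print(round(cohens_kappa(transcriber_1, transcriber_2), 3))  # 0.6
```

Kappa is lower than raw agreement here because "fall" is frequent for both transcribers, so some of their agreement would be expected by chance alone.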


Other analyses

Grammatical tagging of each word, based on the CLAWS1 tagset, was added to the text of the SEC by an automatic process. Because this tagging was in machine-readable form, it was possible to relate grammatical and prosodic information in the texts. Subsequent work used probabilistic models to develop the grammatical tagging further and to produce automatic parsing techniques. Anne Wichmann published her research on SEC intonation, "Intonation in Text and Discourse: Beginnings, middles, and ends", in 2000.


Machine-Readable Spoken English Corpus (MARSEC)

Although the text and its associated tagging existed in machine-readable form, the recordings themselves existed only as tape recordings. A collaboration funded by the Economic and Social Research Council in 1992–94, between speech scientists at the Universities of Lancaster and Leeds in the United Kingdom, set out to produce a version of the corpus containing the recordings in digital form, time-linked to the text. The principal researchers were Gerry Knowles and Tamas Varadi (Lancaster) and Peter Roach and Simon Arnfield (Leeds). The outline of the project is set out in Knowles, and the automatic time-alignment is described by Roach and Arnfield. The digitized recordings were stored on CD-ROM. The corpus was subsequently made available for download for research purposes from Leeds University, though this facility is no longer supported.
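To illustrate what "time-linked to the text" means in practice, the sketch below stores each word with a start and end time and looks up the word being spoken at a given moment in the recording. The record layout, the names `AlignedWord` and `word_at`, and the sample timings are assumptions made for illustration; they do not reproduce the actual MARSEC file format.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def word_at(words, t):
    """Return the word whose interval contains time t, or None.

    `words` must be sorted by start time and must not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and words[i].start <= t < words[i].end:
        return words[i]
    return None  # t falls in a pause or outside the recording


# Hypothetical alignment with a pause between 0.95 s and 1.30 s.
words = [
    AlignedWord(0.00, 0.42, "good"),
    AlignedWord(0.42, 0.95, "morning"),
    AlignedWord(1.30, 1.62, "here"),
]

hit = word_at(words, 0.5)
print(hit.text if hit else None)  # morning
print(word_at(words, 1.0))        # None
```

With an alignment of this kind, a passage of text can be mapped to the stretch of audio that contains it, and the reverse.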


Aix-MARSEC

The work on MARSEC in Lancaster and Leeds finished around 1995, but the corpus has since undergone considerable further development at the University of Aix-en-Provence, France, under the direction of Daniel Hirst. The database consists of two major components: the digitized recordings from MARSEC and the annotations. Annotations have so far been made at nine levels, including phonemes, syllables, words, stress feet, rhythm units, and minor and major turn units. Two supplementary levels, the grammatical annotation by CLAWS and a Property Grammar system developed at Aix-en-Provence, are to be integrated soon. A possible disadvantage of this treatment is that the corpus can only be searched using specially written scripts. The database, together with tools, is available under GNU GPL licensing at the Aix-MARSEC project site.
Download Aix-MARSEC audio files (sign-up required)
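A specially written search script of the kind mentioned above might, for example, find every word that contains a given phoneme by relating two annotation levels through their time intervals. The sketch below assumes each level is a list of labelled time intervals. The tier names, the phoneme symbols and the sample data are hypothetical, and no Aix-MARSEC file format or tool is reproduced here.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def contained(parent, tier):
    """Intervals of `tier` that lie entirely within `parent`."""
    return [iv for iv in tier if iv.start >= parent.start and iv.end <= parent.end]


def words_with_phoneme(tiers, phoneme):
    """Labels of words whose time span includes the given phoneme label."""
    return [
        w.label
        for w in tiers["words"]
        if any(p.label == phoneme for p in contained(w, tiers["phonemes"]))
    ]


# Hypothetical two-level annotation of the phrase "the corpus".
tiers = {
    "words": [Interval(0.0, 0.2, "the"), Interval(0.2, 0.8, "corpus")],
    "phonemes": [
        Interval(0.0, 0.1, "D"), Interval(0.1, 0.2, "@"),
        Interval(0.2, 0.3, "k"), Interval(0.3, 0.45, "O:"),
        Interval(0.45, 0.55, "p"), Interval(0.55, 0.65, "@"),
        Interval(0.65, 0.8, "s"),
    ],
}

print(words_with_phoneme(tiers, "@"))  # ['the', 'corpus']
print(words_with_phoneme(tiers, "k"))  # ['corpus']
```

The same containment test extends to any pair of levels, such as syllables within stress feet, which is why queries across levels tend to need a purpose-built script and not a plain text search.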

