Speech corpus

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine). In linguistics, spoken corpora are used for research into phonetics, conversation analysis, dialectology and other fields.

A corpus is one such database; corpora is the plural of corpus (that is, many such databases).

There are two types of speech corpora:
# Read speech, which includes:
#* Book excerpts
#* Broadcast news
#* Lists of words
#* Sequences of numbers
# Spontaneous speech, which includes:
#* Dialogs between two or more people (including meetings; one such corpus is the KEC)
#* Narratives, in which a person tells a story (one such corpus is the Buckeye Corpus)
#* Map-tasks, in which one person explains a route on a map to another
#* Appointment-tasks, in which two people try to find a common meeting time based on individual schedules

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
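The pairing of audio files with transcriptions, and the split into read and spontaneous speech, can be illustrated with a minimal Python sketch. It is not the format of any particular corpus: the `Utterance` class, its field names, the file paths and the sample transcripts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: a recording plus its transcription (hypothetical schema)."""
    audio_path: str   # made-up path to the audio file
    transcript: str   # text transcription of the recording
    speech_type: str  # "read" or "spontaneous"
    genre: str        # e.g. "sequence of numbers", "map-task", "narrative"


# A toy corpus with invented entries; a real corpus would hold many hours of speech.
corpus = [
    Utterance("audio/0001.wav", "one two three four", "read", "sequence of numbers"),
    Utterance("audio/0002.wav", "turn left at the old mill", "spontaneous", "map-task"),
    Utterance("audio/0003.wav", "so then we went home", "spontaneous", "narrative"),
]


def count_by_type(utterances):
    """Count how many utterances are read and how many are spontaneous."""
    return Counter(u.speech_type for u in utterances)


def total_words(utterances):
    """Count whitespace-separated words across all transcriptions."""
    return sum(len(u.transcript.split()) for u in utterances)


if __name__ == "__main__":
    print(dict(count_by_type(corpus)))
    print(total_words(corpus))
```

Real corpora usually also store speaker metadata and time alignments between transcript and audio; those are left out here to keep the sketch short.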


See also

* Arabic Speech Corpus
* Common Voice
* EXMARaLDA
* Lingua Libre, an online libre tool
* List of children's speech corpora
* Non-native speech database
* Praat
* Spoken English Corpus
* The BABEL Speech Corpus
* TIMIT
* Transcriber
* Transcription (linguistics)


References

* Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
* Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.


External links


* Santa Barbara Corpus of Spoken American English
* The Buckeye Corpus of Conversational Speech
* The KEC: the Karl Eberhards Corpus of spontaneously spoken southern German in dialogues (audio and articulatory recordings)
* http://std.metu.edu.tr/en/ The Spoken Turkish Corpus at METU Ankara
* Spoken Corpus Klient with the Corp-Oral Corpus at ILTEC Lisbon
* VoxForge: open source speech corpora
* OLAC: Open Language Archives Community
* Simmortel Speech Recognition Corpus for Indian English and Hindi
* ELRA: the European Language Resources Association
* The PELCRA Conversational Corpus of Polish
* The Arabic Speech Corpus
* Corpus of Political Speeches: free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library