Speech Corpus
A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine). In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields. A corpus is one such database; corpora, the plural of corpus, refers to many such databases.

There are two types of speech corpora:
1. Read speech, which includes:
   * Book excerpts
   * Broadcast news
   * Lists of words
   * Sequences of numbers
2. Spontaneous speech, which includes:
   * Dialogs: between two or more people (includes meetings; one such corpus is the KEC)
   * Narratives: a person telling a story (one such corpus is the Buckeye Corpus)
   * Map-tasks: one person explains a route on a map to another
   * Appointment-tasks: two people try to find a common meeti ...
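As a rough illustration of the definition above, a speech corpus can be modelled as a list of records that pair an audio file with its transcription and a speech type. This is only a minimal sketch: the `Utterance` class, its field names, the file names and the transcripts are all hypothetical and do not describe any particular corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file plus its text transcription."""
    audio_path: str   # hypothetical file name, for illustration only
    transcript: str   # text transcription of the recording
    speech_type: str  # "read" or "spontaneous"
    genre: str        # e.g. "broadcast news", "narrative"


def count_by_type(corpus):
    """Count how many utterances fall under each speech type."""
    return Counter(u.speech_type for u in corpus)


# A tiny made-up corpus with both read and spontaneous speech.
corpus = [
    Utterance("news_001.wav", "good evening here is the news", "read", "broadcast news"),
    Utterance("digits_001.wav", "four seven two nine", "read", "sequences of numbers"),
    Utterance("story_001.wav", "so um we got lost on the way", "spontaneous", "narrative"),
]

counts = count_by_type(corpus)
print(counts["read"], counts["spontaneous"])
```

Real corpora usually add further metadata (speaker, recording conditions, time alignment); the flat record here is just the simplest shape that matches "audio files and text transcriptions".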


Database
In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations, including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues, including supporting concurrent access and fault tolerance. A database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS software additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an appli ...


Free Software
Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, not price; all users are legally free to do what they want with their copies of free software (including profiting from them) regardless of how much is paid to obtain the program ("Selling Free Software", gnu.org). Computer programs are deemed "free" if they give end-users (not just the developer) ultimate control over the software and, subsequently, over their devices. The right to study and modify a computer program entails that ...


Phonetics
Phonetics is a branch of linguistics that studies how humans produce and perceive sounds, or in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians. The field of phonetics is traditionally divided into three sub-disciplines based on the research questions involved such as how humans plan and execute movements to produce speech (articulatory phonetics), how various movements affect the properties of the resulting sound (acoustic phonetics), or how humans convert sound waves to linguistic information (auditory phonetics). Traditionally, the minimal linguistic unit of phonetics is the phone—a speech sound in a language which differs from the phonological unit of phoneme; the phoneme is an abstract categorization of phones. Phonetics deals with two aspects of human speech: production—the ways humans make sounds—and perception—the way speech is understood. The communicative modali ...


Dialectology
Dialectology (from Greek διάλεκτος, dialektos, "talk, dialect"; and -λογία, -logia) is the scientific study of linguistic dialect, a sub-field of sociolinguistics. It studies variations in language based primarily on geographic distribution and their associated features. Dialectology treats such topics as divergence of two local dialects from a common ancestor and synchronic variation. Dialectologists are ultimately concerned with grammatical, lexical and phonological features that correspond to regional areas. Thus they usually deal not only with populations that have lived in certain areas for generations, but also with migrant groups that bring their languages to new areas (see language contact). Commonly studied concepts in dialectology include the problem of mutual intelligibility in defining languages and dialects; situations of diglossia, where two dialects are used for different functions; dialect continua including a number of partially mutually intelligible dialects; and pluric ...


Speech Recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable computers to recognize spoken language and translate it into text, with the main benefit of searchability. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis. Some speech recognition systems require "training" (also called "enrollment"), in which an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker-independent" systems. Systems that use training are called "speaker-dependent" systems. Speech recognition ...


Corpus Linguistics
Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural corpora), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, in the natural context ("realia") of that language, with minimal experimental interference. The text-corpus method uses the body of texts written in any natural language to derive the set of abstract rules which govern that language. Those results can be used to explore the relationships between that subject language and other languages which have undergone a similar analysis. The first such corpora were manually derived from source texts, but now that work is automated. Corpora have been used not only for linguistics research but also to compile dictionaries (starting with The American Heritage Dictionary of the English Language in 1969) and grammar guides, such as A Compreh ...


Corpora
Corpus is Latin for "body". It may refer to:

Linguistics:
* Text corpus, in linguistics, a large and structured set of texts
* Speech corpus, in linguistics, a large set of speech audio files
* Corpus linguistics, a branch of linguistics

Music:
* Corpus (album), by Sebastian Santa Maria
* Corpus Delicti (band), also known simply as Corpus

Medicine:
* Corpus callosum, a structure in the brain
* Corpus cavernosum, a pair of structures in human genitals
* Corpus luteum, a temporary endocrine structure in mammals
* Corpus gastricum, the Latin term referring to the body of the stomach
* Corpus alienum, a foreign object originating outside the body
* Corpus albicans
* Corpora amylacea
* Corpora arenacea

Other uses:
* Corpus (Bernini), a 1650 sculpture of Christ by Gian Lorenzo Bernini
* Corpus (museum), a human body themed museum in the Netherlands
* Corpus Clock, a large sculptural clock
* Corpus (dance troupe), a Canadian dance troupe
* Corpus (typography) ...




Transcriber
Transcriber is an open-source software tool for the transcription and annotation of speech signals for linguistic research. It supports multiple hierarchical layers of segmentation, named entity annotation, speaker lists, topic lists, and overlapping speakers. Two views of the sound pressure waveform at different resolutions may be viewed simultaneously. Various character encodings, including Unicode, are supported. Annotations from Transcriber may be exported in XML. OASIS' Cover Pages publishes the open DTD used by Transcriber. Transcriber is written in Tcl/Tk with the Snack audio library and is therefore available on most major platforms. It is distributed under the GNU General Public License. Transcriber has been super ...
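To give a feel for working with XML-exported annotations of the kind described above, here is a minimal sketch that reads speaker turns from a small XML document using Python's standard library. The element and attribute names (`Trans`, `Turn`, `Sync`, `speaker`, `startTime`, `endTime`) and the sample content are assumptions made for illustration; the published DTD should be consulted for the actual structure.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; element and attribute names are assumptions.
SAMPLE = """<Trans>
  <Episode>
    <Section type="report" startTime="0" endTime="5.2">
      <Turn speaker="spk1" startTime="0" endTime="2.5">
        <Sync time="0"/>
        good evening
      </Turn>
      <Turn speaker="spk2" startTime="2.5" endTime="5.2">
        <Sync time="2.5"/>
        thank you
        <Sync time="4.0"/>
        here is the news
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_turns(xml_text):
    """Return (speaker, start, end, text) for every Turn element."""
    root = ET.fromstring(xml_text)
    turns = []
    for turn in root.iter("Turn"):
        # itertext() yields the text between child elements; collapse whitespace.
        words = " ".join("".join(turn.itertext()).split())
        turns.append(
            (turn.get("speaker"),
             float(turn.get("startTime")),
             float(turn.get("endTime")),
             words)
        )
    return turns


turns = read_turns(SAMPLE)
for speaker, start, end, text in turns:
    print(f"{speaker} [{start:.1f}-{end:.1f}s]: {text}")
```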


TIMIT
TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element has been delineated in time. TIMIT was designed to further acoustic-phonetic knowledge and automatic speech recognition systems. It was commissioned by DARPA, and corpus design was a joint effort between the Massachusetts Institute of Technology (MIT), SRI International, and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and verified and prepared for publishing by the National Institute of Standards and Technology (NIST). There is also a telephone-bandwidth version called NTIMIT (Network TIMIT). TIMIT and NTIMIT are not freely available: access to the dataset requires either membership of the Linguistic Data Consortium or a monetary payment.

History: The TIMIT telephone corpus was an early attempt to create a database with speech samples. It was published in the year 1988 on CD-ROM and consists ...
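The idea of transcribed elements "delineated in time" can be sketched as a list of labelled segments with start and end times. The example below assumes a plain-text layout of one `start end label` triple per line, with times given in samples, and a 16 kHz sampling rate; the text above does not specify the file format, so the layout, the rate and the sample data are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    label: str      # phone or word label


def parse_alignment(text: str, sample_rate: int = 16000) -> List[Segment]:
    """Parse 'start end label' lines (times in samples) into segments in seconds."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, end, label = line.split(maxsplit=2)
        segments.append(
            Segment(int(start) / sample_rate, int(end) / sample_rate, label)
        )
    return segments


# Made-up alignment data in the assumed layout.
example = """\
0 3200 h#
3200 4800 sh
4800 8000 iy
"""

segments = parse_alignment(example)
for seg in segments:
    print(f"{seg.label}: {seg.start_s:.2f}s to {seg.end_s:.2f}s")
```

Keeping times in seconds rather than samples makes segments comparable across recordings with different sampling rates, at the cost of a floating-point conversion.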


The BABEL Speech Corpus
The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

Development of the BABEL Project: Following the creation of a speech corpus of European Union languages by the SAM project, funding was granted by the European Union for the creation along similar lines of a speech corpus of languages of Central and Eastern Europe, with the name of BABEL. The initial impetus came from the SAM (Speech Assessment Methods) project funded by the European Union as ESPRIT Project #1541 in 1987–89. This project was conducted by an international group of phoneticians, and was applied in the first instance to the European Communities languages Danish, Dutch, English, French, German, and Italian (by 1989). SAM produced many speech research tools (including t ...


Spoken English Corpus
The Spoken English Corpus (SEC) is a speech corpus: a collection of recordings of spoken British English compiled during 1984-7. The corpus manual can be found on ICAME.

History: The Spoken English Corpus (SEC) project was supported jointly in 1984-5 by the Humanities Research Fund at Lancaster University and by IBM (UK) Ltd, and subsequently by IBM UK Ltd. The project was supported by Geoffrey Leech at Lancaster and Geoffrey Kaye at IBM. The project was a collaboration, funded by IBM, between the Unit for Computer Research on the English Language (UCREL) at the University of Lancaster and the IBM Scientific Centre in Winchester.

Compilation: SEC comprises 53 recorded passages, mainly from the BBC, spoken in the accent usually referred to as Received Pronunciation, or RP. The collection covers categories such as commentary, news broadcast, lecture, dialogue, poetry and propaganda. The corpus contains 52,637 words, totalling 339 minutes. The compilation of the corpus is described by ...


Praat
Praat (Dutch for "talk") is a free computer software package for speech analysis in phonetics. It was designed, and continues to be developed, by Paul Boersma and David Weenink of the University of Amsterdam. It can run on a wide range of operating systems, including various versions of Unix, Linux, Mac and Microsoft Windows (2000, XP, Vista, 7, 8, 10). The program supports speech synthesis, including articulatory synthesis. Its logo depicts a mouth over an ear.