The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

Development of the BABEL Project

Following the creation of a speech corpus of European Union languages by the SAM project, funding was granted by the European Union for the creation, along similar lines, of a speech corpus of languages of Central and Eastern Europe, named BABEL. The initial impetus came from the SAM (Speech Assessment Methods) project, funded by the European Union as ESPRIT Project #1541 in 1987–89. This project was conducted by an international group of phoneticians, and was applied in the first instance to the European Communities languages Danish, Dutch, English, French, German, and Italian (by 1989). SAM produced many speech research tools (including the SAMPA computer-based phonetic transcription, which was also used for the BABEL project) and a corpus of recorded speech material distributed on CD-ROM. A proposal was made to the European Union under the Copernicus initiative in 1994, with the objective of creating a corpus of spoken Bulgarian, Estonian, Hungarian, Polish and Romanian, and Grant #1304 was awarded for this. A pilot project to create a small corpus of spoken Bulgarian was carried out jointly by the Universities of Sofia (Bulgaria) and Reading (U.K.). The initial meeting of the whole project team took place at the University of Reading in 1995.

Recorded material

Since the objective was to produce material suitable for use in speech technology applications, the digital recordings were made in strictly controlled conditions in recording studios. For each language the material had the following composition:
* Many-talker set: 30 males and 30 females each read 100 numbers, 3 connected-speech passages and 5 "filler" sentences (to provide further instances of some items), or 4 passages if no fillers were needed.
* Few-talker set: 5 males and 5 females, normally selected from the above group, each read 5 blocks of 100 numbers, 15 passages and 25 filler sentences, plus 5 lists of syllables.
* Very-few-talker set: 1 male and 1 female selected from the above read 5 blocks of syllables, with and without carrier sentences.
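The per-language composition described above can be tallied with a short script. This is only an illustrative sketch: the `TalkerSet` class, its field names, and the item labels are hypothetical and are not part of the BABEL distribution. It models the default many-talker case (3 passages plus 5 fillers) and counts syllable blocks once, ignoring the with/without carrier-sentence conditions.

```python
from dataclasses import dataclass


@dataclass
class TalkerSet:
    """Hypothetical description of one speaker set for a single language."""
    name: str
    males: int
    females: int
    items_per_speaker: dict  # item type -> number read by each speaker

    @property
    def speakers(self) -> int:
        return self.males + self.females

    def totals(self) -> dict:
        """Total readings per item type across all speakers in the set."""
        return {item: n * self.speakers
                for item, n in self.items_per_speaker.items()}


# Figures taken from the composition listed in the text.
MANY = TalkerSet("many-talker", 30, 30,
                 {"numbers": 100, "passages": 3, "filler_sentences": 5})
FEW = TalkerSet("few-talker", 5, 5,
                {"numbers": 5 * 100,  # 5 blocks of 100 numbers
                 "passages": 15,
                 "filler_sentences": 25,
                 "syllable_lists": 5})
VERY_FEW = TalkerSet("very-few-talker", 1, 1,
                     {"syllable_blocks": 5})  # carrier conditions not counted

if __name__ == "__main__":
    for talker_set in (MANY, FEW, VERY_FEW):
        print(talker_set.name, talker_set.speakers, talker_set.totals())
```

Under these assumptions, the many-talker set yields 6,000 number readings per language and the few-talker set 5,000, which shows how the smaller set compensates for fewer speakers with more material per speaker.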

Membership of the BABEL Project

Project Director: Peter Roach (University of Reading)

Project leaders in Central and Eastern Europe

* Bulgaria: initially A. Misheva, until her death in 1995, then S. Dimitrova (University of Sofia)
* Estonia: E. Meister (University of Tallinn)
* Hungary: K. Vicsi (Technical University of Budapest)
* Poland: R. Gubrynowicz (Polish Academy of Sciences) and W. Gonet (University of Lublin)
* Romania: M. Boldea (University of Timișoara)

Project members in Western Europe

* France: L. Lamel (LIMSI, Paris); A. Marchal (CNRS)
* Germany: W. Barry (University of Saarbrücken); K. Marasek (University of Stuttgart)
* United Kingdom: J. Wells (University College London); P. Roach (University of Reading)

Project outcomes

An intermediate project assessment meeting was held in Lublin, Poland, in 1996. Work then continued until a final assessment and presentation of outcomes at the First International Conference on Language Resources and Evaluation in Granada, Spain, in 1998. The project was completed in December 1998. The resulting set of corpora was then supplied to the European Language Resources Association (ELRA), which is exclusively responsible for distributing the material to users via its website. At the time of its completion, BABEL was the largest high-quality speech database available for research purposes in languages such as Hungarian [Fegyó, Tibor; Péter Mihajlik; Péter Tatai; Géza Gordos (2001). "Pronunciation modeling in Hungarian number recognition." In INTERSPEECH, pp. 1465-1468.] and Estonian. It has been used for research into topics such as pronunciation modeling and automatic speech recognition. The project was also part of what has been called the most significant recent development in corpus linguistics: the increasing range of languages covered by corpus data, which promises to bring to a wider range of languages the benefits that corpus linguistics has brought to the study of Western European languages.
