Speech translation is the process by which

conversation Conversation is interactive communication between two or more people. The development of conversational skills and etiquette is an important part of socialization. The development of conversational skills in a new language is a frequent focus ...

al spoken phrases are instantly

translated Translation is the communication of the meaning of a source-language text by means of an equivalent target-language text. The English language draws a terminological distinction (which does not exist in every language) between ''transla ...

and spoken aloud in a second language. This differs from phrase translation, which is where the system only translates a fixed and finite set of phrases that have been manually entered into the system. Speech translation technology enables speakers of different languages to communicate. It thus is of tremendous value for humankind in terms of science, cross-cultural exchange and global business.

How it works

A speech translation system would typically integrate the following three software technologies:

automatic speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...

(ASR),

machine translation Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...

(MT) and

voice synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal langua ...

(TTS). The speaker of language A speaks into a microphone and the

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...

module recognizes the utterance. It compares the input with a phonological model, consisting of a large

corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...

of speech data from multiple speakers. The input is then converted into a string of words, using dictionary and grammar of language A, based on a massive corpus of text in language A. The

module then translates this string. Early systems replaced every word with a corresponding word in language B. Current systems do not use word-for-word translation, but rather take into account the entire context of the input to generate the appropriate translation. The generated translation

utterance In spoken language analysis, an utterance is a continuous piece of speech, often beginning and ending with a clear pause. In the case of oral languages, it is generally, but not always, bounded by silence. Utterances do not exist in written langu ...

is sent to the

speech synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal languag ...

module, which estimates the pronunciation and intonation matching the string of words based on a

of speech data in language B. Waveforms matching the text are selected from this database and the speech synthesis connects and outputs them."Overcoming the Language Barrier with Speech Translation Technology"
by Satoshi, Nakamura in ''Science & Technology Trends - Quarterly Review No.31'' April 2009

History

In 1983,

NEC Corporation is a Japanese multinational information technology and electronics corporation, headquartered in Minato, Tokyo. The company was known as the Nippon Electric Company, Limited, before rebranding in 1983 as NEC. It provides IT and network soluti ...

demonstrated speech translation as a concept exhibit at the

ITU Telecom World ITU Telecom is part of the ITU (International Telecommunication Union), the United Nations specialized agency for information and communication technologies – ICTs. ITU Telecom organizes the global ITU Telecom World event, the platform for inn ...

(Telecom '83). In 1999, the C-Star-2 consortium demonstrated speech-to-speech translation of 5 languages including English, Japanese, Italian, Korean, and German."A Japanese-to-English Speech Translation System: ATR-MATRIX"
by Takezawa, Morimoto, Sagisaka, Campbell, Iida, Sugaya, Yokoo, Yamamoto in ''Proceedings of the International Conference on Spoken Language Processing'' 1998

Features

Apart from the problems involved in the text translation, it also has to deal with special problems occur in speech-to-speech translation, incorporating incoherence of spoken language, fewer grammar constraints of spoken language, unclear word boundary of spoken language, the correction of speech recognition errors and multiple optional inputs. Additionally, speech-to-speech translation also has its advantages compared with text translation, including less complex structure of spoken language and less vocabulary in spoken language.

Research and development

Research and development has gradually progressed from relatively simple to more advanced translation. International evaluation workshops were established to support the development of speech-translation technology. They allow research institutes to cooperate and compete against each other at the same time. The concept of those workshop is a kind of contest: a common dataset is provided by the organizers and the participating research institutes create systems that are evaluated. In this way, efficient research is being promoted. The International Workshop on Spoken Language Translation
IWSLT
, organized by C-STAR, an international

consortium A consortium (plural: consortia) is an association of two or more individuals, companies, organizations or governments (or any combination of these entities) with the objective of participating in a common activity or pooling their resources for ...

for research on speech translation, has been held since 2004. "Every year, the number of participating institutes increases, and it has become a key event for speech translation research."

Standards

When many countries begin to research and develop speech translation, it will be necessary to standardize interfaces and data formats to ensure that the systems are mutually compatible. International joint research is being fostered by speech translation consortiums (e.g. the C-STAR international consortium for joint research of speech translation and A-STAR for the Asia-Pacific region). They were founded as "international joint-research organization to design formats of bilingual corpora that are essential to advance the research and development of this technology ... and to standardize interfaces and data formats to connect speech translation module internationally".

Applications

Today, speech translation systems are being used throughout the world. Examples include medical facilities, schools, police, hotels, retail stores, and factories. These systems are applicable anywhere that spoken language is being used to communicate. A popular application is Jibbigo that works offline.

Challenges and future prospects

Currently, speech translation technology is available as product that instantly translates free form multi-lingual conversations. These systems instantly translate continuous speech. Challenges in accomplishing this include overcoming speaker-dependent variations in style of speaking or

pronunciation Pronunciation is the way in which a word or a language is spoken. This may refer to generally agreed-upon sequences of sounds used in speaking a given word or language in a specific dialect ("correct pronunciation") or simply the way a particular ...

are issues that have to be dealt with in order to provide high-quality translation for all users. Moreover,

systems must be able to remedy external factors such as acoustic noise or speech by other speakers in real-world use of speech translation systems. For the reason that the user does not understand the target language when speech translation is used, a method "must be provided for the user to check whether the translation is correct, by such means as translating it again back into the user's language". In order to achieve the goal of erasing the language barrier worldwide, multiple languages have to be supported. This requires speech corpora, bilingual corpora and text corpora for each of the estimated 6,000 languages said to exist on our planet today. As the collection of

corpora Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...

is extremely expensive, collecting data from the Web would be an alternative to conventional methods. "Secondary use of news or other media published in multiple languages would be an effective way to improve performance of speech translation." However, "current

copyright A copyright is a type of intellectual property that gives its owner the exclusive right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time. The creative work may be in a literary, artistic, education ...

law does not take secondary uses such as these types of corpora into account" and thus "it will be necessary to revise it so that it is more flexible."

References

{{Reflist, 30em Translation