Statistical machine translation (SMT) is a machine translation approach where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation. It superseded the earlier rule-based approach, which required an explicit description of each and every linguistic rule; this was costly and often did not generalize to other languages.

The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in the late 1980s and early 1990s by researchers at IBM's Thomas J. Watson Research Center. Before the introduction of neural machine translation, it was by far the most widely studied machine translation method.

Basis

The idea behind statistical machine translation comes from information theory. A document is translated according to the probability distribution $p(e|f)$ that a string $e$ in the target language (for example, English) is the translation of a string $f$ in the source language (for example, French).

The problem of modeling the probability distribution $p(e|f)$ has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is $p(e|f) \propto p(f|e) p(e)$, where the translation model $p(f|e)$ is the probability that the source string is the translation of the target string, and the language model $p(e)$ is the probability of seeing that target language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation $\tilde{e}$ is done by picking the one that gives the highest probability:

$$\tilde{e} = \arg\max_{e \in e^*} p(e|f) = \arg\max_{e \in e^*} p(f|e) p(e)$$

For a rigorous implementation of this, one would have to perform an exhaustive search by going through all strings $e^*$ in the native language. Performing the search efficiently is the work of a machine translation decoder, which uses the foreign string, heuristics and other methods to limit the search space while maintaining acceptable quality. This trade-off between quality and time usage can also be found in speech recognition.

As the translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence. Language models are typically approximated by smoothed ''n''-gram models, and similar approaches have been applied to translation models, but this introduces additional complexity due to different sentence lengths and word orders in the languages.

Statistical translation models were initially word-based (Models 1-5 from IBM, the Hidden Markov model from Stephan Vogel, and Model 6 from Franz Josef Och), but significant advances were made with the introduction of phrase-based models. Later work incorporated syntax or quasi-syntactic structures. (D. Chiang (2005). "A Hierarchical Phrase-Based Model for Statistical Machine Translation". In ''Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)''.)
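
The decomposition into a translation model and a language model can be illustrated with a minimal sketch. The probabilities below are invented for illustration, and the names `translation_model`, `language_model` and `decode` are hypothetical; a real decoder searches a far larger candidate space with heuristics rather than scoring a fixed list.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
# translation_model[(f, e)] stands in for p(f|e); language_model[e] for p(e).
translation_model = {
    ("la maison", "the house"): 0.7,
    ("la maison", "house the"): 0.7,
    ("la maison", "the home"): 0.2,
}
language_model = {
    "the house": 0.05,
    "house the": 0.0001,
    "the home": 0.02,
}

def decode(f, candidates):
    """Return the candidate e that maximizes log p(f|e) + log p(e)."""
    def score(e):
        # Unseen events get a tiny floor probability so the log is defined.
        tm = translation_model.get((f, e), 1e-12)
        lm = language_model.get(e, 1e-12)
        return math.log(tm) + math.log(lm)
    return max(candidates, key=score)

best = decode("la maison", list(language_model))
print(best)  # the house
```

In this toy setting the translation model cannot distinguish "the house" from "house the"; the language model breaks the tie in favor of the fluent word order.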

Benefits

The most frequently cited benefits of statistical machine translation (SMT) over the rule-based approach are:
* More efficient use of human and data resources
** There are many parallel corpora in machine-readable format and even more monolingual data.
** Generally, SMT systems are not tailored to any specific pair of languages.
* More fluent translations owing to the use of a language model

Shortcomings

* Corpus creation can be costly.
* Specific errors are hard to predict and fix.
* Results may have superficial fluency that masks translation problems.
* Statistical machine translation usually works less well for language pairs with significantly different word order.
* The benefits obtained for translation between Western European languages are not representative of results for other language pairs, owing to smaller training corpora and greater grammatical differences.

Word-based translation

In word-based translation, the fundamental unit of translation is a word in some natural language. Typically, the number of words in translated sentences is different, because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. Information theory necessarily assumes that each covers the same concept. In practice this is not really true. For example, the English word ''corner'' can be translated into Spanish as either ''rincón'' or ''esquina'', depending on whether it means an internal or external angle.

Simple word-based translation cannot translate between languages with different fertility. Word-based translation systems can relatively simply be made to cope with high fertility, such that they can map a single word to multiple words, but not the other way around. For example, when translating from English to French, each word in English could produce any number of French words, sometimes none at all. But there is no way to group two English words so that they produce a single French word.

An example of a word-based translation system is the freely available GIZA++ package (GPLed), which includes the training program for the IBM models, the HMM model and Model 6.

Word-based translation is not widely used today; phrase-based systems are more common. Most phrase-based systems still use GIZA++ to align the corpus. The alignments are used to extract phrases or deduce syntax rules. Matching words in bi-text is still a problem actively discussed in the community. Because of the predominance of GIZA++, there are now several distributed implementations of it online.
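
As a rough illustration of how a word-based model can be trained, the following sketch estimates lexical translation probabilities with expectation maximization in the style of IBM Model 1. It is a simplification under stated assumptions: there is no NULL word, no fertility or distortion component, and the three-sentence corpus and the name `train_ibm_model1` are hypothetical. It is not the GIZA++ implementation.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f|e) with EM from (foreign_tokens, english_tokens) pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start from a uniform distribution over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked
        for fs, es in pairs:
            for f in fs:
                # E-step: distribute one unit of mass for f over all e.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Hypothetical toy corpus (German source, English target).
pairs = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm_model1(pairs)
print(round(t[("Haus", "house")], 3), round(t[("das", "house")], 3))
```

Although "das" and "Haus" co-occur with "house" equally often, the repeated E- and M-steps shift probability mass toward "Haus", because "das" is better explained by "the" elsewhere in the corpus.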

Phrase-based translation

In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words are called blocks or phrases. These are typically not linguistic phrases, but phrasemes that were found using statistical methods from corpora. It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreased the quality of translation.

The chosen phrases are further mapped one-to-one based on a phrase translation table, and may be reordered. This table can be learnt based on word alignment, or directly from a parallel corpus. In the latter case, the model is trained using the expectation-maximization algorithm, similarly to the word-based IBM models.
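
Learning a phrase table from word alignments can be sketched as follows. This is a simplified version of consistency-based phrase extraction: it keeps a pair of spans only if no alignment link leaves the pair, and it does not extend spans over unaligned words as fuller implementations do. The function name `extract_phrases`, the `max_len` limit and the example sentence pair are assumptions made for illustration.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with a word alignment.

    alignment is a set of (src_index, tgt_index) links. A pair of spans is
    consistent when it contains at least one link and no link connects a
    word inside one span to a word outside the other.
    """
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            if t2 - t1 >= max_len:
                continue
            # Reject if a target word in the span links outside the source span.
            if any(t1 <= j <= t2 and not (s1 <= i <= s2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

# Hypothetical aligned sentence pair: la-the, maison-house, bleue-blue.
src = "la maison bleue".split()
tgt = "the blue house".split()
alignment = {(0, 0), (1, 2), (2, 1)}
phrases = extract_phrases(src, tgt, alignment)
for pair in sorted(phrases):
    print(pair)
```

The span "la maison" yields no phrase pair here, because its target words "the" and "house" enclose "blue", which is linked to a source word outside the span.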

Syntax-based translation

Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances. The statistical counterpart of the old idea of syntax-based translation did not take off until the advent of strong stochastic parsers in the 1990s. Examples of this approach include DOP-based MT and later synchronous context-free grammars.

Hierarchical phrase-based translation

Hierarchical phrase-based translation combines the phrase-based and syntax-based approaches to translation. It uses synchronous context-free grammar rules, but the grammars can be constructed by an extension of methods for phrase-based translation without reference to linguistically motivated syntactic constituents. This idea was first introduced in Chiang's Hiero system (2005).
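
To give a feel for synchronous rules with gaps, the toy sketch below applies a rule whose source and target sides share a placeholder `X1`. The rules are hypothetical, and the regular-expression matcher is only a stand-in: actual hierarchical systems score many competing derivations with chart-based decoding, which this example does not attempt.

```python
import re

# Hypothetical synchronous rules: (source side, target side).
# "X1" marks a gap that is translated recursively on both sides.
rules = [
    ("ne X1 pas", "do not X1"),
    ("mange", "eat"),
    ("dort", "sleep"),
]

def translate(text):
    """Translate text with the first matching rule; leave it as-is otherwise."""
    for src, tgt in rules:
        if "X1" in src:
            pattern = re.escape(src).replace("X1", "(.+)")
            match = re.fullmatch(pattern, text)
            if match:
                # Translate the gap content, then place it in the target side.
                return tgt.replace("X1", translate(match.group(1)))
        elif text == src:
            return tgt
    return text

print(translate("ne mange pas"))  # do not eat
```

The rule with a gap captures the discontiguous French negation "ne ... pas" in a way that a contiguous phrase pair cannot, without referring to any linguistic category.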

Language models

A language model is an essential component of any statistical machine translation system, which aids in making the translation as fluent as possible. It is a function that takes a translated sentence and returns the probability of it being said by a native speaker. A good language model will, for example, assign a higher probability to the sentence "the house is small" than to "small the is house". Other than word order, language models may also help with word choice: if a foreign word has multiple possible translations, these functions may give better probabilities for certain translations in specific contexts in the target language.
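
The word-order example can be reproduced with a small bigram model. This is a sketch under stated assumptions: the three training sentences are invented, add-one smoothing is used only because it is the simplest choice, and the name `train_bigram_lm` is hypothetical.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Return a function giving the add-one smoothed log-probability of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    # Possible successors: all training words plus the end-of-sentence marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + len(vocab)))
            for a, b in zip(tokens, tokens[1:])
        )
    return log_prob

# Hypothetical training corpus.
log_prob = train_bigram_lm([
    "the house is small",
    "the house is big",
    "the garden is small",
])
fluent = log_prob("the house is small")
scrambled = log_prob("small the is house")
print(fluent > scrambled)  # True
```

Every bigram of the scrambled sentence is unseen in the training data, so it receives only smoothing mass, while each bigram of the fluent sentence was observed at least twice.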

Systems implementing statistical machine translation

* Google Translate (started transition to neural machine translation in 2016)
* Microsoft Translator (started transition to neural machine translation in 2016)
* Yandex.Translate (switched to hybrid approach incorporating neural machine translation in 2017)

Challenges with statistical machine translation

Problems with statistical machine translation include:

Sentence alignment

Single sentences in one language can be found translated into several sentences in the other and vice versa. Long sentences may be broken up, while short sentences may be merged. There are even languages that use writing systems without clear indication of a sentence end, such as Thai. Sentence aligning can be performed through the Gale-Church alignment algorithm. Efficient search and retrieval of the highest scoring sentence alignment is possible through this and other mathematical models.
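
The following sketch shows length-based sentence alignment by dynamic programming, in the spirit of the approach described above. It is not the Gale-Church algorithm itself: the probabilistic cost of that algorithm is replaced here by an absolute difference in character length plus fixed penalties per alignment type, and those penalty values, the function name `align_sentences` and the example sentences are all hypothetical.

```python
def align_sentences(src, tgt):
    """Align two sentence lists; returns a list of (src_group, tgt_group) beads."""
    # Bead types as (source sentences consumed, target sentences consumed),
    # each with a hypothetical fixed penalty that favors 1-1 alignments.
    beads = {(1, 1): 0, (1, 0): 40, (0, 1): 40, (2, 1): 10, (1, 2): 10}
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the best sequence of beads.
    result, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return result[::-1]

src = ["Il pleut.", "Je reste à la maison et je lis un livre."]
tgt = ["It is raining.", "I am staying at home.", "I am reading a book."]
alignment = align_sentences(src, tgt)
for bead in alignment:
    print(bead)
```

In this example the long French sentence is aligned with two English sentences, because their combined length matches it far better than either does alone.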

Word alignment

Sentence alignment is usually either provided by the corpus or obtained by the aforementioned Gale-Church alignment algorithm. To learn e.g. the translation model, however, we need to know which words align in a source-target sentence pair. The IBM-Models or the HMM-approach were attempts at solving this challenge. Function words that have no clear equivalent in the target language are another issue for the statistical models. For example, when translating from English to German, in the sentence "John does not live here", the word "does" has no clear alignment in the translated sentence "John wohnt hier nicht". Through logical reasoning, it may be aligned with the words "wohnt" (as it contains grammatical information for the English word "live") or "nicht" (as it only appears in the sentence because it is negated) or it may be unaligned.

Statistical anomalies

Real-world training sets may override the correct translation of, say, proper nouns. An example of such an anomaly is the phrase "I took the train to Berlin" being mistranslated as "I took the train to Paris" due to the statistical abundance of "train to Paris" in the training set.

Idiom and register

Depending on the corpora used, idiom and linguistic register might not receive a translation that accurately represents the original intent. For example, the popular Canadian Hansard bilingual corpus primarily consists of parliamentary speech examples, where "Hear, Hear!" is frequently associated with "Bravo!" Using a model built on this corpus to translate ordinary speech in a conversational register would lead to incorrect translation of the word ''hear'' as ''Bravo!'' (W. J. Hutchins and H. Somers (1992). ''An Introduction to Machine Translation'', 18.3:322.)

This problem is connected with word alignment: in very specific contexts, an idiomatic expression may be aligned with words that form an idiomatic expression of the same meaning in the target language. Such an alignment is unlikely to work in other contexts, however. For that reason, idioms should only be subjected to phrasal alignment, as they cannot be decomposed further without losing their meaning. This problem is specific to word-based translation.

Different word orders

Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance, where modifiers for nouns are located, or where the same words are used as a question or a statement.

In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks in order. This is not always the case with the same text in two languages. For SMT, the machine translator can only manage small sequences of words, and word order has to be thought of by the program designer. Attempts at solutions have included re-ordering models, where a distribution of location changes for each item of translation is guessed from aligned bi-text. Different location changes can be ranked with the help of the language model and the best can be selected.
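
One simple way to guess a distribution of location changes from aligned bi-text is to count relative jumps. The sketch below is an assumption-laden illustration: each target word is taken to be aligned to exactly one source position, the alignments are invented, and the name `learn_distortion` is hypothetical.

```python
from collections import Counter

def learn_distortion(aligned_sentences):
    """Estimate p(jump) from word alignments.

    Each item is a list of source positions, one per target word, in target
    order. A jump is the source position of the current target word minus
    the position right after the previous one, so 0 means monotone order.
    """
    jumps = Counter()
    for source_positions in aligned_sentences:
        expected = 0
        for pos in source_positions:
            jumps[pos - expected] += 1
            expected = pos + 1
    total = sum(jumps.values())
    return {jump: n / total for jump, n in jumps.items()}

# Hypothetical alignments: two monotone sentences and one with a swap.
distortion = learn_distortion([[0, 1, 2], [0, 2, 1], [0, 1, 2, 3]])
print(distortion)
```

A decoder could use such a table to penalize unlikely jumps, leaving the final choice among plausible orders to the language model.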

Out of vocabulary (OOV) words

SMT systems typically store different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data cannot be translated. This might be because of the lack of training data, changes in the human domain where the system is used, or differences in morphology.
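
The effect of treating word forms as unrelated symbols can be seen in a minimal sketch. The vocabulary and the function name `find_oov` are hypothetical.

```python
def find_oov(sentence, vocabulary):
    """Return the tokens of sentence that are absent from vocabulary."""
    return [word for word in sentence.split() if word not in vocabulary]

# Hypothetical training vocabulary: "house" was seen, "houses" was not.
vocabulary = {"the", "house", "is", "small"}
oov = find_oov("the houses are small", vocabulary)
print(oov)  # ['houses', 'are']
```

Without any morphological analysis, "houses" is as unknown to the system as a word it has never encountered in any form.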

External links

* Annotated list of statistical natural language processing resources (includes links to freely available statistical machine translation software)