Neural machine translation (NMT) is an approach to

machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

that uses an

artificial neural network In machine learning, a neural network (also artificial neural network or neural net, abbreviated ANN or NN) is a computational model inspired by the structure and functions of biological neural networks. A neural network consists of connected ...

to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. It is the dominant approach today and can produce translations that rival human translations when translating between high-resource languages under specific conditions. However, there still remain challenges, especially with languages where less high-quality data is available, and with domain shift between the data a system was trained on and the texts it is supposed to translate. NMT systems also tend to produce fairly literal translations.

Overview

In the translation task, a sentence

\mathbf = x_

(consisting of

I

tokens

x_i

) in the source language is to be translated into a sentence

\mathbf = x_

(consisting of

J

tokens

x_j

) in the target language. The source and target tokens (which in the simple event are used for each other in order for a particular game ] vectors, so they can be processed mathematically. NMT models assign a probability

P(y, x)

to potential translations y and then search a subset of potential translations for the one with the highest probability. Most NMT models are ''auto-regressive'': They model the probability of each target token as a function of the source sentence and the previously predicted target tokens. The probability of the whole translation then is the product of the probabilities of the individual predicted tokens:

P(y, x) = \prod_^ P(y_j ,  y_, \mathbf)

NMT models differ in how exactly they model this function

P

, but most use some variation of the ''encoder-decoder'' architecture: They first use an encoder network to process

\mathbf

and encode it into a vector or matrix representation of the source sentence. Then they use a decoder network that usually produces one target word at a time, taking into account the source representation and the tokens it previously produced. As soon as the decoder produces a special ''end of sentence'' token, the decoding process is finished. Since the decoder refers to its own previous outputs during, this way of decoding is called ''auto-regressive''.

History

Early approaches

In 1987, Robert B. Allen demonstrated the use of feedforward neural network, feed-forward neural networks for translating auto-generated English sentences with a limited vocabulary of 31 words into Spanish. In this experiment, the size of the network's input and output layers was chosen to be just large enough for the longest sentences in the source and target language, respectively, because the network did not have any mechanism to encode sequences of arbitrary length into a fixed-size representation. In his summary, Allen also already hinted at the possibility of using auto-associative models, one for encoding the source and one for decoding the target. Lonnie Chrisman built upon Allen's work in 1991 by training separate recursive auto-associative memory (RAAM) networks (developed by Jordan B. Pollack) for the source and the target language. Each of the RAAM networks is trained to encode an arbitrary-length sentence into a fixed-size hidden representation and to decode the original sentence again from that representation. Additionally, the two networks are also trained to share their hidden representation; this way, the source encoder can produce a representation that the target decoder can decode. Forcada and Ñeco simplified this procedure in 1997 to directly train a source encoder and a target decoder in what they called a ''recursive hetero-associative memory''. Also in 1997, Castaño and Casacuberta employed an Elman's recurrent neural network in another machine translation task with very limited vocabulary and complexity. Even though these early approaches were already similar to modern NMT, the computing resources of the time were not sufficient to process datasets large enough for the computational complexity of the machine translation problem on real-world texts. Instead, other methods like

statistical machine translation Statistical machine translation (SMT) is a machine translation approach where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contra ...

rose to become the state of the art of the 1990s and 2000s.

Hybrid approaches

During the time when statistical machine translation was prevalent, some works used neural methods to replace various parts in the statistical machine translation while still using the log-linear approach to tie them together. For example, in various works together with other researchers, Holger Schwenk replaced the usual n-gram language model with a neural one and estimated phrase translation probabilities using a feed-forward network.

seq2seq

In 2013 and 2014, end-to-end neural machine translation had their breakthrough with Kalchbrenner & Blunsom using a

convolutional neural network A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different ty ...

(CNN) for encoding the source and both Cho et al. and Sutskever et al. using a

recurrent neural network Recurrent neural networks (RNNs) are a class of artificial neural networks designed for processing sequential data, such as text, speech, and time series, where the order of elements is important. Unlike feedforward neural networks, which proces ...

(RNN) instead. All three used an RNN conditioned on a fixed encoding of the source as their decoder to produce the translation. However, these models performed poorly on longer sentences. This problem was addressed when Bahdanau et al. introduced

attention Attention or focus, is the concentration of awareness on some phenomenon to the exclusion of other stimuli. It is the selective concentration on discrete information, either subjectively or objectively. William James (1890) wrote that "Atte ...

to their encoder-decoder architecture: At each decoding step, the state of the decoder is used to calculate a source representation that focuses on different parts of the source and uses that representation in the calculation of the probabilities for the next token. Based on these RNN-based architectures,

Baidu Baidu, Inc. ( ; ) is a Chinese multinational technology company specializing in Internet services and artificial intelligence. It holds a dominant position in China's search engine market (via Baidu Search), and provides a wide variety of o ...

launched the "first large-scale NMT system" in 2015, followed by Google Neural Machine Translation in 2016. From that year on, neural models also became the prevailing choice in the main machine translation conference Workshop on Statistical Machine Translation. Gehring et al. combined a CNN encoder with an attention mechanism in 2017, which handled long-range dependencies in the source better than previous approaches and also increased translation speed because a CNN encoder is parallelizable, whereas an

RNN encoder RNN or rnn may refer to: * Random neural network, a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals * Recurrent neural network Recurrent neural networks (RNNs) are a class of artifici ...

has to encode one token at a time due to its recurrent nature. In the same year, “Microsoft Translator released AI-powered online neural machine translation (NMT).

DeepL Translator DeepL Translator is a neural machine translation service that was launched in August 2017 and is owned by Cologne-based DeepL SE. The translating system was first developed within Linguee and launched as entity ''DeepL''. It initially offered t ...

, which was at the time based on a

CNN encoder Cable News Network (CNN) is a multinational news organization operating, most notably, a website and a TV channel headquartered in Atlanta. Founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable news ...

, was also released in the same year and was judged by several news outlets to outperform its competitors. It has also been seen that

OpenAI OpenAI, Inc. is an American artificial intelligence (AI) organization founded in December 2015 and headquartered in San Francisco, California. It aims to develop "safe and beneficial" artificial general intelligence (AGI), which it defines ...

GPT-3 Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer model of deep neural network, which supersedes recurrence and convolution-based ...

released in 2020 can function as a neural machine translation system. Some other machine translation systems, such as Microsoft translator and SYSTRAN can be also seen to have integrated neural networks into their operations.

Transformer

Another network architecture that lends itself to parallelization is the

transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...

, which was introduced by Vaswani et al. also in 2017. Like previous models, the transformer still uses the attention mechanism for weighting encoder output for the decoding steps. However, the transformer's encoder and decoder networks themselves are also based on attention instead of recurrence or convolution: Each layer weighs and transforms the previous layer's output in a process called ''self-attention''. Since the attention mechanism does not have any notion of token order, but the order of words in a sentence is obviously relevant, the token embeddings are combined with an explicit encoding of their position in the sentence. Since both the transformer's encoder and decoder are free from recurrent elements, they can both be parallelized during training. However, the original transformer's decoder is still auto-regressive, which means that decoding still has to be done one token at a time during inference. The transformer model quickly became the dominant choice for machine translation systems and was still by far the most-used architecture in the Workshop on Statistical Machine Translation in 2022 and 2023. Usually, NMT models’ weights are initialized randomly and then learned by training on parallel datasets. However, since using

large language models A large language model (LLM) is a language model trained with Self-supervised learning, self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially Natural language generation, language g ...

(LLMs) such as BERT pre-trained on large amounts of monolingual data as a starting point for learning other tasks has proven very successful in wider NLP, this paradigm is also becoming more prevalent in NMT. This is especially useful for low-resource languages, where large parallel datasets do not exist. An example of this is the mBART model, which first trains one transformer on a multilingual dataset to recover masked tokens in sentences, and then fine-tunes the resulting

autoencoder An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function ...

on the translation task.

Generative LLMs

Instead of fine-tuning a pre-trained language model on the translation task, sufficiently large generative models can also be directly prompted to translate a sentence into the desired language. This approach was first comprehensively tested and evaluated for GPT 3.5 in 2023 by Hendy et al. They found that "GPT systems can produce highly fluent and competitive translation outputs even in the zero-shot setting especially for the high-resource language translations". The WMT23 evaluated the same approach (but using

GPT-4 Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model trained and created by OpenAI and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the p ...

) and found that it was on par with the state of the art when translating into English, but not quite when translating into lower-resource languages. This is plausible considering that GPT models are trained mainly on English text.

Comparison with statistical machine translation

NMT has overcome several challenges that were present in statistical machine translation (SMT): * NMT's full reliance on continuous representation of tokens overcame sparsity issues caused by rare words or phrases. Models were able to generalize more effectively. * The limited n-gram length used in SMT's n-gram language models caused a loss of context. NMT systems overcome this by not having a hard cut-off after a fixed number of tokens and by using attention to choosing which tokens to focus on when generating the next token. * End-to-end training of a single model improved translation performance and also simplified the whole process. * The huge n-gram models (up to 7-gram) used in SMT required large amounts of memory, whereas NMT requires less.

Training procedure

Cross-entropy loss

NMT models are usually trained to maximize the likelihood of observing the training data. I.e., for a dataset of

T

source sentences

X = \mathbf^, ..., \mathbf^

and corresponding target sentences

Y = \mathbf^, ..., \mathbf^

, the goal is finding the model parameters

\theta^*

that maximize the sum of the likelihood of each target sentence in the training data given the corresponding source sentence:

\theta^* = \underset \sum_i^T P_(\mathbf^, \mathbf^)

Expanding to token level yields:

\theta^* = \underset \sum_i^T \prod_^ P(y_j^ ,  y_^, \mathbf^)

Since we are only interested in the maximum, we can just as well search for the maximum of the logarithm instead (which has the advantage that it avoids floating point underflow that could happen with the product of low probabilities).

\theta^* = \underset \sum_i^T \log\prod_^ P(y_j^ ,  y_^, \mathbf^)

Using the fact that the logarithm of a product is the sum of the factors’ logarithms and flipping the sign yields the classic cross-entropy loss:

\theta^* = \underset - \sum_i^T \log\sum_^ P(y_j^ ,  y_^, \mathbf^)

In practice, this minimization is done iteratively on small subsets (mini-batches) of the training set using

stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an Iterative method, iterative method for optimizing an objective function with suitable smoothness properties (e.g. Differentiable function, differentiable or Subderivative, subdifferentiable ...

Teacher forcing

During inference, auto-regressive decoders use the token generated in the previous step as the input token. However, the vocabulary of target tokens is usually very large. So, at the beginning of the training phase, untrained models will pick the wrong token almost always; and subsequent steps would then have to work with wrong input tokens, which would slow down training considerably. Instead, ''teacher forcing'' is used during the training phase: The model (the “student” in the teacher forcing metaphor) is always fed the previous ground-truth tokens as input for the next token, regardless of what it predicted in the previous step.

Translation by prompt engineering LLMs

As outlined in the history section above, instead of using an NMT system that is trained on parallel text, one can also prompt a generative LLM to translate a text. These models differ from an encoder-decoder NMT system in a number of ways: * Generative language models are not trained on the translation task, let alone on a parallel dataset. Instead, they are trained on a language modeling objective, such as predicting the next word in a sequence drawn from a large dataset of text. This dataset can contain documents in many languages, but is in practice dominated by English text. After this pre-training, they are fine-tuned on another task, usually to follow instructions. * Since they are not trained on translation, they also do not feature an encoder-decoder architecture. Instead, they just consist of a transformer's decoder. * In order to be competitive on the machine translation task, LLMs need to be much larger than other NMT systems. E.g., GPT-3 has 175 billion parameters, while mBART has 680 million and the original transformer-big has “only” 213 million. This means that they are computationally more expensive to train and use. A generative LLM can be prompted in a zero-shot fashion by just asking it to translate a text into another language without giving any further examples in the prompt. Or one can include one or several example translations in the prompt before asking to translate the text in question. This is then called one-shot or few-shot learning, respectively. For example, the following prompts were used by Hendy et al. (2023) for zero-shot and one-shot translation:

### Translate this sentence from ource language 

The Ource () is a  long river in northeastern France, a right tributary of the river Seine. Its source is in the Haute-Marne  department, 2 km south of  Poinson-lès-Grancey. It flows generally northwest. It joins the Seine at Bar-sur-Seine.
 ...
to arget language 

Arget (; ) is a Communes of France, commune in Pyrénées-Atlantiques, a Departments of France, department in the Nouvelle-Aquitaine region of south-western France. It is part of the traditional province of Béarn.
 Geography
Arget is located some ...
 Source:
ource sentence 

The Ource () is a  long river in northeastern France, a right tributary of the river Seine. Its source is in the Haute-Marne  department, 2 km south of  Poinson-lès-Grancey. It flows generally northwest. It joins the Seine at Bar-sur-Seine.
 ...
### Target:

Translate this into 1. arget language 

Arget (; ) is a Communes of France, commune in Pyrénées-Atlantiques, a Departments of France, department in the Nouvelle-Aquitaine region of south-western France. It is part of the traditional province of Béarn.
 Geography
Arget is located some ...

 hot 1 source1. hot 1 reference 

Hot commonly refers refer to:

*Heat, a hot temperature
*Pungency, in food, a spicy or hot quality

Hot or HOT may also refer to:
  Places
* Hot district, a district of Chiang Mai province, Thailand
** Hot subdistrict, a sub-district of Hot Distri ...
Translate this into 1. arget language 

Arget (; ) is a Communes of France, commune in Pyrénées-Atlantiques, a Departments of France, department in the Nouvelle-Aquitaine region of south-western France. It is part of the traditional province of Béarn.
 Geography
Arget is located some ...

 nput1.

Literature

* Koehn, Philipp (2020)
Neural Machine Translation.
Cambridge University Press. * Stahlberg, Felix (2020)
Neural Machine Translation: A Review and Survey.

References

{{Artificial intelligence navbox Applications of artificial intelligence Computational linguistics Machine translation Tasks of natural language processing