Google Neural Machine Translation (GNMT) was a neural machine translation (NMT) system developed by Google and introduced in November 2016 that used an artificial neural network to increase fluency and accuracy in Google Translate.
The neural network consisted of two main blocks, an encoder and a decoder, both of LSTM architecture with eight 1024-wide layers each, and a simple one-layer, 1024-wide feedforward attention mechanism connecting them.
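To make the role of the attention block concrete, the following is a minimal sketch of a one-hidden-layer feedforward (additive) attention step in plain Python. It is an illustration under stated assumptions, not GNMT's actual code: the function names, the tiny 2-dimensional vectors, and the identity weight matrices are all hypothetical, and the exact scoring function used in GNMT may differ in detail.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(query, keys, w_query, w_key, v):
    """Score each encoder state (key) against the decoder state (query)
    with a one-hidden-layer feedforward network, then return the
    attention weights and the weighted average of the keys (context)."""
    q_proj = matvec(w_query, query)
    scores = []
    for key in keys:
        k_proj = matvec(w_key, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(dim)]
    return weights, context

# Hypothetical toy inputs: one decoder state and three encoder states.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    w_query=identity,
    w_key=identity,
    v=[1.0, 1.0],
)
print(weights, context)
```

In a real system the weight matrices would be learned and 1024 units wide; the sketch only shows how the decoder state is compared with every encoder state to produce a weighted context vector.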
The total number of parameters has been variously described as over 160 million, approximately 210 million, 278 million or 380 million. It used a WordPiece tokenizer and a beam search decoding strategy. It ran on Tensor Processing Units.
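As an illustration of the beam search decoding strategy mentioned above, here is a minimal sketch in Python. The probability table, the token names, and the `beam_search` function are hypothetical stand-ins for a real decoder, and the sketch omits refinements such as length normalization; it only shows how keeping several hypotheses can beat a greedy choice.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token only.
TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_log_probs(sequence):
    """Return log-probabilities of each possible next token."""
    return {tok: math.log(p) for tok, p in TABLE[sequence[-1]].items()}

def beam_search(next_log_probs, start, end, beam_width=2, max_len=10):
    """Keep the beam_width best partial hypotheses at every step and
    return the (score, sequence) pair with the highest log-probability."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[0])

greedy_score, greedy_seq = beam_search(toy_log_probs, "<s>", "</s>", beam_width=1)
beam_score, beam_seq = beam_search(toy_log_probs, "<s>", "</s>", beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With a beam width of 1 the search commits to the locally most likely first word, while a width of 2 keeps the alternative alive long enough to find the sequence with the higher overall probability.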
By 2020, the system had been replaced by another deep learning system based on a Transformer encoder and an RNN decoder.
GNMT improved on the quality of translation by applying an example-based (EBMT) machine translation method in which the system learns from millions of examples of language translation.
GNMT's proposed architecture of system learning was first tested on over a hundred languages supported by Google Translate.
With the large end-to-end framework, the system learns over time to create better, more natural translations.
GNMT attempts to translate whole sentences at a time, rather than just piece by piece.
The GNMT network can undertake
interlingual machine translation by encoding the semantics of the sentence, rather than by memorizing phrase-to-phrase translations.
History
The Google Brain project was established in 2011 in the "secretive Google X research lab" by Google Fellow Jeff Dean, Google Researcher Greg Corrado, and Stanford University Computer Science professor Andrew Ng.
Ng's work has led to some of the biggest breakthroughs at Google and Stanford.
In November 2016, the Google Neural Machine Translation system (GNMT) was introduced. From then on, Google Translate used neural machine translation (NMT) in preference to its previous statistical machine translation (SMT) methods, a proprietary, in-house technology that had been in use since October 2007.
Training GNMT was a major effort at the time and took, by a 2018 OpenAI estimate, on the order of 79 petaFLOP-days (about 7e21 FLOPs) of compute, which was 1.5 orders of magnitude more than the Seq2seq model of 2014 (but about 2x less than GPT-J-6B in 2021).
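The unit conversion behind these figures can be checked with a short calculation. The sketch below assumes that one petaFLOP-day means 10^15 floating-point operations per second sustained for 24 hours; the Seq2seq and GPT-J-6B values are merely derived from the ratios quoted above and are not independent estimates.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETA = 10 ** 15

def petaflop_days_to_flops(pf_days):
    """Convert petaFLOP-days into a total count of floating-point operations."""
    return pf_days * PETA * SECONDS_PER_DAY

gnmt_flops = petaflop_days_to_flops(79)

# Implied by the ratios in the text, not independently sourced:
seq2seq_flops = gnmt_flops / 10 ** 1.5   # 1.5 orders of magnitude less
gptj_flops = 2 * gnmt_flops              # about 2x more

print(f"GNMT:     {gnmt_flops:.4e} FLOPs")   # 6.8256e+21, i.e. roughly 7e21
print(f"Seq2seq:  {seq2seq_flops:.1e} FLOPs (implied)")
print(f"GPT-J-6B: {gptj_flops:.1e} FLOPs (implied)")
```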
Google Translate's NMT system uses a large artificial neural network capable of deep learning.
By using millions of examples, GNMT improves the quality of translation,
using broader context to deduce the most relevant translation. The result is then rearranged and adapted to approach grammatically based human language.
GNMT did not create its own universal interlingua but rather aimed at finding the commonality between many languages using insights from psychology and linguistics.
The new translation engine was first enabled in November 2016 for eight languages, translating to and from English: French, German, Spanish, Portuguese, Chinese, Japanese, Korean and Turkish.
In March 2017, three additional languages were enabled: Russian, Hindi and Vietnamese along with Thai for which support was added later. Support for Hebrew and Arabic was also added with help from the Google Translate Community in the same month. In mid April 2017 Google Netherlands announced support for Dutch and other European languages related to English. Further support was added for nine Indian languages: Hindi, Bengali, Marathi, Gujarati, Punjabi, Tamil, Telugu, Malayalam and Kannada at the end of April 2017.
By 2020, Google had changed methodology to use a different neural network system based on transformers, and had phased out GNMT.
Evaluation
The GNMT system was said to represent an improvement over the former Google Translate in that it could handle "zero-shot translation", that is, it translates directly from one language into another. For example, it might be trained only for Japanese-English and Korean-English translation, yet still perform Japanese-Korean translation. The system appears to have learned to produce a language-independent intermediate representation of language (an "interlingua"), which allows it to perform zero-shot translation by converting from and to the interlingua.
Google Translate previously first translated the
source language into English and then translated the English into the
target language rather than translating directly from one language to another.
A July 2019 study in ''Annals of Internal Medicine'' found that "Google Translate is a viable, accurate tool for translating non–English-language trials". Only one disagreement between reviewers reading machine-translated trials was due to a translation error. Since many medical studies are excluded from systematic reviews because the reviewers do not understand the language, GNMT has the potential to reduce bias and improve accuracy in such reviews.
Languages supported by GNMT
As of December 2021, all of the languages of Google Translate support GNMT, with Latin being the most recent addition.
# Afrikaans
# Albanian
# Amharic
# Arabic
# Armenian
# Azerbaijani
# Basque
# Belarusian
# Bengali
# Bosnian
# Bulgarian
# Burmese
# Catalan
# Cebuano
# Chewa
# Chinese (Simplified)
# Chinese (Traditional)
# Corsican
# Croatian
# Czech
# Danish
# Dutch
# English
# Esperanto
# Estonian
# Filipino (Tagalog)
# Finnish
# French
# Galician
# Georgian
# German
# Greek
# Gujarati
# Haitian Creole
# Hausa
# Hawaiian
# Hebrew
# Hindi
# Hmong
# Hungarian
# Icelandic
# Igbo
# Indonesian
# Irish
# Italian
# Japanese
# Javanese
# Kannada
# Kazakh
# Khmer
# Kinyarwanda
# Korean
# Kurdish (Kurmanji)
# Kyrgyz
# Lao
# Latin
# Latvian
# Lithuanian
# Luxembourgish
# Macedonian
# Malagasy
# Malay
# Malayalam
# Maltese
# Maori
# Marathi
# Mongolian
# Nepali
# Norwegian (Bokmål)
# Odia
# Pashto
# Persian
# Polish
# Portuguese
# Punjabi (Gurmukhi)
# Romanian
# Russian
# Samoan
# Scottish Gaelic
# Serbian
# Shona
# Sindhi
# Sinhala
# Slovak
# Slovenian
# Somali
# Sotho
# Spanish
# Sundanese
# Swahili
# Swedish
# Tajik
# Tamil
# Tatar
# Telugu
# Thai
# Turkish
# Turkmen
# Ukrainian
# Urdu
# Uyghur
# Uzbek
# Vietnamese
# Welsh
# West Frisian
# Xhosa
# Yiddish
# Yoruba
# Zulu
See also
* Example-based machine translation
* Rule-based machine translation
* Comparison of machine translation applications
* Statistical machine translation
* Artificial intelligence
* Cache language model
* Computational linguistics
* Computer-assisted translation
* History of machine translation
* List of emerging technologies
* List of research laboratories for machine translation
* Neural machine translation
* Machine translation
* Universal translator
References
External links
* Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
* Statistical Machine Translation
* International Association for Machine Translation (IAMT)
* Machine Translation Archive by John Hutchins. An electronic repository (and bibliography) of articles, books and papers in the field of machine translation and computer-based translation technology
* Machine translation (computer-based translation) – Publications by John Hutchins (includes PDFs of several books on machine translation)