Machine translation, sometimes referred to by the abbreviation MT (not to be confused with

computer-aided translation Computer-aided translation (CAT), also referred to as computer-assisted translation or computer-aided human translation (CAHT), is the use of software to assist a human translator in the translation process. The translation is created by a huma ...

, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates the use of software to

translate Translation is the communication of the meaning of a source-language text by means of an equivalent target-language text. The English language draws a terminological distinction (which does not exist in every language) between ''transl ...

text or speech from one

language Language is a structured system of communication. The structure of a language is its grammar and the free components are its vocabulary. Languages are the primary means by which humans communicate, and may be conveyed through a variety of met ...

to another. On a basic level, MT performs mechanical substitution of words in one language for words in another, but that alone rarely produces a good translation because recognition of whole phrases and their closest counterparts in the target language is needed. Not all words in one language have equivalent words in another language, and many words have more than one meaning. Solving this problem with

corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...

statistical and neural techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of

idiom An idiom is a phrase or expression that typically presents a figurative, non-literal meaning attached to the phrase; but some phrases become figurative idioms while retaining the literal meaning of the phrase. Categorized as formulaic language, ...

s, and the isolation of anomalies. Current machine translation software often allows for customization by domain or

profession A profession is a field of work that has been successfully ''professionalized''. It can be defined as a disciplined group of individuals, '' professionals'', who adhere to ethical standards and who hold themselves out as, and are accepted by ...

(such as weather reports), improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than machine translation of conversation or less standardised text. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are proper names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is (e.g., weather reports). The progress and potential of machine translation have been much debated through its history. Since the 1950s, a number of scholars, first and most notably

Yehoshua Bar-Hillel Yehoshua Bar-Hillel ( he, יהושע בר-הלל; 8 September 1915, in Vienna – 25 September 1975, in Jerusalem) was an Israeli philosopher, mathematician, and linguist. He was a pioneer in the fields of machine translation and formal linguis ...

, have questioned the possibility of achieving fully automatic machine translation of high quality.

History

Origins

The origins of machine translation can be traced back to the work of

Al-Kindi Abū Yūsuf Yaʻqūb ibn ʼIsḥāq aṣ-Ṣabbāḥ al-Kindī (; ar, أبو يوسف يعقوب بن إسحاق الصبّاح الكندي; la, Alkindus; c. 801–873 AD) was an Arab Muslim philosopher, polymath, mathematician, physician ...

, a ninth-century Arabic

cryptographer Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...

who developed techniques for systemic language translation, including cryptanalysis, frequency analysis, and

probability Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...

and

statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...

, which are used in modern machine translation. The idea of machine translation later appeared in the 17th century. In 1629,

René Descartes René Descartes ( or ; ; Latinized: Renatus Cartesius; 31 March 1596 – 11 February 1650) was a French philosopher, scientist, and mathematician, widely considered a seminal figure in the emergence of modern philosophy and science. Mathem ...

proposed a universal language, with equivalent ideas in different tongues sharing one symbol. The idea of using digital computers for translation of natural languages was proposed as early as 1946 by England's A. D. Booth and

Warren Weaver Warren Weaver (July 17, 1894 – November 24, 1978) was an American scientist, mathematician, and science administrator. He is widely recognized as one of the pioneers of machine translation and as an important figure in creating support for scien ...

Rockefeller Foundation The Rockefeller Foundation is an American private foundation and philanthropic medical research and arts funding organization based at 420 Fifth Avenue, New York City. The second-oldest major philanthropic institution in America, after the Carneg ...

at the same time. "The memorandum written by

in 1949 is perhaps the single most influential publication in the earliest days of machine translation." Others followed. A demonstration was made in 1954 on the APEXC machine at Birkbeck College (

University of London The University of London (UoL; abbreviated as Lond or more rarely Londin in post-nominals) is a federal public research university located in London, England, United Kingdom. The university was established by royal charter in 1836 as a degree ...

) of a rudimentary translation of English into French. Several papers on the topic were published at the time, and even articles in popular journals (for example an article by Cleave and Zacharov in the September 1955 issue of ''

Wireless World ''Electronics World'' (''Wireless World'', founded in 1913, and in September 1984 renamed ''Electronics & Wireless World'') is a technical magazine in electronics and RF engineering aimed at professional design engineers. It is produced monthly in ...

''). A similar application, also pioneered at Birkbeck College at the time, was reading and composing

Braille Braille (Pronounced: ) is a tactile writing system used by people who are visually impaired, including people who are Blindness, blind, Deafblindness, deafblind or who have low vision. It can be read either on Paper embossing, embossed paper ...

texts by computer.

1950s

The first researcher in the field,

, began his research at MIT (1951). A

Georgetown University Georgetown University is a private university, private research university in the Georgetown (Washington, D.C.), Georgetown neighborhood of Washington, D.C. Founded by Bishop John Carroll (archbishop of Baltimore), John Carroll in 1789 as Georg ...

MT research team, led by Professor Michael Zarechnak, followed (1951) with a public demonstration of its Georgetown-IBM experiment system in 1954. MT research programs popped up in Japan and Russia (1955), and the first MT conference was held in London (1956).

David G. Hays David Glenn Hays (November 17, 1928 – July 26, 1995) was a linguist, computer scientist and social scientist best known for his early work in machine translation and computational linguistics. Career overview David Hays graduated from Harvard ...

"wrote about computer-assisted language processing as early as 1957" and "was project leader on computational linguistics at

Rand The RAND Corporation (from the phrase "research and development") is an American nonprofit global policy think tank created in 1948 by Douglas Aircraft Company to offer research and analysis to the United States Armed Forces. It is finan ...

from 1955 to 1968."

1960–1975

Researchers continued to join the field as the Association for Machine Translation and Computational Linguistics was formed in the U.S. (1962) and the National Academy of Sciences formed the Automatic Language Processing Advisory Committee (ALPAC) to study MT (1964). Real progress was much slower, however, and after the

ALPAC report ALPAC (Automatic Language Processing Advisory Committee) was a committee of seven scientists led by John R. Pierce, established in 1964 by the United States government in order to evaluate the progress in computational linguistics in general and ...

(1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced. According to a 1972 report by the Director of Defense Research and Engineering (DDR&E), the feasibility of large-scale MT was reestablished by the success of the Logos MT system in translating military manuals into Vietnamese during that conflict. The French Textile Institute also used MT to translate abstracts from and into French, English, German and Spanish (1970); Brigham Young University started a project to translate Mormon texts by automated translation (1971).

1975 and beyond

SYSTRAN, which "pioneered the field under contracts from the U.S. government" in the 1960s, was used by Xerox to translate technical manuals (1978). Beginning in the late 1980s, as

computation Computation is any type of arithmetic or non-arithmetic calculation that follows a well-defined model (e.g., an algorithm). Mechanical or electronic devices (or, historically, people) that perform computations are known as ''computers''. An es ...

al power increased and became less expensive, more interest was shown in statistical models for machine translation. MT became more popular after the advent of computers. SYSTRAN's first implementation system was implemented in 1988 by the online service of the French Postal Service called Minitel. Various computer based translation companies were also launched, including Trados (1984), which was the first to develop and market Translation Memory technology (1989), though this is not the same as MT. The first commercial MT system for Russian / English / German-Ukrainian was developed at Kharkov State University (1991). By 1998, "for as little as $29.95" one could "buy a program for translating in one direction between English and a major European language of your choice" to run on a PC. MT on the web started with SYSTRAN offering free translation of small texts (1996) and then providing this via AltaVista Babelfish, which racked up 500,000 requests a day (1997). The second free translation service on the web was

Lernout & Hauspie Lernout & Hauspie Speech Products, or L&H, was a Belgium-based speech recognition technology company, founded by Jo Lernout and Pol Hauspie, that went bankrupt in 2001 because of a fraud engineered by the management. The company was based in Ypr ...

's GlobaLink. ''Atlantic Magazine'' wrote in 1998 that "Systran's Babelfish and GlobaLink's Comprende" handled "Don't bank on it" with a "competent performance."

Franz Josef Och Franz Josef Och (November 2, 1971) is a German computer scientist. He is best known for being the chief architect of Google Translate. He has worked as a Director at Facebook. Prior to this, he was Head of Data Science at ''Grail'', (an Illumina c ...

(the future head of Translation Development AT Google) won DARPA's speed MT competition (2003). More innovations during this time included MOSES, the open-source statistical MT engine (2007), a text/SMS translation service for mobiles in Japan (2008), and a mobile phone with built-in speech-to-speech translation functionality for English, Japanese and Chinese (2009). In 2012, Google announced that

Google Translate Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, a mobile app for Android and iOS, and an API ...

translates roughly enough text to fill 1 million books in one day.

Translation process

The human

translation process Translation is the communication of the Meaning (linguistic), meaning of a #Source and target languages, source-language text by means of an Dynamic and formal equivalence, equivalent #Source and target languages, target-language text. The ...

may be described as: # Decoding the meaning of the

source text A source text is a text (sometimes oral) from which information or ideas are derived. In translation, a source text is the original text that is to be translated into another language. Description In historiography, distinctions are commonly m ...

; and # Re-

encoding In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication ...

this meaning in the target language. Behind this ostensibly simple procedure lies a complex cognitive operation. To decode the meaning of the

in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the

grammar In linguistics, the grammar of a natural language is its set of structure, structural constraints on speakers' or writers' composition of clause (linguistics), clauses, phrases, and words. The term can also refer to the study of such constraint ...

semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy Philosophy (f ...

syntax In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituency) ...

s, etc., of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language. Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that sounds as if it has been written by a person. Unless aided by a 'knowledge base' MT provides only a general, though imperfect, approximation of the original text, getting the "gist" of it (a process called "gisting"). This is sufficient for many purposes, including making best use of the finite and expensive time of a human translator, reserved for those cases in which total accuracy is indispensable.

Approaches

Direct translation and transfer translation pyramid

Machine translation can use a method based on linguistic rules, which means that words will be translated in a linguistic way – the most suitable (orally speaking) words of the target language will replace the ones in the source language. It is often argued that the success of machine translation requires the problem of

natural language understanding Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an ...

to be solved first. Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as

interlingual machine translation Interlingual machine translation is one of the classic approaches to machine translation. In this approach, the source language, i.e. the text to be translated is transformed into an interlingua, i.e., an abstract language-independent representa ...

transfer-based machine translation Transfer-based machine translation is a type of machine translation (MT). It is currently one of the most widely used methods of machine translation. In contrast to the simpler direct model of MT, transfer MT breaks translation into three steps: ...

. These methods require extensive lexicons with morphological,

syntactic In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure (constituency), ...

, and

semantic Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and comput ...

information, and large sets of rules. Given enough data, machine translation programs often work well enough for a

native speaker Native Speaker may refer to: * ''Native Speaker'' (novel), a 1995 novel by Chang-Rae Lee * ''Native Speaker'' (album), a 2011 album by Canadian band Braids * Native speaker, a person using their first language or mother tongue {{disambigua ...

of one language to get the approximate meaning of what is written by the other native speaker. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual

of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar methods need a skilled linguist to carefully design the grammar that they use. To translate between closely related languages, the technique referred to as

rule-based machine translation Rule-based machine translation (RBMT; "Classical Approach" of MT) is machine translation systems based on linguistic information about source and target languages basically retrieved from (unilingual, bilingual or multilingual) dictionaries and gram ...

may be used.

Rule-based

The rule-based machine translation paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation paradigms. This type of translation is used mostly in the creation of dictionaries and grammar programs. Unlike other methods, RBMT involves more information about the linguistics of the source and target languages, using the morphological and syntactic rules and semantic analysis of both languages. The basic approach involves linking the structure of the input sentence with the structure of the output sentence using a parser and an analyzer for the source language, a generator for the target language, and a transfer lexicon for the actual translation. RBMT's biggest downfall is that everything must be made explicit: orthographical variation and erroneous input must be made part of the source language analyser in order to cope with it, and lexical selection rules must be written for all instances of ambiguity. Adapting to new domains in itself is not that hard, as the core grammar is the same across domains, and the domain-specific adjustment is limited to lexical selection adjustment.

Transfer-based machine translation

Transfer-based machine translation is similar to

in that it creates a translation from an intermediate representation that simulates the meaning of the original sentence. Unlike interlingual MT, it depends partially on the language pair involved in the translation.

Interlingual

Interlingual machine translation is one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, is transformed into an interlingual language, i.e. a "language neutral" representation that is independent of any language. The target language is then generated out of the

interlingua Interlingua (; ISO 639 language codes ia, ina) is an international auxiliary language (IAL) developed between 1937 and 1951 by the American International Auxiliary Language Association (IALA). It ranks among the most widely used IALs and is t ...

. One of the major advantages of this system is that the interlingua becomes more valuable as the number of target languages it can be turned into increases. However, the only interlingual machine translation system that has been made operational at the commercial level is the KANT system (Nyberg and Mitamura, 1992), which is designed to translate Caterpillar Technical English (CTE) into other languages.

Dictionary-based

Machine translation can use a method based on

dictionary A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by radical and stroke for ideographic languages), which may include information on definitions, usage, etymologies ...

entries, which means that the words will be translated as they are by a dictionary.

Statistical

Statistical machine translation tries to generate translations using

statistical methods Statistics (from German: ''Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industria ...

based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament and EUROPARL, the record of the

European Parliament The European Parliament (EP) is one of the legislative bodies of the European Union and one of its seven institutions. Together with the Council of the European Union (known as the Council and informally as the Council of Ministers), it adopts ...

. Where such corpora are available, good results can be achieved translating similar texts, but such corpora are still rare for many language pairs. The first statistical machine translation software was

CANDIDE ( , ) is a French satire written by Voltaire, a philosopher of the Age of Enlightenment, first published in 1759. The novella has been widely translated, with English versions titled ''Candide: or, All for the Best'' (1759); ''Candide: or, The ...

from IBM. Google used SYSTRAN for several years, but switched to a statistical translation method in October 2007. In 2005, Google improved its internal translation capabilities by using approximately 200 billion words from United Nations materials to train their system; translation accuracy improved. Google Translate and similar statistical translation programs work by detecting patterns in hundreds of millions of documents that have previously been translated by humans and making intelligent guesses based on the findings. Generally, the more human-translated documents available in a given language, the more likely it is that the translation will be of good quality. Newer approaches into Statistical Machine translation such as METIS II and PRESEMT use minimal corpus size and instead focus on derivation of syntactic structure through pattern recognition. With further development, this may allow statistical machine translation to operate off of a monolingual text corpus. SMT's biggest downfall includes it being dependent upon huge amounts of parallel texts, its problems with morphology-rich languages (especially with translating ''into'' such languages), and its inability to correct singleton errors.

Example-based

Example-based machine translation (EBMT) approach was proposed by

Makoto Nagao was a Japanese computer scientist. He contributed to various fields: machine translation, natural language processing, pattern recognition, image processing and library science. He was the 23rd president of Kyoto University (1997–2003) and ...

in 1984. Example-based machine translation is based on the idea of analogy. In this approach, the corpus that is used is one that contains texts that have already been translated. Given a sentence that is to be translated, sentences from this corpus are selected that contain similar sub-sentential components. The similar sentences are then used to translate the sub-sentential components of the original sentence into the target language, and these phrases are put together to form a complete translation.

Hybrid MT

Hybrid machine translation (HMT) leverages the strengths of statistical and rule-based translation methodologies. Several MT organizations claim a hybrid approach that uses both rules and statistics. The approaches differ in a number of ways: * Rules post-processed by statistics: Translations are performed using a rules based engine. Statistics are then used in an attempt to adjust/correct the output from the rules engine. * Statistics guided by rules: Rules are used to pre-process data in an attempt to better guide the statistical engine. Rules are also used to post-process the statistical output to perform functions such as normalization. This approach has a lot more power, flexibility and control when translating. It also provides extensive control over the way in which the content is processed during both pre-translation (e.g. markup of content and non-translatable terms) and post-translation (e.g. post translation corrections and adjustments). More recently, with the advent of Neural MT, a new version of hybrid machine translation is emerging that combines the benefits of rules, statistical and neural machine translation. The approach allows benefitting from pre- and post-processing in a rule guided workflow as well as benefitting from NMT and SMT. The downside is the inherent complexity which makes the approach suitable only for specific use cases.

Neural MT

A deep learning-based approach to MT,

neural machine translation Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. Properties They requi ...

has made rapid progress in recent years, and Google has announced its translation services are now using this technology in preference over its previous statistical methods. A Microsoft team claimed to have reached human parity on WMT-2017 ("EMNLP 2017 Second Conference On Machine Translation") in 2018, marking a historical milestone. However, many researchers have criticized this claim, rerunning and discussing their experiments; current consensus is that the so-called human parity achieved is not real, being based wholly on limited domains, language pairs, and certain test suites i.e., it lacks statistical significance power. There is still a long journey before NMT reaches real human parity performances. To address the idiomatic phrase translation, multi-word expressions, and low-frequency words (also called OOV, or out-of-vocabulary word translation), language-focused linguistic features have been explored in state-of-the-art

(NMT) models. For instance, the Chinese character decompositions into radicals and strokes have proven to be helpful for translating multi-word expressions in NMT.

Major issues

Disambiguation

Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by

. He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word. Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches. Shallow approaches assume no knowledge of the text. They simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.

Claude Piron Claude Piron, also known by the pseudonym Johán Valano, was a Swiss psychologist, Esperantist, translator, and writer. He worked as a translator for the United Nations from 1956 to 1961 and then for the World Health Organization. He was a prolif ...

, a long-time translator for the United Nations and the

World Health Organization The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health. The WHO Constitution states its main objective as "the attainment by all peoples of the highest possible level of h ...

, wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve

ambiguities Ambiguity is the type of meaning in which a phrase, statement or resolution is not explicitly defined, making several interpretations plausible. A common aspect of ambiguity is uncertainty. It is thus an attribute of any idea or statement ...

in the

, which the

grammatical In linguistics, grammaticality is determined by the conformity to language usage as derived by the grammar of a particular variety (linguistics), speech variety. The notion of grammaticality rose alongside the theory of generative grammar, the go ...

and

lexical Lexical may refer to: Linguistics * Lexical corpus or lexis, a complete set of all words in a language * Lexical item, a basic unit of lexicographical classification * Lexicon, the vocabulary of a person, language, or branch of knowledge * Lex ...

exigencies of the target language require to be resolved: The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of AI than has yet been attained. A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would have a reasonable chance of guessing wrong fairly often. A shallow approach that involves "ask the user about each ambiguity" would, by Piron's estimate, only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.

Non-standard speech

One of the major pitfalls of MT is its inability to translate non-standard language with the same accuracy as standard language. Heuristic or statistical based MT takes input from various sources in standard form of a language. Rule-based translation, by nature, does not include common non-standard usages. This causes errors in translation from a vernacular source or into colloquial language. Limitations on translation from casual speech present issues in the use of machine translation in mobile devices.

Named entities

information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...

, named entities, in a narrow sense, refer to concrete or abstract entities in the real world such as people, organizations, companies, and places that have a proper name: George Washington, Chicago, Microsoft. It also refers to expressions of time, space and quantity such as 1 July 2011, $500. In the sentence "Smith is the president of Fabrionix" both ''Smith'' and ''Fabrionix'' are named entities, and can be further qualified via first name or other information; "president" is not, since Smith could have earlier held another position at Fabrionix, e.g. Vice President. The term

rigid designator In modal logic and the philosophy of language, a term is said to be a rigid designator or absolute substantial term when it designates (picks out, denotes, refers to) the same thing in ''all possible worlds'' in which that thing exists. A designat ...

is what defines these usages for analysis in statistical machine translation. Named entities must first be identified in the text; if not, they may be erroneously translated as common nouns, which would most likely not affect the

BLEU Bleu or BLEU may refer to: * the French word for blue * '' Three Colors: Blue'', a 1993 movie * BLEU (Bilingual Evaluation Understudy), a machine translation evaluation metric * Belgium–Luxembourg Economic Union * Blue cheese, a type of cheese ...

rating of the translation but would change the text's human readability. They may be omitted from the output translation, which would also have implications for the text's readability and message.

Transliteration Transliteration is a type of conversion of a text from one writing system, script to another that involves swapping Letter (alphabet), letters (thus ''wikt:trans-#Prefix, trans-'' + ''wikt:littera#Latin, liter-'') in predictable ways, such as ...

includes finding the letters in the target language that most closely correspond to the name in the source language. This, however, has been cited as sometimes worsening the quality of translation. For "Southern California" the first word should be translated directly, while the second word should be transliterated. Machines often transliterate both because they treated them as one entity. Words like these are hard for machine translators, even those with a transliteration component, to process. Use of a "do-not-translate" list, which has the same end goal – transliteration as opposed to translation. still relies on correct identification of named entities. A third approach is a class-based model. Named entities are replaced with a token to represent their "class"; "Ted" and "Erica" would both be replaced with "person" class token. Then the statistical distribution and use of person names, in general, can be analyzed instead of looking at the distributions of "Ted" and "Erica" individually, so that the probability of a given name in a specific language will not affect the assigned probability of a translation. A study by Stanford on improving this area of translation gives the examples that different probabilities will be assigned to "David is going for a walk" and "Ankit is going for a walk" for English as a target language due to the different number of occurrences for each name in the training data. A frustrating outcome of the same study by Stanford (and other attempts to improve named recognition translation) is that many times, a decrease in the

scores for translation will result from the inclusion of methods for named entity translation. Somewhat related are the phrases "drinking tea with milk" vs. "drinking tea with Molly."

Translation from multiparallel sources

Some work has been done in the utilization of multiparallel

corpora Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...

, that is a body of text that has been translated into 3 or more languages. Using these methods, a text that has been translated into 2 or more languages may be utilized in combination to provide a more accurate translation into a third language compared with if just one of those source languages were used alone.

Ontologies in MT

ontology In metaphysics, ontology is the philosophical study of being, as well as related concepts such as existence, becoming, and reality. Ontology addresses questions like how entities are grouped into categories and which of these entities exis ...

is a formal representation of knowledge that includes the concepts (such as objects, processes etc.) in a domain and some relations between them. If the stored information is of linguistic nature, one can speak of a lexicon.Vossen, Piek: ''Ontologies''. In: Mitkov, Ruslan (ed.) (2003): Handbook of Computational Linguistics, Chapter 25. Oxford: Oxford University Press. In NLP, ontologies can be used as a source of knowledge for machine translation systems. With access to a large knowledge base, systems can be enabled to resolve many (especially lexical) ambiguities on their own. In the following classic examples, as humans, we are able to interpret the

prepositional phrase An adpositional phrase, in linguistics, is a syntactic category that includes ''prepositional phrases'', ''postpositional phrases'', and ''circumpositional phrases''. Adpositional phrases contain an adposition (preposition, postposition, or circ ...

according to the context because we use our world knowledge, stored in our lexicons:

I saw a man/star/molecule with a microscope/telescope/binoculars.

A machine translation system initially would not be able to differentiate between the meanings because syntax does not change. With a large enough ontology as a source of knowledge however, the possible interpretations of ambiguous words in a specific context can be reduced. Other areas of usage for ontologies within NLP include information retrieval,

and text summarization.

Building ontologies

The ontology generated for the PANGLOSS knowledge-based machine translation system in 1993 may serve as an example of how an ontology for NLP purposes can be compiled: * A large-scale ontology is necessary to help parsing in the active modules of the machine translation system. * In the PANGLOSS example, about 50,000 nodes were intended to be subsumed under the smaller, manually-built ''upper'' (abstract) ''region'' of the ontology. Because of its size, it had to be created automatically. * The goal was to merge the two resources LDOCE online and

WordNet WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into '' synsets'' with short definition ...

to combine the benefits of both: concise definitions from Longman, and semantic relations allowing for semi-automatic taxonomization to the ontology from WordNet. ** A ''definition match''

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...

was created to automatically merge the correct meanings of ambiguous words between the two online resources, based on the words that the definitions of those meanings have in common in LDOCE and WordNet. Using a

similarity matrix In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such meas ...

, the algorithm delivered matches between meanings including a confidence factor. This algorithm alone, however, did not match all meanings correctly on its own. ** A second ''hierarchy match'' algorithm was therefore created which uses the taxonomic hierarchies found in WordNet (deep hierarchies) and partially in LDOCE (flat hierarchies). This works by first matching unambiguous meanings, then limiting the search space to only the respective ancestors and descendants of those matched meanings. Thus, the algorithm matched locally unambiguous meanings (for instance, while the word ''

seal Seal may refer to any of the following: Common uses * Pinniped, a diverse group of semi-aquatic marine mammals, many of which are commonly called seals, particularly: ** Earless seal, or "true seal" ** Fur seal * Seal (emblem), a device to imp ...

'' as such is ambiguous, there is only one meaning of ''

'' in the ''animal'' subhierarchy). * Both algorithms complemented each other and helped constructing a large-scale ontology for the machine translation system. The WordNet hierarchies, coupled with the matching definitions of LDOCE, were subordinated to the ontology's ''upper region''. As a result, the PANGLOSS MT system was able to make use of this knowledge base, mainly in its generation element.

Applications

While no system provides the holy grail of fully automatic high-quality machine translation of unrestricted text, many fully automated systems produce reasonable output. The quality of machine translation is substantially improved if the domain is restricted and controlled. Despite their inherent limitations, MT programs are used around the world. Probably the largest institutional user is the

European Commission The European Commission (EC) is the executive of the European Union (EU). It operates as a cabinet government, with 27 members of the Commission (informally known as "Commissioners") headed by a President. It includes an administrative body o ...

. The project, for example, coordinated by the

University of Gothenburg The University of Gothenburg ( sv, Göteborgs universitet) is a university in Sweden's second largest city, Gothenburg. Founded in 1891, the university is the third-oldest of the current Swedish universities and with 37,000 students and 6000 st ...

, received more than 2.375 million euros project support from the EU to create a reliable translation tool that covers a majority of the EU languages. The further development of MT systems comes at a time when budget cuts in human translation may increase the EU's dependency on reliable MT programs. The European Commission contributed 3.072 million euros (via its ISA programme) for the creation of MT@EC, a statistical machine translation program tailored to the administrative needs of the EU, to replace a previous rule-based machine translation system. In 2005,

Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...

claimed that promising results were obtained using a proprietary statistical machine translation engine. The statistical translation engine used in the Google language tools for Arabic <-> English and Chinese <-> English had an overall score of 0.4281 over the runner-up IBM's

-4 score of 0.3954 (Summer 2006) in tests conducted by the National Institute for Standards and Technology. With the recent focus on terrorism, the military sources in the United States have been investing significant amounts of money in natural language engineering. ''In-Q-Tel'' (a

venture capital Venture capital (often abbreviated as VC) is a form of private equity financing that is provided by venture capital firms or funds to start-up company, startups, early-stage, and emerging companies that have been deemed to have high growth poten ...

fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) brought up companies like Language Weaver. Currently the military community is interested in translation and processing of languages like Arabic machine translation, Arabic, Pashto language, Pashto, and Dari (Eastern Persian), Dari. Within these languages, the focus is on key phrases and quick communication between military members and civilians through the use of mobile phone apps. The Information Processing Technology Office in DARPA hosts programs like DARPA TIDES program, TIDES and Babylon translator. US Air Force has awarded a $1 million contract to develop a language translation technology. The notable rise of social networking on the web in recent years has created yet another niche for the application of machine translation software – in utilities such as Facebook, or instant messaging clients such as Skype, GoogleTalk, MSN Messenger, etc. – allowing users speaking different languages to communicate with each other. Machine translation applications have also been released for most mobile devices, including mobile telephones, pocket PCs, PDAs, etc. Due to their portability, such instruments have come to be designated as mobile translation tools enabling mobile business networking between partners speaking different languages, or facilitating both foreign language learning and unaccompanied traveling to foreign countries without the need of the intermediation of a human translator. Despite being labelled as an unworthy competitor to human translation in 1966 by the Automated Language Processing Advisory Committee put together by the United States government, the quality of machine translation has now been improved to such levels that its application in online collaboration and in the medical field are being investigated. The application of this technology in medical settings where human translators are absent is another topic of research, but difficulties arise due to the importance of accurate translations in medical diagnoses. Flaws in machine translation have also been noted for Humour in translation, their entertainment value. Two videos uploaded to YouTube in April 2017 involve two Japanese hiragana characters えぐ (''E (kana), e'' and ''Ku (kana), gu'') being repeatedly pasted into Google Translate, with the resulting translations quickly degrading into nonsensical phrases such as "DECEARING EGG" and "Deep-sea squeeze trees", which are then read in increasingly absurd voices; the full-length version of the video currently has 6.9 million views as of March 2022.

Evaluation

There are many factors that affect how machine translation systems are evaluated. These factors include the intended use of the translation, the nature of the machine translation software, and the nature of the translation process. Different programs may work well for different purposes. For example, statistical machine translation (SMT) typically outperforms example-based machine translation (EBMT), but researchers found that when evaluating English to French translation, EBMT performs better. The same concept applies for technical documents, which can be more easily translated by SMT because of their formal language. In certain applications, however, e.g., product descriptions written in a controlled language, a dictionary-based machine translation, dictionary-based machine-translation system has produced satisfactory translations that require no human intervention save for quality inspection. There are various means for evaluating the output quality of machine translation systems. The oldest is the use of human judges to assess a translation's quality. Even though human evaluation is time-consuming, it is still the most reliable method to compare different systems such as rule-based and statistical systems. Automated means of evaluation include

, NIST (metric), NIST, METEOR, and LEPOR. Relying exclusively on unedited machine translation ignores the fact that communication in natural language, human language is context-embedded and that it takes a person to comprehend the context (language use), context of the original text with a reasonable degree of probability. It is certainly true that even purely human-generated translations are prone to error. Therefore, to ensure that a machine-generated translation will be useful to a human being and that publishable-quality translation is achieved, such translations must be reviewed and edited by a human. The late

wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve

in the

, which the

and

exigencies of the target language require to be resolved. Such research is a necessary prelude to the pre-editing necessary in order to provide input for machine-translation software such that the output will not be garbage in garbage out, meaningless.See th
annually performed NIST tests since 2001
and Bilingual Evaluation Understudy In addition to disambiguation problems, decreased accuracy can occur due to varying levels of training data for machine translating programs. Both example-based and statistical machine translation rely on a vast array of real example sentences as a base for translation, and when too many or too few sentences are analyzed accuracy is jeopardized. Researchers found that when a program is trained on 203,529 sentence pairings, accuracy actually decreases. The optimal level of training data seems to be just over 100,000 sentences, possibly because as training data increases, the number of possible sentences increases, making it harder to find an exact translation match.

Using machine translation as a teaching tool

Although there have been concerns about machine translation's accuracy, Dr. Ana Nino of the University of Manchester has researched some of the advantages in utilizing machine translation in the classroom. One such pedagogical method is called using "MT as a Bad Model."Nino, Ana.
Machine Translation in Foreign Language Learning: Language Learners' and Tutors' Perceptions of Its Advantages and Disadvantages
ReCALL: the Journal of EUROCALL 21.2 (May 2009) 241–258. MT as a Bad Model forces the language learner to identify inconsistencies or incorrect aspects of a translation; in turn, the individual will (hopefully) possess a better grasp of the language. Dr. Nino cites that this teaching tool was implemented in the late 1980s. At the end of various semesters, Dr. Nino was able to obtain survey results from students who had used MT as a Bad Model (as well as other models.) Overwhelmingly, students felt that they had observed improved comprehension, lexical retrieval, and increased confidence in their target language.

Machine translation and signed languages

In the early 2000s, options for machine translation between spoken and signed languages were severely limited. It was a common belief that deaf individuals could use traditional translators. However, stress, intonation, pitch, and timing are conveyed much differently in spoken languages compared to signed languages. Therefore, a deaf individual may misinterpret or become confused about the meaning of written text that is based on a spoken language.Zhao, L., Kipper, K., Schuler, W., Vogler, C., & Palmer, M. (2000)
A Machine Translation System from English to American Sign Language
. Lecture Notes in Computer Science, 1934: 54–67. Researchers Zhao, et al. (2000), developed a prototype called TEAM (translation from English to ASL by machine) that completed English to American Sign Language (ASL) translations. The program would first analyze the syntactic, grammatical, and morphological aspects of the English text. Following this step, the program accessed a sign synthesizer, which acted as a dictionary for ASL. This synthesizer housed the process one must follow to complete ASL signs, as well as the meanings of these signs. Once the entire text is analyzed and the signs necessary to complete the translation are located in the synthesizer, a computer generated human appeared and would use ASL to sign the English text to the user.

Copyright

Only creative work, works that are originality, original are subject to copyright protection, so some scholars claim that machine translation results are not entitled to copyright protection because MT does not involve creativity. The copyright at issue is for a derivative work; the author of the originality, original work in the original language does not lose his rights when a work is translated: a translator must have permission to publishing, publish a translation.

Notes

External links

The Advantages and Disadvantages of Machine Translation

International Association for Machine Translation (IAMT)

Machine Translation Archive
by W. John Hutchins, John Hutchins. An electronic repository (and bibliography) of articles, books and papers in the field of machine translation and computer-based translation technology
Machine translation (computer-based translation)
– Publications by John Hutchins (includes PDF format, PDFs of several books on machine translation)
Machine Translation and Minority Languages

Slator News & analysis of the latest developments in machine translation
{{Authority control Machine translation, Applications of artificial intelligence Computational linguistics Computer-assisted translation Tasks of natural language processing

History

Origins

1950s

1960–1975

1975 and beyond

Translation process

Approaches

Rule-based

Transfer-based machine translation

Interlingual

Dictionary-based

Statistical

Example-based

Hybrid MT

Neural MT

Major issues

Disambiguation

Non-standard speech

Named entities

Translation from multiparallel sources

Ontologies in MT

Building ontologies

Applications

Evaluation

Using machine translation as a teaching tool

Machine translation and signed languages

Copyright

See also

Notes

Further reading

External links