computer translation

Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

On a basic level, MT performs mechanical substitution of words in one language for words in another, but that alone rarely produces a good translation because recognition of whole phrases and their closest counterparts in the target language is needed. Not all words in one language have equivalent words in another language, and many words have more than one meaning. Solving this problem with corpus-based statistical and neural techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies.

Current machine translation software often allows for customization by domain or profession (such as weather reports), improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than machine translation of conversation or less standardised text. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are proper names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is (e.g., weather reports).

The progress and potential of machine translation have been much debated through its history. Since the 1950s, a number of scholars, first and most notably Yehoshua Bar-Hillel, have questioned the possibility of achieving fully automatic machine translation of high quality.


History


Origins

The origins of machine translation can be traced back to the work of Al-Kindi, a ninth-century Arab cryptographer who developed techniques for systemic language translation, including cryptanalysis, frequency analysis, and probability and statistics, which are used in modern machine translation. The idea of machine translation later appeared in the 17th century. In 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol.

The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth in England and, at the same time, by Warren Weaver at the Rockefeller Foundation. "The memorandum written by Warren Weaver in 1949 is perhaps the single most influential publication in the earliest days of machine translation." Others followed. A demonstration of a rudimentary translation of English into French was made in 1954 on the APEXC machine at Birkbeck College (University of London). Several papers on the topic were published at the time, as were articles in popular journals (for example an article by Cleave and Zacharov in the September 1955 issue of ''Wireless World''). A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.


1950s

The first researcher in the field, Yehoshua Bar-Hillel, began his research at MIT (1951). A Georgetown University MT research team, led by Professor Michael Zarechnak, followed (1951) with a public demonstration of its Georgetown-IBM experiment system in 1954. MT research programs were started in Japan and Russia (1955), and the first MT conference was held in London (1956). David G. Hays "wrote about computer-assisted language processing as early as 1957" and "was project leader on computational linguistics at Rand from 1955 to 1968."


1960–1975

Researchers continued to join the field as the Association for Machine Translation and Computational Linguistics was formed in the U.S. (1962) and the National Academy of Sciences formed the Automatic Language Processing Advisory Committee (ALPAC) to study MT (1964). Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced. According to a 1972 report by the Director of Defense Research and Engineering (DDR&E), the feasibility of large-scale MT was reestablished by the success of the Logos MT system in translating military manuals into Vietnamese during the Vietnam War. The French Textile Institute also used MT to translate abstracts from and into French, English, German and Spanish (1970); Brigham Young University started a project to translate Mormon texts by automated translation (1971).


1975 and beyond

SYSTRAN, which "pioneered the field under contracts from the U.S. government" in the 1960s, was used by Xerox to translate technical manuals (1978). Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation. MT became more popular after the advent of computers. SYSTRAN's first implementation was deployed in 1988 by Minitel, the online service of the French Postal Service. Various computer-based translation companies were also launched, including Trados (1984), which was the first to develop and market Translation Memory technology (1989), though this is not the same as MT. The first commercial MT system for Russian / English / German-Ukrainian was developed at Kharkov State University (1991). By 1998, "for as little as $29.95" one could "buy a program for translating in one direction between English and a major European language of your choice" to run on a PC.

MT on the web started with SYSTRAN offering free translation of small texts (1996) and then providing this via AltaVista Babelfish, which received 500,000 requests a day (1997). The second free translation service on the web was Lernout & Hauspie's GlobaLink. ''Atlantic Magazine'' wrote in 1998 that "Systran's Babelfish and GlobaLink's Comprende" handled "Don't bank on it" with a "competent performance."

Franz Josef Och (the future head of Translation Development at Google) won DARPA's speed MT competition (2003). More innovations during this time included MOSES, the open-source statistical MT engine (2007), a text/SMS translation service for mobiles in Japan (2008), and a mobile phone with built-in speech-to-speech translation functionality for English, Japanese and Chinese (2009). In 2012, Google announced that Google Translate translates roughly enough text to fill 1 million books in one day.


Translation process

The human translation process may be described as:
# Decoding the meaning of the source text; and
# Re-encoding this meaning in the target language.
Behind this ostensibly simple procedure lies a complex cognitive operation. To decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language.

Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that sounds as if it has been written by a person. Unless aided by a 'knowledge base', MT provides only a general, though imperfect, approximation of the original text, getting the "gist" of it (a process called "gisting"). This is sufficient for many purposes, including making best use of the finite and expensive time of a human translator, reserved for those cases in which total accuracy is indispensable.


Approaches

Machine translation can use a method based on linguistic rules, which means that words will be translated in a linguistic way – the most suitable (orally speaking) words of the target language will replace the ones in the source language. It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first.

Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation. These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules.

Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what is written by the other native speaker. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar methods need a skilled linguist to carefully design the grammar that they use. To translate between closely related languages, the technique referred to as rule-based machine translation may be used.


Rule-based

The rule-based machine translation (RBMT) paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation paradigms. This type of translation is used mostly in the creation of dictionaries and grammar programs. Unlike other methods, RBMT involves more information about the linguistics of the source and target languages, using the morphological and syntactic rules and semantic analysis of both languages. The basic approach involves linking the structure of the input sentence with the structure of the output sentence using a parser and an analyzer for the source language, a generator for the target language, and a transfer lexicon for the actual translation. RBMT's biggest weakness is that everything must be made explicit: orthographical variation and erroneous input must be handled by the source-language analyser, and lexical selection rules must be written for all instances of ambiguity. Adapting to new domains is in itself not that hard, as the core grammar is the same across domains, and the domain-specific adjustment is limited to lexical selection adjustment.
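
The analyzer, transfer lexicon, and generator described above can be sketched in a few lines. The following is only a toy illustration under strong assumptions: the two lexicons, the single reordering rule, and every function name are invented for this example and do not come from any real RBMT system. It also shows the weakness mentioned above, since a word missing from the lexicons makes the pipeline fail outright.

```python
# Toy English -> Spanish transfer pipeline (illustrative assumptions only).

# Source-language analysis lexicon: word -> part of speech.
ANALYSIS_LEXICON = {"the": "DET", "red": "ADJ", "house": "NOUN"}

# Transfer lexicon: source word -> target word. Gender agreement is
# hard-coded here; a real system would need explicit agreement rules.
TRANSFER_LEXICON = {"the": "la", "red": "roja", "house": "casa"}


def analyze(sentence):
    """Tag each token; an unknown word raises KeyError (no implicit coverage)."""
    return [(tok, ANALYSIS_LEXICON[tok]) for tok in sentence.lower().split()]


def transfer(tagged):
    """Apply one structural rule (ADJ NOUN -> NOUN ADJ), then map the words."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return [(TRANSFER_LEXICON[word], tag) for word, tag in out]


def generate(tagged):
    """Linearize the target-side structure into a string."""
    return " ".join(word for word, _ in tagged)


def translate(sentence):
    return generate(transfer(analyze(sentence)))


print(translate("the red house"))  # la casa roja
```

Calling `translate("the blue house")` raises `KeyError`, because "blue" was never entered in the lexicons.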


Transfer-based machine translation

Transfer-based machine translation is similar to interlingual machine translation in that it creates a translation from an intermediate representation that simulates the meaning of the original sentence. Unlike interlingual MT, it depends partially on the language pair involved in the translation.


Interlingual

Interlingual machine translation is one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, is transformed into an interlingua, i.e. a "language neutral" representation that is independent of any language. The target language is then generated out of the interlingua. One of the major advantages of this system is that the interlingua becomes more valuable as the number of target languages it can be turned into increases. However, the only interlingual machine translation system that has been made operational at the commercial level is the KANT system (Nyberg and Mitamura, 1992), which is designed to translate Caterpillar Technical English (CTE) into other languages.
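
One common way to illustrate why an interlingua gains value with more languages is a simple count of components. The arithmetic below is a standard back-of-the-envelope argument rather than something stated in the text above, and it assumes one analyzer and one generator per language for the interlingual design versus one transfer module per ordered language pair for a transfer design. The function names are hypothetical.

```python
def transfer_modules(n_languages):
    """Transfer design: one module per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """Interlingual design: one analyzer plus one generator per language."""
    return 2 * n_languages


for n in (2, 5, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
```

Under these assumptions the two designs need the same number of components at three languages, and the interlingual design needs fewer from four languages onward.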


Dictionary-based

Machine translation can use a method based on dictionary entries, which means that the words will be translated as they are by a dictionary.
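
A minimal sketch of word-by-word dictionary lookup follows. The tiny English-to-French dictionary and the function name are assumptions made for illustration. The example also shows why substitution alone rarely gives a good translation: an idiom is rendered literally.

```python
# Hypothetical English -> French word list, for illustration only.
DICTIONARY = {
    "it": "il",
    "rains": "pleut",
    "cats": "chats",
    "and": "et",
    "dogs": "chiens",
}


def translate_word_by_word(sentence):
    """Replace each word with its dictionary entry; keep unknown words as-is."""
    return " ".join(DICTIONARY.get(word, word) for word in sentence.lower().split())


print(translate_word_by_word("it rains cats and dogs"))  # il pleut chats et chiens
```

The output is a literal rendering that loses the idiomatic meaning ("it is raining heavily"), because no entry covers the phrase as a whole.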


Statistical

Statistical machine translation tries to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament, and EUROPARL, the record of the European Parliament. Where such corpora are available, good results can be achieved translating similar texts, but such corpora are still rare for many language pairs. The first statistical machine translation software was CANDIDE from IBM. Google used SYSTRAN for several years, but switched to a statistical translation method in October 2007. In 2005, Google improved its internal translation capabilities by using approximately 200 billion words from United Nations materials to train their system; translation accuracy improved.

Google Translate and similar statistical translation programs work by detecting patterns in hundreds of millions of documents that have previously been translated by humans and making intelligent guesses based on the findings. Generally, the more human-translated documents available in a given language, the more likely it is that the translation will be of good quality. Newer approaches to statistical machine translation such as METIS II and PRESEMT use minimal corpus size and instead focus on derivation of syntactic structure through pattern recognition. With further development, this may allow statistical machine translation to operate off of a monolingual text corpus. SMT's main weaknesses include its dependence on huge amounts of parallel text, its problems with morphology-rich languages (especially with translating ''into'' such languages), and its inability to correct singleton errors.
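
To give a feel for how patterns can be detected in human-translated text, here is a minimal sketch of IBM Model 1, a classic word-alignment model trained with expectation-maximization. It is chosen purely as an illustration and is not claimed to be what CANDIDE, Google, or any system named above uses. The three-sentence German-English corpus is made up, and the sketch omits the NULL source word that the full model includes.

```python
from collections import defaultdict

# Tiny made-up parallel corpus: (source tokens, target tokens).
CORPUS = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word)."""
    t = defaultdict(lambda: 1.0)  # uniform start; normalized in the E-step
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(tw, sw): c / total[sw]
                                for (tw, sw), c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


model = train_ibm_model1(CORPUS)
for word in ("das", "Haus", "Buch", "ein"):
    print(word, "->", best_translation(model, word))
```

Even though no word alignments are given, the model learns that "Haus" pairs with "house" because "das" is better explained by "the", which it co-occurs with in two sentences. With so little data the estimates are fragile, which reflects the dependence on large parallel corpora noted above.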


Example-based

The example-based machine translation (EBMT) approach was proposed by Makoto Nagao in 1984. Example-based machine translation is based on the idea of analogy. In this approach, the corpus that is used is one that contains texts that have already been translated. Given a sentence that is to be translated, sentences from this corpus are selected that contain similar sub-sentential components. The similar sentences are then used to translate the sub-sentential components of the original sentence into the target language, and these phrases are put together to form a complete translation.
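
The retrieve-and-recombine idea can be sketched as follows. This is a heavily simplified illustration, not Nagao's method: it assumes the stored examples are aligned word-for-word by position, uses Python's `difflib` as a stand-in similarity measure, and relies on a small invented word lexicon for the parts that differ.

```python
import difflib

# Made-up translated examples, assumed to be aligned position by position.
EXAMPLES = [
    (["he", "reads", "a", "book"], ["il", "lit", "un", "livre"]),
    (["she", "sees", "a", "newspaper"], ["elle", "voit", "un", "journal"]),
]

# Hypothetical lexicon for the fragments that differ from the example.
WORD_LEXICON = {"book": "livre", "newspaper": "journal"}


def translate_by_example(sentence):
    tokens = sentence.split()
    # 1. Retrieve the stored example most similar to the input.
    src, tgt = max(
        EXAMPLES,
        key=lambda ex: difflib.SequenceMatcher(None, ex[0], tokens).ratio(),
    )
    # 2. Reuse its translation, replacing only the fragments that differ.
    result = list(tgt)
    matcher = difflib.SequenceMatcher(None, src, tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace" and i2 - i1 == j2 - j1:
            for offset in range(i2 - i1):
                result[i1 + offset] = WORD_LEXICON[tokens[j1 + offset]]
        elif op != "equal":
            raise ValueError("no sufficiently similar example")
    return " ".join(result)


print(translate_by_example("he reads a newspaper"))  # il lit un journal
```

The input matches the first example except for its last word, so the stored translation is reused and only "livre" is swapped for "journal".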


Hybrid MT

Hybrid machine translation (HMT) leverages the strengths of statistical and rule-based translation methodologies. Several MT organizations claim a hybrid approach that uses both rules and statistics. The approaches differ in a number of ways:
* Rules post-processed by statistics: Translations are performed using a rules-based engine. Statistics are then used in an attempt to adjust/correct the output from the rules engine.
* Statistics guided by rules: Rules are used to pre-process data in an attempt to better guide the statistical engine. Rules are also used to post-process the statistical output to perform functions such as normalization. This approach has a lot more power, flexibility and control when translating. It also provides extensive control over the way in which the content is processed during both pre-translation (e.g. markup of content and non-translatable terms) and post-translation (e.g. post-translation corrections and adjustments).
More recently, with the advent of Neural MT, a new version of hybrid machine translation is emerging that combines the benefits of rules, statistical and neural machine translation. The approach makes it possible to benefit from pre- and post-processing in a rule-guided workflow as well as from NMT and SMT. The downside is the inherent complexity, which makes the approach suitable only for specific use cases.
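
The "statistics guided by rules" workflow can be sketched as a wrapper around an engine. Everything here is an assumption for illustration: the protected-term list, the placeholder format, and especially `stub_engine`, which is a trivial word substitution standing in for a real statistical or neural engine.

```python
import re

# Hypothetical list of terms that must not be translated.
PROTECTED_TERMS = ["Apple"]


def protect(text):
    """Rule-based pre-processing: mask non-translatable terms."""
    mapping = {}
    for idx, term in enumerate(PROTECTED_TERMS):
        placeholder = f"__TERM{idx}__"
        if term in text:
            text = text.replace(term, placeholder)
            mapping[placeholder] = term
    return text, mapping


def restore(text, mapping):
    """Rule-based post-processing: unmask terms and normalize whitespace."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return re.sub(r"\s+", " ", text).strip()


def stub_engine(text):
    """Stand-in for a statistical engine: naive case-insensitive lookup."""
    table = {"apple": "pomme", "sells": "vend"}
    return " ".join(table.get(word.lower(), word) for word in text.split())


def hybrid_translate(text, engine):
    masked, mapping = protect(text)
    return restore(engine(masked), mapping)


print(stub_engine("Apple sells fruit"))                    # pomme vend fruit
print(hybrid_translate("Apple sells fruit", stub_engine))  # Apple vend fruit
```

Without the masking step the stub translates the company name as a common noun; with it, the name passes through untouched. A real engine could also alter or drop the placeholders, which is one source of the complexity mentioned above.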


Neural MT

A deep learning-based approach to MT, neural machine translation has made rapid progress in recent years, and Google has announced its translation services are now using this technology in preference to its previous statistical methods. A Microsoft team claimed to have reached human parity on WMT-2017 ("EMNLP 2017 Second Conference On Machine Translation") in 2018, marking a historical milestone. However, many researchers have criticized this claim, rerunning and discussing their experiments; current consensus is that the so-called human parity achieved is not real, being based wholly on limited domains, language pairs, and certain test suites, that is, the result lacks statistical significance. There is still a long journey before NMT reaches real human parity performance.

To address idiomatic phrase translation, multi-word expressions, and low-frequency words (also called OOV, or out-of-vocabulary, word translation), language-focused linguistic features have been explored in state-of-the-art neural machine translation (NMT) models. For instance, the decomposition of Chinese characters into radicals and strokes has proven to be helpful for translating multi-word expressions in NMT.


Major issues


Disambiguation

Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by Yehoshua Bar-Hillel. He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word. Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches. Shallow approaches assume no knowledge of the text. They simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.

Claude Piron, a long-time translator for the United Nations and the World Health Organization, wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve ambiguities in the source text, which the grammatical and lexical exigencies of the target language require to be resolved.

The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of AI than has yet been attained. A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would have a reasonable chance of guessing wrong fairly often. A shallow approach that involves "ask the user about each ambiguity" would, by Piron's estimate, only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.
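
A shallow approach of the kind described above, applying statistics to the words around the ambiguous word, might look like the following sketch. The sense-labelled sentences, the stopword list, and the French renderings are all invented for illustration, and the scoring is plain co-occurrence counting rather than any particular published method.

```python
from collections import Counter

# Made-up sense-labelled examples for the ambiguous English word "bank".
TRAINING = [
    ("river", "we sat on the bank of the river"),
    ("river", "the river bank was muddy"),
    ("finance", "the bank approved my loan"),
    ("finance", "she has an account at the bank"),
]
STOPWORDS = {"the", "a", "an", "of", "on", "at", "for", "my", "was",
             "we", "she", "he", "has"}
TRANSLATIONS = {"river": "rive", "finance": "banque"}  # assumed French senses


def context_words(sentence, target):
    """Words around the target, ignoring stopwords and the target itself."""
    return [w for w in sentence.lower().split()
            if w != target and w not in STOPWORDS]


def train(examples, target="bank"):
    model = {}
    for sense, sentence in examples:
        model.setdefault(sense, Counter()).update(context_words(sentence, target))
    return model


def disambiguate(sentence, model, target="bank"):
    """Pick the sense whose training contexts overlap most with this one."""
    words = context_words(sentence, target)
    return max(model, key=lambda sense: sum(model[sense][w] for w in words))


model = train(TRAINING)
sense = disambiguate("he asked the bank for a loan", model)
print(sense, "->", TRANSLATIONS[sense])  # finance -> banque
```

When no context word has been seen before, every sense scores zero and the choice is arbitrary, which is one way such a guesser ends up wrong fairly often.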


Non-standard speech

One of the major pitfalls of MT is its inability to translate non-standard language with the same accuracy as standard language. Heuristic or statistics-based MT takes input from various sources in the standard form of a language. Rule-based translation, by nature, does not include common non-standard usages. This causes errors in translation from a vernacular source or into colloquial language. Limitations on translation from casual speech present issues in the use of machine translation in mobile devices.


Named entities

In
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
, named entities, in a narrow sense, refer to concrete or abstract entities in the real world such as people, organizations, companies, and places that have a proper name: George Washington, Chicago, Microsoft. It also refers to expressions of time, space and quantity such as 1 July 2011, $500. In the sentence "Smith is the president of Fabrionix" both ''Smith'' and ''Fabrionix'' are named entities, and can be further qualified via first name or other information; "president" is not, since Smith could have earlier held another position at Fabrionix, e.g. Vice President. The term
rigid designator In modal logic and the philosophy of language, a term is said to be a rigid designator or absolute substantial term when it designates (picks out, denotes, refers to) the same thing in ''all possible worlds'' in which that thing exists. A designat ...
is what defines these usages for analysis in statistical machine translation. Named entities must first be identified in the text; if not, they may be erroneously translated as common nouns, which would most likely not affect the
BLEU Bleu or BLEU may refer to: * the French word for blue * '' Three Colors: Blue'', a 1993 movie * BLEU (Bilingual Evaluation Understudy), a machine translation evaluation metric * Belgium–Luxembourg Economic Union * Blue cheese, a type of cheese ...
rating of the translation but would change the text's human readability. They may be omitted from the output translation, which would also have implications for the text's readability and message.
Transliteration includes finding the letters in the target language that most closely correspond to the name in the source language. This, however, has been cited as sometimes worsening the quality of translation. For "Southern California" the first word should be translated directly, while the second word should be transliterated. Machines often transliterate both because they treat them as one entity. Words like these are hard for machine translators, even those with a transliteration component, to process. Use of a "do-not-translate" list, which has the same end goal (transliteration as opposed to translation), still relies on correct identification of named entities. A third approach is a class-based model. Named entities are replaced with a token to represent their "class"; "Ted" and "Erica" would both be replaced with a "person" class token. Then the statistical distribution and use of person names in general can be analyzed instead of looking at the distributions of "Ted" and "Erica" individually, so that the probability of a given name in a specific language will not affect the assigned probability of a translation. A study by Stanford on improving this area of translation gives the example that different probabilities will be assigned to "David is going for a walk" and "Ankit is going for a walk" for English as a target language due to the different number of occurrences of each name in the training data. A frustrating outcome of the same study (and of other attempts to improve named-entity translation) is that including methods for named entity translation often results in a decrease in BLEU scores. Somewhat related are the phrases "drinking tea with milk" vs. "drinking tea with Molly."
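The class-based idea described above can be illustrated with a minimal Python sketch. It is not the method of the Stanford study: the name list `PERSON_NAMES`, the `<PERSON>` token, and both function names are assumptions made purely for illustration, and a real system would rely on a named-entity recognizer rather than a fixed list.

```python
# Hypothetical toy gazetteer; a real system would use a named-entity recognizer.
PERSON_NAMES = {"Ted", "Erica", "David", "Ankit"}
CLASS_TOKEN = "<PERSON>"  # assumed class token, chosen for this sketch


def mask_entities(sentence):
    """Replace known person names with a class token and remember the originals."""
    masked, originals = [], []
    for tok in sentence.split():
        core = tok.strip(".,!?")  # ignore trailing punctuation when matching
        if core in PERSON_NAMES:
            originals.append(core)
            masked.append(tok.replace(core, CLASS_TOKEN))
        else:
            masked.append(tok)
    return " ".join(masked), originals


def unmask_entities(sentence, originals):
    """Put the remembered names back, in order, after translation."""
    out = sentence
    for name in originals:
        out = out.replace(CLASS_TOKEN, name, 1)
    return out


masked_a, names_a = mask_entities("David is going for a walk.")
masked_b, names_b = mask_entities("Ankit is going for a walk.")
print(masked_a)
print(masked_a == masked_b)
```

After masking, both example sentences become the same string, so a model scoring the masked form would no longer distinguish them by how often each name occurs in its training data.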


Translation from multiparallel sources

Some work has been done in the utilization of multiparallel corpora, that is, bodies of text that have been translated into three or more languages. Using these methods, a text that has been translated into two or more languages may be utilized in combination to provide a more accurate translation into a third language than if just one of those source languages were used alone.


Ontologies in MT

An ontology is a formal representation of knowledge that includes the concepts (such as objects, processes etc.) in a domain and some relations between them. If the stored information is of linguistic nature, one can speak of a lexicon. (Vossen, Piek: ''Ontologies''. In: Mitkov, Ruslan (ed.) (2003): Handbook of Computational Linguistics, Chapter 25. Oxford: Oxford University Press.) In NLP, ontologies can be used as a source of knowledge for machine translation systems. With access to a large knowledge base, systems can be enabled to resolve many (especially lexical) ambiguities on their own. In the following classic examples, as humans, we are able to interpret the prepositional phrase according to the context because we use our world knowledge, stored in our lexicons:
I saw a man/star/molecule with a microscope/telescope/binoculars.
A machine translation system initially would not be able to differentiate between the meanings because the syntax does not change. With a large enough ontology as a source of knowledge, however, the possible interpretations of ambiguous words in a specific context can be reduced. Other areas of usage for ontologies within NLP include information retrieval, information extraction and text summarization.
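To make the role of world knowledge concrete, here is a small Python sketch of how an ontology might narrow down the readings of the prepositional phrase in the example sentence. The two relations `CAN_OBSERVE` and `CAN_HOLD`, their contents, and the function name are invented toy assumptions, not taken from any actual ontology or MT system.

```python
# Hypothetical toy ontology: which instruments can plausibly be used to
# observe which kinds of object, and which objects can plausibly hold one.
CAN_OBSERVE = {
    "microscope": {"molecule"},
    "telescope": {"star", "man"},
    "binoculars": {"star", "man"},
}
CAN_HOLD = {
    "man": {"microscope", "telescope", "binoculars"},
}


def pp_readings(obj, instrument):
    """Return the plausible readings of 'I saw a <obj> with a <instrument>'."""
    readings = set()
    if obj in CAN_OBSERVE.get(instrument, set()):
        readings.add("instrument")  # the seeing was done by means of the instrument
    if instrument in CAN_HOLD.get(obj, set()):
        readings.add("modifier")    # the object seen has the instrument
    return readings


for obj, instrument in [("star", "telescope"), ("man", "microscope"), ("man", "telescope")]:
    print(obj, instrument, sorted(pp_readings(obj, instrument)))
```

With these toy facts, "star ... telescope" keeps only the instrument reading and "man ... microscope" only the modifier reading, while "man ... telescope" stays ambiguous. The ontology reduces the interpretations but does not always eliminate the ambiguity.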


Building ontologies

The ontology generated for the PANGLOSS knowledge-based machine translation system in 1993 may serve as an example of how an ontology for NLP purposes can be compiled:
* A large-scale ontology is necessary to help parsing in the active modules of the machine translation system.
* In the PANGLOSS example, about 50,000 nodes were intended to be subsumed under the smaller, manually built ''upper'' (abstract) ''region'' of the ontology. Because of its size, it had to be created automatically.
* The goal was to merge the two resources LDOCE online and WordNet to combine the benefits of both: concise definitions from Longman, and semantic relations allowing for semi-automatic taxonomization to the ontology from WordNet.
** A ''definition match'' algorithm was created to automatically merge the correct meanings of ambiguous words between the two online resources, based on the words that the definitions of those meanings have in common in LDOCE and WordNet. Using a similarity matrix, the algorithm delivered matches between meanings including a confidence factor. This algorithm alone, however, did not match all meanings correctly.
** A second ''hierarchy match'' algorithm was therefore created which uses the taxonomic hierarchies found in WordNet (deep hierarchies) and partially in LDOCE (flat hierarchies). This works by first matching unambiguous meanings, then limiting the search space to only the respective ancestors and descendants of those matched meanings. Thus, the algorithm matched locally unambiguous meanings (for instance, while the word ''seal'' as such is ambiguous, there is only one meaning of ''seal'' in the ''animal'' subhierarchy).
* Both algorithms complemented each other and helped construct a large-scale ontology for the machine translation system. The WordNet hierarchies, coupled with the matching definitions of LDOCE, were subordinated to the ontology's ''upper region''. As a result, the PANGLOSS MT system was able to make use of this knowledge base, mainly in its generation element.
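The definition-match step in the list above can be sketched roughly as follows. This is not the PANGLOSS algorithm itself: the sense identifiers and definitions are invented stand-ins rather than real LDOCE or WordNet entries, the stop-word list is ad hoc, and the use of Jaccard overlap as the similarity score and confidence factor is an assumption made for illustration.

```python
# Ad hoc stop-word list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "or", "to", "in", "on", "and", "that", "with", "for"}


def content_words(definition):
    """Lower-cased words of a definition, minus punctuation and stop words."""
    words = {w.strip(".,;()").lower() for w in definition.split()}
    return words - STOPWORDS - {""}


def similarity_matrix(senses_a, senses_b):
    """Jaccard overlap between every pair of definitions from the two resources."""
    matrix = {}
    for id_a, def_a in senses_a.items():
        words_a = content_words(def_a)
        for id_b, def_b in senses_b.items():
            words_b = content_words(def_b)
            union = words_a | words_b
            matrix[(id_a, id_b)] = len(words_a & words_b) / len(union) if union else 0.0
    return matrix


def best_matches(senses_a, senses_b):
    """For each sense in resource A, pick the best sense in B plus a confidence."""
    matrix = similarity_matrix(senses_a, senses_b)
    matches = {}
    for id_a in senses_a:
        id_b = max(senses_b, key=lambda b: matrix[(id_a, b)])
        matches[id_a] = (id_b, matrix[(id_a, id_b)])
    return matches


# Invented example definitions for two senses of "seal" in each resource.
resource_a = {
    "seal_1": "a large sea animal that eats fish and lives near the coast",
    "seal_2": "an official mark pressed into wax on a document",
}
resource_b = {
    "seal.animal": "a sea animal with flippers that eats fish",
    "seal.stamp": "a mark or stamp pressed on a document to show it is official",
}

matches = best_matches(resource_a, resource_b)
for sense, (target, confidence) in matches.items():
    print(sense, "->", target, round(confidence, 2))
```

A pure word-overlap score like this fails when two definitions of the same meaning share few words, which is consistent with the need for the second, hierarchy-based algorithm described above.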


Applications

While no system provides the holy grail of fully automatic high-quality machine translation of unrestricted text, many fully automated systems produce reasonable output. The quality of machine translation is substantially improved if the domain is restricted and controlled. Despite their inherent limitations, MT programs are used around the world. Probably the largest institutional user is the European Commission. One project, for example, coordinated by the University of Gothenburg, received more than 2.375 million euros in project support from the EU to create a reliable translation tool that covers a majority of the EU languages. The further development of MT systems comes at a time when budget cuts in human translation may increase the EU's dependency on reliable MT programs. The European Commission contributed 3.072 million euros (via its ISA programme) for the creation of MT@EC, a statistical machine translation program tailored to the administrative needs of the EU, to replace a previous rule-based machine translation system.

In 2005, Google claimed that promising results were obtained using a proprietary statistical machine translation engine. The statistical translation engine used in the Google language tools for Arabic <-> English and Chinese <-> English had an overall BLEU-4 score of 0.4281, ahead of runner-up IBM's score of 0.3954 (Summer 2006), in tests conducted by the National Institute of Standards and Technology.

With the recent focus on terrorism, military sources in the United States have been investing significant amounts of money in natural language engineering. ''In-Q-Tel'' (a venture capital fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) fostered companies like Language Weaver. Currently the military community is interested in translation and processing of languages like Arabic, Pashto, and Dari. Within these languages, the focus is on key phrases and quick communication between military members and civilians through the use of mobile phone apps. The Information Processing Technology Office in DARPA hosts programs like TIDES and Babylon translator. The US Air Force has awarded a $1 million contract to develop a language translation technology.

The notable rise of social networking on the web in recent years has created yet another niche for the application of machine translation software: utilities such as Facebook, or instant messaging clients such as Skype, Google Talk and MSN Messenger, allow users speaking different languages to communicate with each other. Machine translation applications have also been released for most mobile devices, including mobile telephones, pocket PCs and PDAs. Due to their portability, such instruments have come to be designated as mobile translation tools, enabling mobile business networking between partners speaking different languages, or facilitating both foreign language learning and unaccompanied traveling to foreign countries without the intermediation of a human translator.

Despite being labelled as an unworthy competitor to human translation in 1966 by the Automatic Language Processing Advisory Committee put together by the United States government, the quality of machine translation has now been improved to such levels that its application in online collaboration and in the medical field is being investigated. The application of this technology in medical settings where human translators are absent is another topic of research, but difficulties arise due to the importance of accurate translations in medical diagnoses.

Flaws in machine translation have also been noted for their entertainment value. Two videos uploaded to YouTube in April 2017 involve two Japanese hiragana characters えぐ (''e'' and ''gu'') being repeatedly pasted into Google Translate, with the resulting translations quickly degrading into nonsensical phrases such as "DECEARING EGG" and "Deep-sea squeeze trees", which are then read in increasingly absurd voices; the full-length version of the video had 6.9 million views as of March 2022.


Evaluation

There are many factors that affect how machine translation systems are evaluated. These factors include the intended use of the translation, the nature of the machine translation software, and the nature of the translation process. Different programs may work well for different purposes. For example, statistical machine translation (SMT) typically outperforms example-based machine translation (EBMT), but researchers found that when evaluating English to French translation, EBMT performs better. The same concept applies for technical documents, which can be more easily translated by SMT because of their formal language. In certain applications, however, e.g., product descriptions written in a controlled language, a dictionary-based machine-translation system has produced satisfactory translations that require no human intervention save for quality inspection.

There are various means for evaluating the output quality of machine translation systems. The oldest is the use of human judges to assess a translation's quality. Even though human evaluation is time-consuming, it is still the most reliable method to compare different systems such as rule-based and statistical systems. Automated means of evaluation include BLEU, NIST, METEOR, and LEPOR.

Relying exclusively on unedited machine translation ignores the fact that communication in human language is context-embedded and that it takes a person to comprehend the context of the original text with a reasonable degree of probability. It is certainly true that even purely human-generated translations are prone to error. Therefore, to ensure that a machine-generated translation will be useful to a human being and that publishable-quality translation is achieved, such translations must be reviewed and edited by a human. The late Claude Piron wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve ambiguities in the source text, which the grammatical and lexical exigencies of the target language require to be resolved. Such research is a necessary prelude to the pre-editing necessary in order to provide input for machine-translation software such that the output will not be meaningless. (See the annually performed NIST tests since 2001 and Bilingual Evaluation Understudy.)

In addition to disambiguation problems, decreased accuracy can occur due to varying levels of training data for machine translation programs. Both example-based and statistical machine translation rely on a vast array of real example sentences as a base for translation, and when too many or too few sentences are analyzed, accuracy is jeopardized. Researchers found that when a program is trained on 203,529 sentence pairings, accuracy actually decreases. The optimal level of training data seems to be just over 100,000 sentences, possibly because as training data increases, the number of possible sentences increases, making it harder to find an exact translation match.
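As a rough illustration of how an automated metric such as BLEU scores a translation, the following Python sketch computes clipped n-gram precisions and a brevity penalty for one candidate against one reference. It is a simplified, sentence-level version written for this section: the function names are assumptions, there is no smoothing, and standard BLEU is computed over a whole test corpus, usually with several references per sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(simple_bleu("the cat is on the mat".split(), reference))
print(modified_precision("the the the the".split(), reference, 1))
```

Clipping is what keeps a degenerate candidate such as "the the the the" from scoring perfectly on unigrams: only as many occurrences of "the" are credited as appear in the reference.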


Using machine translation as a teaching tool

Although there have been concerns about machine translation's accuracy, Dr. Ana Nino of the University of Manchester has researched some of the advantages of utilizing machine translation in the classroom. One such pedagogical method is called using "MT as a Bad Model." (Nino, Ana. Machine Translation in Foreign Language Learning: Language Learners' and Tutors' Perceptions of Its Advantages and Disadvantages. ReCALL: the Journal of EUROCALL 21.2 (May 2009) 241–258.) MT as a Bad Model forces the language learner to identify inconsistencies or incorrect aspects of a translation; in turn, the individual will (hopefully) possess a better grasp of the language. Dr. Nino notes that this teaching tool was implemented in the late 1980s. At the end of various semesters, Dr. Nino was able to obtain survey results from students who had used MT as a Bad Model (as well as other models). Overwhelmingly, students felt that they had observed improved comprehension, lexical retrieval, and increased confidence in their target language.


Machine translation and signed languages

In the early 2000s, options for machine translation between spoken and signed languages were severely limited. It was a common belief that deaf individuals could use traditional translators. However, stress, intonation, pitch, and timing are conveyed very differently in spoken languages compared to signed languages. Therefore, a deaf individual may misinterpret or become confused about the meaning of written text that is based on a spoken language. (Zhao, L., Kipper, K., Schuler, W., Vogler, C., & Palmer, M. (2000). A Machine Translation System from English to American Sign Language. Lecture Notes in Computer Science, 1934: 54–67.) Researchers Zhao, et al. (2000) developed a prototype called TEAM (translation from English to ASL by machine) that completed English to American Sign Language (ASL) translations. The program would first analyze the syntactic, grammatical, and morphological aspects of the English text. Following this step, the program accessed a sign synthesizer, which acted as a dictionary for ASL. This synthesizer housed the process one must follow to complete ASL signs, as well as the meanings of these signs. Once the entire text had been analyzed and the signs necessary to complete the translation had been located in the synthesizer, a computer-generated human appeared and used ASL to sign the English text to the user.


Copyright

Only works that are original are subject to copyright protection, so some scholars claim that machine translation results are not entitled to copyright protection because MT does not involve creativity. The copyright at issue is for a derivative work; the author of the original work in the original language does not lose their rights when a work is translated: a translator must have permission to publish a translation.


See also

* AI-complete
* Cache language model
* Comparison of machine translation applications
* Comparison of different machine translation approaches
* Computational linguistics
* Computer-assisted translation and Translation memory
* Controlled language in machine translation
* Controlled natural language
* Foreign language writing aid
* Fuzzy matching
* History of machine translation
* Human language technology
* Humour in translation ("howlers")
* Language and Communication Technologies
* Language barrier
* List of emerging technologies
* List of research laboratories for machine translation
* Mobile translation
* Neural machine translation
* OpenLogos
* Phraselator
* Postediting
* Pseudo-translation
* Round-trip translation
* Statistical machine translation
* Translation memory
* ULTRA (machine translation system)
* Universal Networking Language
* Universal translator


External links


* The Advantages and Disadvantages of Machine Translation
* International Association for Machine Translation (IAMT)
* Machine Translation Archive by John Hutchins. An electronic repository (and bibliography) of articles, books and papers in the field of machine translation and computer-based translation technology
* Machine translation (computer-based translation): Publications by John Hutchins (includes PDFs of several books on machine translation)
* Machine Translation and Minority Languages
* Slator: News & analysis of the latest developments in machine translation