
A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's ''Hexapla'' (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the decipherment of the Ancient Egyptian language to begin.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignment of parallel corpora at the sentence level is a prerequisite for many areas of linguistic research. During translation, sentences can be split, merged, deleted, inserted or reordered by the translator, which makes alignment a non-trivial task.

Parallel texts may also be used in language education.
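The alignment problem described above, in which sentences may be split, merged, deleted or inserted, is often approached with dynamic programming over sentence lengths. The following is a minimal sketch of that idea, not the algorithm of any particular tool: the function names, the bead penalties and the cost function are all illustrative assumptions, and the sketch does not handle reordered sentences.

```python
from typing import List, Tuple

# Hypothetical penalties per "bead" shape: (source sentences, target sentences).
# 1-1 is a plain match, 1-0 / 0-1 model deletion / insertion,
# 2-1 / 1-2 model sentences merged or split by the translator.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}

Bead = Tuple[List[str], List[str]]


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length mismatch; 0.0 when both lengths are equal."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    return abs(src_len - tgt_len) / max(src_len, tgt_len)


def align_sentences(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return the cheapest monotone sequence of beads covering both texts."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state (i, j) by each bead shape.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (di, dj)

    # Backward pass: follow the stored bead shapes from the end to the start.
    beads: List[Bead] = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["He arrived late.", "Nobody noticed.", "The end."]
    target = ["Il est arrivé en retard et personne ne l'a remarqué.", "Fin."]
    for src_part, tgt_part in align_sentences(source, target):
        print(src_part, "<->", tgt_part)
```

Character length is used here only because it is cheap to compute; practical aligners typically combine it with lexical or translation-based evidence.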


Types of parallel corpora

Parallel corpora can be classified into four main categories:
* A ''parallel corpus'' contains translations of the same document in two or more languages, aligned at least at the sentence level. These tend to be rarer than less-comparable corpora.
* A ''noisy parallel corpus'' contains bilingual sentences that are not perfectly aligned or have poor quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.
* A ''comparable corpus'' is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.
* A ''quasi-comparable corpus'' includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.


Noise in corpora

Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events. However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements, which are represented in both corpora, and monolingual elements, which are represented in only one corpus, in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.
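As a rough illustration of cleaning noisy candidate pairs, the sketch below applies two simple heuristics: it drops pairs where one side is empty or both sides are identical (likely untranslated material), and pairs whose lengths differ too much. This is a swapped-in simplification rather than the extraction techniques referred to above; the function name and the `max_ratio` threshold are assumptions chosen for the example.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def filter_noisy_pairs(pairs: Iterable[Pair], max_ratio: float = 2.0) -> List[Pair]:
    """Keep only pairs that plausibly are translations of each other."""
    kept: List[Pair] = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # element present in only one corpus
        if src == tgt:
            continue  # identical on both sides: probably not translated
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # lengths too different to be a clean match
        kept.append((src, tgt))
    return kept


candidates = [
    ("Hello world.", "Bonjour le monde."),
    ("Hello", ""),
    ("Copyright 2020", "Copyright 2020"),
    ("Yes.", "Oui, bien sûr, et voici une longue explication supplémentaire."),
]
clean = filter_noisy_pairs(candidates)
```

A length-ratio threshold is language-pair dependent, so a value such as 2.0 would need tuning on real data.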


Bitext

In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text. Bitexts are generated by a piece of software called an ''alignment tool'', or a ''bitext tool'', which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a ''bitext database'' or a ''bilingual corpus'', and can be consulted with a search tool.
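One way to picture a bitext and the search tool used to consult it is as an ordered list of aligned segment pairs with a lookup function. The sketch below is only an assumed representation for illustration; the `Bitext` class and its `search` method are hypothetical names, not the format of any existing bitext tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bitext:
    source_lang: str
    target_lang: str
    # Aligned (source, target) segments, kept in original document order.
    segments: List[Tuple[str, str]]

    def search(self, term: str) -> List[Tuple[int, str, str]]:
        """Return (position, source, target) for segments whose source contains term."""
        needle = term.lower()
        return [
            (index, src, tgt)
            for index, (src, tgt) in enumerate(self.segments)
            if needle in src.lower()
        ]


bitext = Bitext(
    source_lang="en",
    target_lang="fr",
    segments=[
        ("The treaty enters into force today.", "Le traité entre en vigueur aujourd'hui."),
        ("It was signed in Paris.", "Il a été signé à Paris."),
        ("The treaty has ten articles.", "Le traité comporte dix articles."),
    ],
)
hits = bitext.search("treaty")
```

Because each hit carries its position, a reader can look at the neighbouring segments to recover the surrounding context.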


Bitexts and translation memories

''Bitexts'' have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow the original order of sentences to be preserved.

Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance. In his original 1988 article, Harris also posited that bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up.

Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and Tradooit.
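To make the TMX point concrete, the sketch below writes aligned segments as translation units in document order, using only Python's standard library. It is a minimal example under stated assumptions: the `build_tmx` helper is hypothetical, the header values are placeholders, and the element and attribute names follow TMX 1.4 as commonly documented, so they should be checked against the specification before real use.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs: Iterable[Tuple[str, str]], src_lang: str = "en", tgt_lang: str = "fr") -> str:
    """Serialise aligned (source, target) segments as a TMX document string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # placeholder value
        "creationtoolversion": "0.1",       # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain-text",              # placeholder value
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    # One <tu> per aligned pair; writing them in order keeps the sentence order.
    for src, tgt in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


tmx_text = build_tmx([
    ("It was signed in Paris.", "Il a été signé à Paris."),
    ("The end.", "Fin."),
])
```

Whether a CAT program actually relies on the order of `<tu>` elements depends on the program, so the ordering here should be treated as a convention rather than a guarantee.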


See also

* Bilingual inscription
* Computer-assisted reviewing
* Example-based machine translation
* Natural language processing
* Polyglot (book)
* Ruby character
* Statistical machine translation


External links


Parallel corpora


* The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law: ''Acquis Communautaire'' with 231 language pairs
* European Parliament Proceedings Parallel Corpus 1996–2011
* The Opus project aims at collecting freely available parallel corpora
* COMPARA – Portuguese/English parallel corpora (http://www.linguateca.pt/COMPARA/)
* TERMSEARCH – English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
* TradooIT – English/French/Spanish – Free Online tools
* Nunavut Hansard – English/Inuktitut parallel corpus
* ParaSol – A parallel corpus of Slavic and other languages
* Glosbe: Multilanguage parallel corpora with online search interface
* InterCorp: A multilingual parallel corpus, 40 languages aligned with Czech, online search interface
* myCAT – Olanto concordancer (open source AGPL) with online search on JCR and UNO corpus
* TAUS with online search interface
* linguatools multilingual parallel corpora, online search interface
* EUR-Lex Corpus – corpus built from the EUR-Lex database, which consists of European Union law and other public documents of the European Union
* Language Grid – Multilingual service platform that includes parallel text services


Documentation



* Proceedings of the 2003 Workshop on Building and Using Parallel Texts (https://web.archive.org/web/20060913013656/https://www.cs.unt.edu/~rada/wpt/)
* Proceedings of the 2005 Workshop on Building and Using Parallel Texts


Alignment tools




* Uplug – tools for processing parallel corpora (2003)
* An implementation of the Gale and Church sentence alignment algorithm (2005)
* The Hunalign sentence aligner (2005)
* Champollion (2006)
* mALIGNa (2008–2020)
* Gargantua sentence aligner (2010)
* Bleualign – machine translation based sentence alignment (2010)
* YASA (2013)
* Hierarchical alignment tool (HAT) (2018)
* Vecalign sentence alignment algorithm (2019)
* Web Alignment Tool at University of Grenoble