Parallel text alignment

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery made it possible to begin deciphering the Ancient Egyptian language.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignment of parallel corpora at the sentence level is a prerequisite for many areas of linguistic research. During translation, sentences can be split, merged, deleted, inserted or reordered by the translator, which makes alignment a non-trivial task. Parallel texts may also be used in language education.


Types of parallel corpora

Parallel corpora can be classified into four main categories:
* A ''parallel corpus'' contains translations of the same document in two or more languages, aligned at least at the sentence level. These tend to be rarer than less-comparable corpora.
* A ''noisy parallel corpus'' contains bilingual sentences that are not perfectly aligned or have poor quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.
* A ''comparable corpus'' is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.
* A ''quasi-comparable corpus'' includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.


Noise in corpora

Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events. However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.
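The distinction between bilingual and monolingual elements can be illustrated with a toy sketch. It assumes a small hand-made lexicon (`LEXICON`, hypothetical) and whitespace tokenization, and the function names are invented for illustration; real extraction systems typically rely on statistical or neural translation models rather than a lookup table.

```python
# Toy English-to-French lexicon; a hypothetical stand-in for a learned
# translation model.
LEXICON = {
    "the": {"le", "la", "les"},
    "president": {"président"},
    "visited": {"visité"},
    "paris": {"paris"},
    "yesterday": {"hier"},
}


def split_elements(source, target, lexicon=LEXICON):
    """Separate source tokens into bilingual and monolingual elements.

    A source token counts as bilingual when one of its lexicon
    translations occurs in the target sentence; otherwise it is
    treated as monolingual (represented on one side only).
    """
    target_tokens = set(target.lower().split())
    bilingual, monolingual = [], []
    for token in source.lower().split():
        if lexicon.get(token, set()) & target_tokens:
            bilingual.append(token)
        else:
            monolingual.append(token)
    return bilingual, monolingual


def coverage(source, target, lexicon=LEXICON):
    """Fraction of source tokens that have a counterpart in the target."""
    bilingual, monolingual = split_elements(source, target, lexicon)
    total = len(bilingual) + len(monolingual)
    return len(bilingual) / total if total else 0.0
```

A pair such as `"the president visited paris yesterday"` and `"le président a visité paris"` would leave `yesterday` as the only monolingual element; a filter could then keep the pair only when `coverage` exceeds a chosen threshold.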


Bitext

In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text. Bitexts are generated by a piece of software called an ''alignment tool'', or a ''bitext tool'', which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a ''bitext database'' or a ''bilingual corpus'', and can be consulted with a search tool.
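As a rough illustration of sentence-by-sentence matching, the following sketch aligns two lists of sentences with dynamic programming over character lengths. It is written in the spirit of length-based aligners, but it uses a simplified cost and assumed penalty values (`BEADS`) rather than the probabilistic model of any particular tool; all names are hypothetical.

```python
# Allowed alignment "beads": (source sentences consumed, target sentences
# consumed) mapped to an illustrative penalty (assumed values, not tuned).
BEADS = {
    (1, 1): 0.0,  # one-to-one
    (1, 0): 3.0,  # sentence deleted in translation
    (0, 1): 3.0,  # sentence inserted in translation
    (2, 1): 1.0,  # two source sentences merged
    (1, 2): 1.0,  # one source sentence split
}


def length_cost(src_chunk, tgt_chunk):
    """Penalize chunks whose character lengths differ (simplified cost)."""
    s = sum(len(x) for x in src_chunk)
    t = sum(len(x) for x in tgt_chunk)
    return abs(s - t) / max(s, t, 1)


def align(src, tgt):
    """Return a list of (source_sentences, target_sentences) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # best[i][j] = minimal cost of aligning src[:i] with tgt[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Trace back from the end to recover the chosen beads in order.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

Because the result keeps the beads in document order, it can serve directly as a small bitext: each pair holds the source sentences and the target sentences judged to correspond.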


Bitexts and translation memories

''Bitexts'' have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow the original order of sentences to be preserved. Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance. In his original 1988 article on bitext, Brian Harris also posited that bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up. Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and Tradooit.
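To make the point about TMX and sentence order concrete, here is a minimal sketch that serializes aligned sentence pairs as a TMX-style document using Python's standard `xml.etree.ElementTree` module. The function name `bitext_to_tmx` and the header values (tool name, version, format labels) are placeholders chosen for illustration; a production exporter would follow the TMX specification more closely.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def bitext_to_tmx(pairs, src_lang="en", tgt_lang="fr"):
    """Build a minimal TMX tree from (source, target) sentence pairs.

    Translation units are written in input order, so the sentence
    order of the bitext survives the export.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(
        tmx,
        "header",
        {
            "creationtool": "bitext-sketch",  # hypothetical tool name
            "creationtoolversion": "0.1",
            "segtype": "sentence",
            "o-tmf": "plain",  # placeholder original-format label
            "adminlang": "en",
            "srclang": src_lang,
            "datatype": "plaintext",
        },
    )
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

The returned tree can be saved with `bitext_to_tmx(pairs).write("out.tmx", encoding="utf-8", xml_declaration=True)`, where `out.tmx` is an arbitrary file name.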


See also

* Bilingual inscription
* Computer-assisted reviewing
* Example-based machine translation
* Natural language processing
* Polyglot (book)
* Ruby character
* Statistical machine translation


External links


Parallel corpora


* The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law: ''Acquis Communautaire'' with 231 language pairs
* European Parliament Proceedings Parallel Corpus 1996–2011
* The Opus project aims at collecting freely available parallel corpora
* COMPARA – Portuguese/English parallel corpora (http://www.linguateca.pt/COMPARA/)
* TERMSEARCH – English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
* TradooIT – English/French/Spanish – Free Online tools
* Nunavut Hansard – English/Inuktitut parallel corpus
* ParaSol – A parallel corpus of Slavic and other languages
* Glosbe: Multilanguage parallel corpora with online search interface
* InterCorp: A multilingual parallel corpus, 40 languages aligned with Czech, online search interface
* myCAT – Olanto concordancer (open source AGPL) with online search on JRC and UNO corpora
* TAUS with online search interface
* linguatools – multilingual parallel corpora, online search interface
* EUR-Lex Corpus – a corpus built from the EUR-Lex database, consisting of European Union law and other public documents of the European Union
* Language Grid – Multilingual service platform that includes parallel text services


Documentation



* Proceedings of the 2003 Workshop on Building and Using Parallel Texts (https://web.archive.org/web/20060913013656/https://www.cs.unt.edu/~rada/wpt/)
* Proceedings of the 2005 Workshop on Building and Using Parallel Texts


Alignment tools




* Uplug – tools for processing parallel corpora (2003)
* An implementation of the Gale and Church sentence alignment algorithm (2005)
* The Hunalign sentence aligner (2005)
* Champollion (2006)
* mALIGNa (2008–2020)
* Gargantua sentence aligner (2010)
* Bleualign – machine translation based sentence alignment (2010)
* YASA (2013)
* Hierarchical alignment tool (HAT) (2018)
* Vecalign sentence alignment algorithm (2019)
* Web Alignment Tool at University of Grenoble