
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. Its name comes from the Japanese phrase "tatoeba" (例えば), meaning "for example". It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans. It is hosted by Association Tatoeba, a French non-profit organization funded through donations.

As of November 2022, the Tatoeba Corpus has over 10,800,000 sentences in 420 languages. Of these languages, 55 have 10,000 or more sentences. About 1 million sentences have audio recordings. The sentences are interrelated within a graph, facilitating translations in different languages. As of November 2022, the Tatoeba Graph lists over 21,800,000 links between sentences, and 237 language pairs have over 10,000 translated sentences.


History

In 2006, Trang Ho was frustrated that, unlike some of their Japanese counterparts, German bilingual dictionaries didn't feature full-text search of usage examples with translations. It led her to imagine her ideal dictionary and to build a prototype hosted on SourceForge under the name "multilangdict." The main focus was already the crowdsourcing of translated sentences: "A Wikipedia type of thing, except people add sentences, not articles." Alongside her studies at the University of Technology of Compiègne, Trang Ho gradually improved her website with a few classmates. She rebuilt the project from scratch twice and rebranded it as Tatoeba.

In September 2007, about 150,000 English-Japanese sentence pairs from the Tanaka Corpus (a public-domain compilation released in 2001 by Hyogo University professor Yasuhito Tanaka and maintained by Jim Breen and Paul Blay) were imported into the Tatoeba Corpus. In December 2008, Trang Ho released the first version of the current codebase, built around a more flexible data model. The following month, the website moved to the tatoeba.org domain.

Over the 2009-2010 academic year, Allan Simon, then a student at SUPINFO, became a core developer of Tatoeba. Together with Trang Ho and other young developers, he made Tatoeba more social by adding sentence lists, user profiles, private messaging, and a Facebook-inspired Wall. They also introduced significant features like sentence linking, tagging, and "translation of translation" search. In November 2010, Tatoeba passed the 600,000-sentence mark. Within a year, the number of sentences added daily had increased almost 50-fold.

Between 2014 and 2016, a new team of developers formed around Trang Ho. They mentored students at the Google Summer of Code 2014 and added features to improve corpus quality. Over the 2018-2020 period, support from the Mozilla Foundation as part of the Common Voice project allowed Tatoeba to make its platform more open and user-friendly.


Openness


Reading

Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba Corpus is displayed next to its likely translations in other languages; translations and "translations of translations" are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity. Each sentence also has its own comment thread for feedback and corrections from other users and for cultural notes. Sentences can be browsed by language, tag, and other criteria.
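
The distinction between translations and "translations of translations" can be pictured as a walk over the link graph: direct translations are a sentence's neighbours, and indirect ones are the neighbours of those neighbours. The following is a minimal sketch under stated assumptions, not Tatoeba's actual implementation; the sample sentences, the `links` list, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: sentence id -> (language code, text).
sentences = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("deu", "Zum Beispiel."),
}

# Hypothetical links between sentence ids, assumed to be bidirectional.
links = [(1, 2), (1, 3), (3, 4)]


def build_graph(pairs):
    """Return an adjacency map treating every link as bidirectional."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def translations(graph, sentence_id):
    """Split related sentences into direct and indirect translations."""
    direct = set(graph.get(sentence_id, ()))
    indirect = set()
    for neighbour in direct:
        indirect |= graph.get(neighbour, set())
    # A sentence is never its own translation, and direct links take priority.
    indirect -= direct | {sentence_id}
    return direct, indirect


graph = build_graph(links)
direct, indirect = translations(graph, 2)
```

In this toy data the Japanese sentence is linked only to the English one, so the French sentence is reached only through English and lands in the indirect set.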


Editing

Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, users are encouraged to add original sentences or translations in their native or strongest language. Users can freely edit their sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.
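
The ranks described above can be read as an ordered permission ladder. The sketch below is illustrative only: the action names are hypothetical labels, and it assumes that each rank inherits every privilege of the ranks below it, which the text implies but does not state outright.

```python
from enum import IntEnum


class Role(IntEnum):
    """Contributor ranks named in the text, ordered from lowest to highest."""
    CONTRIBUTOR = 1
    ADVANCED_CONTRIBUTOR = 2
    CORPUS_MAINTAINER = 3


# Minimum rank required for each action (action names are hypothetical).
MIN_ROLE = {
    "add_sentence": Role.CONTRIBUTOR,
    "translate": Role.CONTRIBUTOR,
    "comment": Role.CONTRIBUTOR,
    "adopt_unowned": Role.CONTRIBUTOR,
    "tag": Role.ADVANCED_CONTRIBUTOR,
    "link": Role.ADVANCED_CONTRIBUTOR,
    "unlink": Role.ADVANCED_CONTRIBUTOR,
    "untag": Role.CORPUS_MAINTAINER,
    "delete": Role.CORPUS_MAINTAINER,
    "edit_owned_by_other": Role.CORPUS_MAINTAINER,
}


def can(role, action):
    """Return True if the role meets the minimum rank for the action."""
    return role >= MIN_ROLE[action]
```

For instance, `can(Role.ADVANCED_CONTRIBUTOR, "tag")` is true while `can(Role.ADVANCED_CONTRIBUTOR, "untag")` is false, mirroring the split between tagging and untagging described above.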


Operation

Tatoeba received a grant from Mozilla Drumbeat in December 2010. Some work on the Tatoeba infrastructure was sponsored by the 2014 edition of Google Summer of Code. In May 2018, the project received a $25,000 grant from the Mozilla Open Source Support (MOSS) program, followed by a $15,000 MOSS grant in August 2019.


Access to content


Content licensing

By default, the sentences of the Tatoeba Corpus are published under a Creative Commons Attribution 2.0 license, freeing them for academic and other use. Users can also contribute sentences under Creative Commons Zero, though translations of those sentences currently can't share the same license. Audio recordings of the sentences use the speaker's choice of license, such as CC BY 4.0, BY-SA, BY-NC, or no public license at all.


Offline use

At the Tatoeba website, visitors can download tab-delimited sentence pairs ready for import into Anki and similar spaced repetition software.
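
As an illustration of what working with such a download might look like, here is a minimal sketch that reads tab-delimited sentence pairs with Python's standard `csv` module. The two-column layout (source sentence, then translation) and the sample text are assumptions made for the example; the actual export may contain additional columns.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded file; the two-column
# layout (source sentence, translation) is an assumption.
SAMPLE = "Hello.\tBonjour.\nThank you.\tMerci.\n"


def read_pairs(handle):
    """Yield (source, translation) tuples from tab-delimited text."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:  # skip blank or malformed lines
            yield row[0], row[1]


pairs = list(read_pairs(io.StringIO(SAMPLE)))
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by a handle such as `open(path, encoding="utf-8", newline="")`.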


Related projects


Second-language acquisition

The JMdict Japanese-English dictionary selects its example sentences from the Tatoeba Corpus. OpenRussian is a free Russian dictionary built primarily from the content of Wiktionary and Tatoeba. GoodExample tries to automatically extract a diverse set of high-quality example sentences from the English Tatoeba Corpus. Reverso uses Tatoeba parallel corpora in its bilingual concordancer. Charles Kelly and Paul Raine, both EFL teachers in Japan, have developed language learning activities based on sentences curated from the Tatoeba Corpus. Clozemaster is a language self-study program that generates gamified cloze tests from Tatoeba sentence pairs. Some Anki users share flashcards that were created using Tatoeba.

Tatoeba datasets can power incidental learning experiences that blend the acquisition of a foreign language with the user's everyday activities like web browsing or book reading. A team at MIT Media Lab used example sentences from Tatoeba in WordSense, a mixed reality platform that enables "serendipitous language learning in the wild." More recently, Japanese researchers implemented a Tatoeba search feature in an integrated writing assistance environment.
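
To give a rough idea of how a cloze test can be derived from a sentence pair, the sketch below blanks out one word of a sentence and keeps the translation as a hint. It is a simplified illustration, not the method used by Clozemaster or any other tool named above; the function name and the "blank the longest word" heuristic are assumptions chosen for brevity.

```python
import re


def make_cloze(sentence, translation, blank="____"):
    """Blank out the longest word of `sentence`; return (prompt, hint, answer)."""
    words = re.findall(r"\w+", sentence)
    if not words:
        raise ValueError("sentence contains no words")
    answer = max(words, key=len)  # on ties, the first longest word is used
    pattern = r"\b" + re.escape(answer) + r"\b"
    prompt = re.sub(pattern, blank, sentence, count=1)
    return prompt, translation, answer


prompt, hint, answer = make_cloze(
    "The cat sleeps on the sofa.",
    "Le chat dort sur le canapé.",
)
```

A real tool would more plausibly choose the blanked word by frequency or difficulty rather than length.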


Regional or minority languages

Some digital language activists contribute to open collaborative projects like Tatoeba, Wikipedia, and Common Voice to promote their minority language in digital spaces. Regional languages like Kabyle, Catalan, or Basque can each have more than a hundred registered members on Tatoeba.


Constructed languages

Selected content from Tatoeba in Esperanto is available in the multilingual DVD Esperanto Elektronike published by E@I. As of November 2022, Esperanto is Tatoeba's fifth pivot language, with over 330,000 sentences translated into at least two languages. Other constructed languages like Toki Pona, Interlingua, Klingon, Lojban, and Ido also have a significant footprint.


Language technology

From 2008 to 2011, Francis Bond used the Tatoeba Corpus for his research on the Japanese language. Since 2013, Jörg Tiedemann has been spreading Tatoeba parallel corpora more widely in the machine translation community by sharing them on the OPUS repository and organizing the "Tatoeba Translation Challenge". With the rise of deep learning, researchers increasingly use Tatoeba's data sets to train and evaluate their massively multilingual models in tasks like machine translation, language identification, semantic search, and speech recognition.


See also

* Phrase book
* Parallel text
* Common Voice
* Lingua Libre
* Wiktionary


External links

* Video of Trang Ho introducing Tatoeba at MozFest 2019
* Tatoeba's statistics
* Tatoeba Translation Challenge