Language Identification

In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.


Overview

There are several statistical approaches to language identification, each using a different technique to classify the data. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as the mutual information based distance measure. The same technique can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods. The mutual information based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques.

Another technique, described by Cavnar and Trenkle (1994) and Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text needing to be identified, a similar model is built and compared to each stored language model. The most likely language is the one whose model is most similar to the model of the text being identified. This approach can be problematic when the input text is in a language for which there is no model; in that case, the method may return another, "most similar" language as its result. Also problematic for any approach are pieces of input text composed of several languages, as is common on the Web.

For a more recent method, see Řehůřek and Kolkus (2009). This method can detect multiple languages in an unstructured piece of text and works robustly on short texts of only a few words, something that the n-gram approaches struggle with.

An older statistical method by Grefenstette was based on the prevalence of certain function words (e.g., "the" in English). A common non-statistical, intuitive approach (though highly uncertain) is to look for common letter combinations, or distinctive diacritics or punctuation.
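
The character n-gram technique described above can be illustrated with a small Python sketch in the spirit of Cavnar and Trenkle's rank-order comparison: each text is reduced to a ranked list of its most frequent character n-grams, and a document is assigned to the language whose ranking it disturbs least. The function names, the tiny training texts, and the parameter values (`max_n=3`, `top_k=300`) are assumptions made for this illustration, not a reproduction of the published system, and real profiles would be built from far larger training texts.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams (n = 1..max_n) of a text."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically
    # so that the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile.

    N-grams absent from the language profile receive a fixed maximum penalty.
    """
    max_penalty = len(lang_profile)
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Hypothetical, deliberately tiny training texts (illustration only).
TRAINING_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
LANG_PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING_TEXTS.items()}

print(identify("the dog sleeps in the house", LANG_PROFILES))
```

Because `identify` always returns the closest stored profile, the sketch also reproduces the weakness noted above: text in a language without a profile is still assigned to whichever known language happens to be most similar.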

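The compressibility comparison mentioned in the overview can be sketched in a similar way. The idea, roughly, is that text appended to a reference in the same language compresses better than text appended to a reference in a different language. The sketch below uses Python's standard `zlib` module; the helper names and the two short reference texts are assumptions for illustration, and this is not the exact procedure of Benedetto et al. (2002).

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def compression_cost(reference: str, text: str) -> int:
    """Extra compressed bytes needed when `text` is appended to `reference`."""
    ref_bytes = reference.encode("utf-8")
    combined = ref_bytes + text.encode("utf-8")
    return compressed_size(combined) - compressed_size(ref_bytes)


def identify_by_compression(text, references):
    """Return the language whose reference text makes `text` cheapest to compress."""
    return min(
        references,
        key=lambda lang: compression_cost(references[lang], text),
    )


# Hypothetical reference texts; real use would need much longer samples.
REFERENCES = {
    "en": ("language identification is the problem of determining which "
           "natural language a given text is written in, and it is usually "
           "solved with statistical methods"),
    "de": ("spracherkennung ist die aufgabe, die natürliche sprache eines "
           "gegebenen textes zu bestimmen, und sie wird meist mit "
           "statistischen verfahren gelöst"),
}

# The query is deliberately a phrase taken from the English reference,
# so the compressor can reuse it almost for free.
QUERY = "the problem of determining which natural language"

print(identify_by_compression(QUERY, REFERENCES))
```

With references this short the result is only meaningful for queries that share long substrings with one reference; the approach relies on large reference corpora to capture the general statistics of each language.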

Identifying similar languages

One of the great bottlenecks of language identification systems is distinguishing between closely related languages. Similar languages like Bulgarian and Macedonian or Indonesian and Malay present significant lexical and structural overlap, making it challenging for systems to discriminate between them. In 2014 the DSL shared task was organized, providing a dataset (Tan et al., 2014) containing 13 different languages (and language varieties) in six language groups: Group A (Bosnian, Croatian, Serbian), Group B (Indonesian, Malaysian), Group C (Czech, Slovak), Group D (Brazilian Portuguese, European Portuguese), Group E (Peninsular Spanish, Argentine Spanish), Group F (American English, British English). The best system achieved over 95% accuracy (Goutte et al., 2014). Results of the DSL shared task are described in Zampieri et al. (2014).


Software

* Apache OpenNLP includes a character n-gram based statistical detector and comes with a model that can distinguish 103 languages
* Apache Tika contains a language detector for 18 languages


References

* Benedetto, D., E. Caglioti and V. Loreto. Language trees and zipping. ''Physical Review Letters'', 88:4 (2002).
* Cavnar, William B. and John M. Trenkle. "N-Gram-Based Text Categorization". Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994).
* Cilibrasi, Rudi and Paul M.B. Vitanyi. Clustering by compression. ''IEEE Transactions on Information Theory'' 51(4), April 2005, 1523-1545.
* Dunning, T. (1994) "Statistical Identification of Language". Technical Report MCCS 94-273, New Mexico State University, 1994.
* Goodman, Joshua. (2002) Extended comment on "Language Trees and Zipping". Microsoft Research, Feb 21 2002. (This is a criticism of the data compression in favor of the Naive Bayes method.)
* Goutte, C.; Leger, S.; Carpuat, M. (2014) The NRC System for Discriminating Similar Languages. Proceedings of the Coling 2014 workshop "Applying NLP Tools to Similar Languages, Varieties and Dialects".
* Grefenstette, Gregory. (1995) Comparing two language identification schemes. ''Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data'' (JADT 1995).
* Poutsma, Arjen. (2001) Applying Monte Carlo techniques to language identification. SmartHaven, Amsterdam. Presented a
* Tan, L.; Zampieri, M.; Ljubešić, N.; Tiedemann, J. (2014) Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). Reykjavik, Iceland. p. 6-10.
* The Economist. (2002) The elements of style: Analysing compressed data leads to impressive results in linguistics.
* Řehůřek, Radim and Milan Kolkus. (2009) Language Identification on the Web: Extending the Dictionary Method. ''Computational Linguistics and Intelligent Text Processing''.
* Zampieri, M.; Tan, L.; Ljubešić, N.; Tiedemann, J. (2014) A Report on the DSL Shared Task 2014. Proceedings of the 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial). Dublin, Ireland. p. 58-67.


See also

* Native Language Identification
* Algorithmic information theory
* Artificial grammar learning
* Family name affixes
* Kolmogorov complexity
* Language Analysis for the Determination of Origin
* Machine translation
* Translation

