The Automated Similarity Judgment Program (ASJP) is a collaborative project applying computational approaches to comparative linguistics using a database of word lists. The database is open access and consists of 40-item basic-vocabulary lists for well over half of the world's languages. It is continuously being expanded. In addition to isolates and languages of demonstrated genealogical groups, the database includes pidgins, creoles, mixed languages, and constructed languages. Words of the database are transcribed into a simplified standard orthography (ASJPcode) (Brown, Cecil H., Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. "Automated classification of the world's languages: A description of the method and preliminary results". ''STUF – Language Typology and Universals'' 61.4: 285-308). The database has been used to estimate dates at which language families have diverged into daughter languages by a method related to but still different from glottochronology, to determine the homeland (Urheimat) of a proto-language, to investigate sound symbolism, to evaluate different phylogenetic methods, and for several other purposes. ASJP is not widely accepted among historical linguists as an adequate method to establish or evaluate relationships between language families. It is part of the Cross-Linguistic Linked Data project hosted by the Max Planck Institute for the Science of Human History.

History

Original goals

ASJP was originally developed as a means for objectively evaluating the similarity of words with the same meaning from different languages, with the ultimate goal of classifying languages computationally, based on the lexical similarities observed. In the first ASJP paper, two semantically identical words from compared languages were judged similar if they showed at least two identical sound segments. Similarity between the two languages was calculated as a percentage of the total number of words compared that were judged as similar. This method was applied to 100-item word lists for 250 languages from language families including Austroasiatic, Indo-European, Mayan, and Muskogean.
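The original judgment rule can be illustrated with a minimal Python sketch. The text does not say how "identical sound segments" were matched, so the sketch assumes that segments are single characters and that they are counted regardless of their position in the word. The function names and the toy word lists are hypothetical and are not taken from the ASJP software.

```python
from collections import Counter


def shared_segments(word_a, word_b):
    """Count segments (here: single characters) that occur in both words.

    Assumption: position is ignored and repeated segments are counted
    as often as they occur in both words (multiset intersection).
    """
    return sum((Counter(word_a) & Counter(word_b)).values())


def judged_similar(word_a, word_b, threshold=2):
    """Judge two same-meaning words similar if they share >= threshold segments."""
    return shared_segments(word_a, word_b) >= threshold


def similarity_percentage(list_a, list_b):
    """Percentage of compared meanings whose words are judged similar.

    list_a and list_b map a meaning to the word for it in each language;
    only meanings attested in both lists are compared.
    """
    shared_meanings = [m for m in list_a if m in list_b]
    if not shared_meanings:
        raise ValueError("no meanings in common")
    hits = sum(judged_similar(list_a[m], list_b[m]) for m in shared_meanings)
    return 100.0 * hits / len(shared_meanings)


# Toy, invented word lists for two hypothetical languages.
lang_a = {"hand": "hand", "dog": "dog"}
lang_b = {"hand": "hant", "dog": "kat"}
print(similarity_percentage(lang_a, lang_b))
```

In this toy comparison only the words for "hand" share at least two segments, so one of the two compared meanings counts as similar.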

ASJP Consortium

The ASJP Consortium, founded around 2008, came to involve around 25 professional linguists and other interested parties working as volunteer transcribers and/or extending aid to the project in other ways. The main driving force behind the founding of the consortium was Cecil H. Brown. Søren Wichmann is the daily curator of the project. A third central member of the consortium is Eric W. Holman, who has created most of the software used in the project.

Shorter word lists

While the word lists used were originally based on the 100-item Swadesh list, it was statistically determined that a subset of 40 of the 100 items produced classificatory results just as good as, if not slightly better than, those of the whole list. Word lists gathered subsequently therefore contain only 40 items (or fewer, when attestations for some are lacking).

Levenshtein distance

In papers published since 2008, ASJP has employed a similarity judgment program based on Levenshtein distance (LD). This approach was found to produce better classificatory results, measured against expert opinion, than the method used initially. LD is defined as the minimum number of successive changes necessary to convert one word into another, where each change is the insertion, deletion, or substitution of a symbol. Within the Levenshtein approach, differences in word length can be corrected for by dividing LD by the number of symbols of the longer of the two compared words. This produces normalized LD (LDN). An LDN divided (LDND) between the two languages is calculated by dividing the average LDN for all the word pairs involving the same meaning by the average LDN for all the word pairs involving different meanings. This second normalization is intended to correct for chance similarity.
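The three measures can be sketched in Python as follows. This is an illustration of the definitions given here, not the ASJP software itself: the function names and the toy word lists are hypothetical, every character is treated as one symbol, and any ASJPcode modifier marks are ignored.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a, b):
    """LD divided by the length of the longer word (normalized LD)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a, list_b):
    """Mean LDN of same-meaning pairs divided by mean LDN of different-meaning pairs.

    list_a and list_b map a meaning to the word for it in each language.
    """
    meanings = [m for m in list_a if m in list_b]
    same = [ldn(list_a[m], list_b[m]) for m in meanings]
    diff = [ldn(list_a[m], list_b[n])
            for m in meanings for n in meanings if m != n]
    if not same or not diff:
        raise ValueError("need at least two meanings in common")
    mean_diff = sum(diff) / len(diff)
    if mean_diff == 0:
        raise ValueError("different-meaning pairs are all identical")
    return (sum(same) / len(same)) / mean_diff


# Toy, invented word lists for two hypothetical languages.
lang_a = {"x": "ab", "y": "cd"}
lang_b = {"x": "ab", "y": "ce"}
print(ldnd(lang_a, lang_b))
```

In the toy example the same-meaning pairs have a mean LDN of 0.25 and the different-meaning pairs a mean LDN of 1.0, so the LDND is 0.25.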

Word list

The ASJP uses the following 40-word list (http://asjp.clld.org/static/Guidelines.pdf). It is similar to the Swadesh–Yakhontov list, but has some differences.

;Body parts
*eye
*ear
*nose
*tongue
*tooth
*hand
*knee
*blood
*bone
*breast (woman’s)
*liver
*skin
;Animals and plants
*louse
*dog
*fish (noun)
*horn (animal part)
*tree
*leaf
;People
*person
*name (noun)
;Nature
*sun
*star
*water
*fire
*stone
*path
*mountain
*night (dark time)
;Verbs and adjectives
*drink (verb)
*die
*see
*hear
*come
*new
*full
;Numerals and pronouns
*one
*two
*I
*you
*we

ASJPcode

The 2016 version of ASJP uses the following symbols to encode phonemes:

p b f v m w 8 t d s z c n r l S Z C j T 5 y k g x N q X h 7 L 4 G ! i e E 3 a u o

They represent 7 vowels and 34 consonants, all found on the standard QWERTY keyboard. A ~ mark follows two consonants so that they are considered to be in the same position. A syllable written this way is considered lexically similar to syllables that have either of the two consonants in that position. Similarly, a $ mark follows three consonants so that they are considered to be in the same position. A " mark marks the preceding consonant as glottalized.
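The grouping of consonants into positions can be sketched in Python. The sketch assumes that the two-consonant mark is written `~`, the three-consonant mark `$`, and the glottalization mark `"`; these are passed as parameters so they can be changed. The function name and the example transcriptions are hypothetical and are not taken from the ASJP software or database.

```python
VOWELS = set("ieE3auo")
CONSONANTS = set("pbfvmw8tdszcnrlSZCjT5ykgxNqXh7L4G!")


def asjp_positions(word, pair_mark="~", triple_mark="$", glottal_mark='"'):
    """Split a transcription into positions, merging marked consonant groups.

    pair_mark merges the two preceding positions into one, triple_mark
    merges the three preceding positions, and glottal_mark is attached
    to the preceding position. The mark characters are assumptions.
    """
    positions = []
    for ch in word:
        if ch in (pair_mark, triple_mark):
            size = 2 if ch == pair_mark else 3
            if len(positions) < size:
                raise ValueError(f"{ch!r} needs {size} preceding symbols")
            positions[-size:] = ["".join(positions[-size:])]
        elif ch == glottal_mark:
            if not positions:
                raise ValueError(f"{ch!r} needs a preceding symbol")
            positions[-1] += ch
        elif ch in VOWELS or ch in CONSONANTS:
            positions.append(ch)
        else:
            raise ValueError(f"unknown symbol {ch!r}")
    return positions


# Invented example transcriptions.
print(asjp_positions("kw~at"))
print(asjp_positions("ndy$im"))
print(asjp_positions('k"at'))
```

Treating a merged group as a single position means that a distance measure run over these positions compares a consonant cluster against one position of the other word rather than against two or three.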

References

Sources

*Wichmann, Søren, and Jeff Good (eds.). 2014. ''Quantifying Language Dynamics: On the Cutting Edge of Areal and Phylogenetic Linguistics'', p. 203. Leiden: Brill.
*Brown, Cecil H., et al. 2008. "Automated Classification of the World's Languages: A Description of the Method and Preliminary Results". ''Language Typology and Universals'' 61(4). November 2008.
*Wichmann, Søren, Eric W. Holman, and Cecil H. Brown (eds.). 2018. ''The ASJP Database'' (version 18).

External links

*ASJP Database, official home page