Automated Similarity Judgment Program

Producer: Max Planck Institute for the Science of Human History (Germany)
Languages: English
Cost: Free
Disciplines: Quantitative comparative linguistics
Website: asjp.clld.org

The Automated Similarity Judgment Program (ASJP) is a collaborative project applying computational approaches to comparative linguistics using a database of word lists. The database is open access and consists of 40-item basic-vocabulary lists for well over half of the world's languages. [1] It is continuously being expanded. In addition to isolates and languages of demonstrated genealogical groups, the database includes pidgins, creoles, mixed languages, and constructed languages. Words in the database are transcribed into a simplified standard orthography (ASJPcode). [2] The database has been used to estimate the dates at which language families diverged into daughter languages by a method related to, but distinct from, glottochronology, [3] to determine the homeland (Urheimat) of a proto-language, [4] to investigate sound symbolism, [5] to evaluate different phylogenetic methods, [6] and for several other purposes.

ASJP is not widely accepted among historical linguists as an adequate method to establish or evaluate relationships between language families. [7]

It is part of the Cross-Linguistic Linked Data project hosted by the Max Planck Institute for the Science of Human History. [8]

History

Original goals

ASJP was originally developed as a means of objectively evaluating the similarity of words with the same meaning from different languages, with the ultimate goal of classifying languages computationally on the basis of the lexical similarities observed. In the first ASJP paper, [2] two semantically identical words from the compared languages were judged similar if they shared at least two identical sound segments. The similarity between two languages was then calculated as the percentage of compared words that were judged similar. This method was applied to 100-item word lists for 250 languages from language families including Austroasiatic, Indo-European, Mayan, and Muskogean.
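
The percentage measure described above can be illustrated with a minimal Python sketch. It rests on a simplifying assumption that is not taken from the ASJP papers: "at least two identical sound segments" is read here as two shared symbols regardless of their position in the word, whereas the published criteria may be more specific. The function names and the toy word lists are hypothetical.

```python
from collections import Counter


def shared_segments(a, b):
    """Count symbols the two words have in common (multiset overlap).

    Assumption: position in the word is ignored.
    """
    return sum((Counter(a) & Counter(b)).values())


def judged_similar(a, b, threshold=2):
    """Judge two words similar if they share at least `threshold` symbols."""
    return shared_segments(a, b) >= threshold


def percent_similar(list_a, list_b):
    """Percentage of shared meanings whose words are judged similar.

    list_a and list_b map a meaning to one transcribed word each.
    """
    meanings = set(list_a) & set(list_b)
    if not meanings:
        raise ValueError("the two word lists share no meanings")
    hits = sum(judged_similar(list_a[m], list_b[m]) for m in meanings)
    return 100.0 * hits / len(meanings)


# Toy data, invented for illustration only.
lang_a = {"meaning1": "kat", "meaning2": "mun"}
lang_b = {"meaning1": "kot", "meaning2": "pil"}
print(percent_similar(lang_a, lang_b))
```

With the toy lists, one of the two word pairs shares two symbols, so half of the compared words count as similar.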

ASJP Consortium

The ASJP Consortium, founded around 2008, came to involve around 25 professional linguists and other interested parties working as volunteer transcribers or supporting the project in other ways. The main driving force behind the founding of the consortium was Cecil H. Brown. Søren Wichmann is the project's day-to-day curator. A third central member of the consortium is Eric W. Holman, who has created most of the software used in the project.

Shorter word lists

While the word lists were originally based on the 100-item Swadesh list, it was statistically determined that a subset of 40 of the 100 items produced classificatory results that were just as good as, if not slightly better than, those from the whole list. [9] Word lists gathered since then therefore contain only 40 items (or fewer, when some items are unattested).

Levenshtein distance

In papers published since 2008, ASJP has employed a similarity judgment program based on Levenshtein distance (LD). This approach was found to produce better classificatory results, measured against expert opinion, than the method used initially. LD is defined as the minimum number of successive changes necessary to convert one word into another, where each change is the insertion, deletion, or substitution of a symbol. Within the Levenshtein approach, differences in word length can be corrected for by dividing LD by the number of symbols in the longer of the two compared words. This produces normalized LD (LDN). A second measure, LDN divided (LDND), is calculated for a pair of languages by dividing the average LDN for all word pairs with the same meaning by the average LDN for all word pairs with different meanings. This second normalization is intended to correct for chance similarity. [10]
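
The three measures defined above can be sketched in Python as follows. This is an illustrative sketch, not the ASJP software: it assumes each word is a plain string in which every character is one symbol (ASJPcode modifiers are ignored), that each list gives exactly one word per meaning, and that averages are taken over the meanings both lists share. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a, b):
    """Normalized LD: LD divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a, list_b):
    """LDN divided: mean same-meaning LDN over mean different-meaning LDN.

    list_a and list_b map a meaning to one word each (an assumption).
    """
    meanings = sorted(set(list_a) & set(list_b))
    same = [ldn(list_a[m], list_b[m]) for m in meanings]
    diff = [ldn(list_a[m], list_b[n])
            for m in meanings for n in meanings if m != n]
    if not same or not diff:
        raise ValueError("need at least two shared meanings")
    mean_diff = sum(diff) / len(diff)
    if mean_diff == 0:
        raise ValueError("different-meaning pairs are all identical")
    return (sum(same) / len(same)) / mean_diff


# Toy data, invented for illustration only.
lang_a = {"x": "ab", "y": "cd"}
lang_b = {"x": "ab", "y": "ce"}
print(ldn("cd", "ce"))       # 0.5
print(ldnd(lang_a, lang_b))  # 0.25
```

In the toy example the same-meaning pairs average an LDN of 0.25 while the different-meaning pairs average 1.0, so the LDND is 0.25. A value near 1 would indicate that same-meaning words are no more alike than unrelated words.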

Word list

ASJP uses the following 40-word list. [11] It is similar to the Swadesh–Yakhontov list but differs from it in some items.

Body parts
Animals and plants
People
Nature
Verbs and adjectives
Numerals and pronouns

ASJPcode

The 2016 version of ASJP [citation needed] uses the following symbols to encode phonemes: p b f v m w 8 t d s z c n r l S Z C j T 5 y k g x N q X h 7 L 4 G ! i e E 3 a u o

They represent 7 vowels and 34 consonants, all found on the standard QWERTY keyboard.

Sounds represented by ASJPcode [2]
| ASJPcode | Description | IPA |
|---|---|---|
| i | high front vowel, rounded and unrounded | i, ɪ, y, ʏ |
| e | mid front vowel, rounded and unrounded | e, ø |
| E | low front vowel, rounded and unrounded | a, æ, ɛ, ɶ, œ |
| 3 | high and mid central vowel, rounded and unrounded | ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ |
| a | low central vowel, unrounded | ɐ |
| u | high back vowel, rounded and unrounded | ɯ, u |
| o | mid and low back vowel, rounded and unrounded | ɤ, ʌ, ɑ, o, ɔ, ɒ |
| p | voiceless bilabial stop and fricative | p, ɸ |
| b | voiced bilabial stop and fricative | b, β |
| m | bilabial nasal | m |
| f | voiceless labiodental fricative | f |
| v | voiced labiodental fricative | v |
| 8 | voiceless and voiced dental fricative | θ, ð |
| 4 | dental nasal | n̪ |
| t | voiceless alveolar stop | t |
| d | voiced alveolar stop | d |
| s | voiceless alveolar fricative | s |
| z | voiced alveolar fricative | z |
| c | voiceless and voiced alveolar affricate | t͡s, d͡z |
| n | voiceless and voiced alveolar nasal | n |
| S | voiceless postalveolar fricative | ʃ |
| Z | voiced postalveolar fricative | ʒ |
| C | voiceless palato-alveolar affricate | t͡ʃ |
| j | voiced palato-alveolar affricate | d͡ʒ |
| T | voiceless and voiced palatal stop | c, ɟ |
| 5 | palatal nasal | ɲ |
| k | voiceless velar stop | k |
| g | voiced velar stop | ɡ |
| x | voiceless and voiced velar fricative | x, ɣ |
| N | velar nasal | ŋ |
| q | voiceless uvular stop | q |
| G | voiced uvular stop | ɢ |
| X | voiceless and voiced uvular fricative, voiceless and voiced pharyngeal fricative | χ, ʁ, ħ, ʕ |
| 7 | voiceless glottal stop | ʔ |
| h | voiceless and voiced glottal fricative | h, ɦ |
| l | voiced alveolar lateral approximant | l |
| L | all other laterals | ʟ, ɭ, ʎ |
| w | voiced bilabial-velar approximant | w |
| y | palatal approximant | j |
| r | voiced apico-alveolar trill and all varieties of “r-sounds” | r, ʀ, etc. |
| ! | all varieties of “click-sounds” | ǃ, ǀ, ǁ, ǂ |

A ~ mark follows two consonants so that they are considered to be in the same position. Thus, kʷat becomes kw~at. Syllables like kat, wat, kaw and kwi are considered lexically similar to kw~at.

Similarly, a $ mark follows three consonants so that they are considered to be in the same position. For example, ndy$im is considered similar to nim, dam and yim.

" marks the preceding consonant as glottalized.

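The three modifiers described above can be illustrated with a small Python sketch that splits an ASJPcode string into positions. It handles only the ~, $ and " marks mentioned in this section and treats every other character as one symbol. The function name asjp_positions is hypothetical and is not part of any ASJP software.

```python
def asjp_positions(word):
    """Split an ASJPcode string into positions.

    ~ merges the two preceding symbols into one position,
    $ merges the three preceding symbols into one position,
    " attaches to the preceding symbol (glottalization).
    """
    positions = []
    for ch in word:
        if ch == "~":
            if len(positions) < 2:
                raise ValueError("~ must follow two symbols")
            last = positions.pop()
            positions[-1] += last
        elif ch == "$":
            if len(positions) < 3:
                raise ValueError("$ must follow three symbols")
            last = positions.pop()
            middle = positions.pop()
            positions[-1] += middle + last
        elif ch == '"':
            if not positions:
                raise ValueError('" must follow a symbol')
            positions[-1] += '"'
        else:
            positions.append(ch)
    return positions


print(asjp_positions("kw~at"))   # ['kw', 'a', 't']
print(asjp_positions("ndy$im"))  # ['ndy', 'i', 'm']
print(asjp_positions('k"at'))    # ['k"', 'a', 't']
```

Grouping symbols this way makes the examples in the text concrete: kw~at has three positions, so it lines up position by position with three-symbol words such as kat or wat.
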
References

  1. "The ASJP Database". asjp.clld.org. Retrieved February 15, 2024.
  2. Brown, Cecil H.; Holman, Eric W.; Wichmann, Søren; Velupillai, Viveka (2008). "Automated classification of the world's languages: A description of the method and preliminary results". STUF – Language Typology and Universals.
  3. "Automated dating of the world's language families based on lexical similarity" (PDF). pubman.mpdl.mpg.de. 2011.
  4. "Homelands of the world's language families: A quantitative approach". www.researchgate.net. 2010.
  5. Wichmann, Søren; Holman, Eric W.; Brown, Cecil H. (April 2010). "Sound Symbolism in Basic Vocabulary". Entropy. 12 (4): 844–858. doi:10.3390/e12040844. ISSN 1099-4300.
  6. Pompei, Simone; Loreto, Vittorio; Tria, Francesca (June 3, 2011). "On the Accuracy of Language Trees". PLOS ONE. 6 (6): e20109. arXiv:1103.4012. Bibcode:2011PLoSO...620109P. doi:10.1371/journal.pone.0020109. ISSN 1932-6203. PMC 3108590. PMID 21674034.
  7. Cf. comments by Adelaar, Blust and Campbell in Holman, Eric W., et al. (2011). "Automated Dating of the World’s Language Families Based on Lexical Similarity". Current Anthropology. 52 (6): 841–875.
  8. "Cross-Linguistic Linked Data". Retrieved February 22, 2020.
  9. Holman, Eric W.; Wichmann, Søren; Brown, Cecil H.; Velupillai, Viveka; Müller, André; Bakker, Dik (2008). "Explorations in automated language classification". Folia Linguistica.
  10. Wichmann, Søren; Holman, Eric W.; Bakker, Dik; Brown, Cecil H. (2010). "Evaluating linguistic distance measures". Physica A. 389: 3632–3639. doi:10.1016/j.physa.2010.05.011.
  11. "Guidelines" (PDF). asjp.clld.org.