Lexicostatistics

Lexicostatistics is a method of comparative linguistics that involves comparing the percentage of lexical cognates between languages to determine their relationship. Lexicostatistics is related to the comparative method but does not reconstruct a proto-language. It is to be distinguished from glottochronology, which attempts to use lexicostatistical methods to estimate the length of time since two or more languages diverged from a common earlier proto-language. This is merely one application of lexicostatistics, however; other applications of it may not share the assumption of a constant rate of change for basic lexical items.

The term "lexicostatistics" is misleading in that mathematical equations are used but not statistics. Features of a language other than the lexicon may be used, though this is unusual. Whereas the comparative method uses identified shared innovations to determine sub-groups, lexicostatistics does not identify these. Lexicostatistics is a distance-based method, whereas the comparative method considers language characters directly. Lexicostatistics is a simple and fast technique relative to the comparative method but has limitations (discussed below). It can be validated by cross-checking the trees produced by both methods.

History

Lexicostatistics was developed by Morris Swadesh in a series of articles in the 1950s, based on earlier ideas. [1] [2] [3] The concept's first known use was by Dumont d'Urville in 1834, who compared various "Oceanic" languages and proposed a method for calculating a coefficient of relationship. Hymes (1960) and Embleton (1986) both review the history of lexicostatistics. [4] [5]

Method

Create word list

The aim is to generate a list of universally used meanings (hand, mouth, sky, I). Words are then collected for these meaning slots for each language being considered. Swadesh originally reduced a larger set of meanings down to 200. He later found that it was necessary to reduce it further but that he could include some meanings that were not in his original list, giving his later 100-item list. The Swadesh list in Wiktionary gives the combined total of 207 meanings in a number of languages. Alternative lists that apply more rigorous criteria have been generated, e.g. the Dolgopolsky list and the Leipzig–Jakarta list, as well as lists with a more specific scope; for example, Dyen, Kruskal and Black have 200 meanings for 84 Indo-European languages in digital form. [6]

Determine cognacies

A trained and experienced linguist is needed to make cognacy decisions. The decisions may need to be refined as the state of knowledge increases, but lexicostatistics does not rely on all of them being correct. For each pair of words (in different languages) in this list, the cognacy of a form could be positive, negative or indeterminate. Sometimes a language has multiple words for one meaning, e.g. "small" and "little" for "not big".
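One possible way to record such decisions is sketched below. The three-valued `Cognacy` type, the `word_lists` and `decisions` structures, and the `tally` helper are all hypothetical names chosen for illustration, and the two-language data is a toy example rather than a real dataset.

```python
from enum import Enum


class Cognacy(Enum):
    """The three possible outcomes of a cognacy decision."""
    POSITIVE = "positive"
    NEGATIVE = "negative"
    INDETERMINATE = "indeterminate"


# Hypothetical word lists: each meaning slot maps to one or more forms,
# so a language with several words for one meaning can list them all.
word_lists = {
    "English": {"not big": ["small", "little"], "hand": ["hand"]},
    "German": {"not big": ["klein"], "hand": ["Hand"]},
}

# Illustrative decisions, keyed by (language, language, meaning).
decisions = {
    ("English", "German", "hand"): Cognacy.POSITIVE,
    ("English", "German", "not big"): Cognacy.NEGATIVE,
}


def tally(decisions, lang_a, lang_b):
    """Count decisions of each kind for one language pair (order-insensitive)."""
    counts = {outcome: 0 for outcome in Cognacy}
    for (a, b, _meaning), verdict in decisions.items():
        if {a, b} == {lang_a, lang_b}:
            counts[verdict] += 1
    return counts


counts = tally(decisions, "German", "English")
```

Keeping indeterminate decisions as their own value, rather than forcing them to positive or negative, lets them be excluded later when percentages are calculated.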

Calculate lexicostatistic percentages

The lexicostatistic percentage for a particular language pair is the proportion of meanings that are cognate, calculated relative to the total number of meanings excluding those whose cognacy is indeterminate. This value is entered into an N×N table of distances, where N is the number of languages being compared. When completed, this table is half-filled in triangular form. The higher the proportion of cognates, the more closely the languages are related.
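The calculation can be sketched as follows, assuming cognacy judgments are already available. The language names `A`, `B`, `C`, the `judgments` data and both function names are made up for illustration; `None` stands for an indeterminate decision and is left out of the total.

```python
from itertools import combinations

# Hypothetical judgments per language pair and meaning:
# True = cognate, False = not cognate, None = indeterminate.
judgments = {
    ("A", "B"): {"hand": True, "mouth": True, "sky": False, "I": None},
    ("A", "C"): {"hand": True, "mouth": False, "sky": False, "I": False},
    ("B", "C"): {"hand": True, "mouth": False, "sky": None, "I": None},
}


def cognate_percentage(pair_judgments):
    """Percentage of cognate meanings, ignoring indeterminate ones."""
    decided = [v for v in pair_judgments.values() if v is not None]
    if not decided:
        return None  # nothing usable to compare for this pair
    return 100.0 * sum(decided) / len(decided)


def percentage_table(languages, judgments):
    """Build a half-filled (triangular) table: table[later][earlier]."""
    table = {lang: {} for lang in languages}
    for i, j in combinations(range(len(languages)), 2):
        a, b = languages[i], languages[j]
        pair = judgments.get((a, b), judgments.get((b, a), {}))
        table[b][a] = cognate_percentage(pair)
    return table


languages = ["A", "B", "C"]
table = percentage_table(languages, judgments)
```

In this toy data, A and B share two cognates out of three decided meanings, so they come out closer to each other than either is to C.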

Create family tree

Creation of the language tree is based solely on the table found above. Various sub-grouping methods can be used but that adopted by Dyen, Kruskal and Black was:

Nucleus and group lexical percentages have to be calculated.
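As an illustration of how a tree can be derived from the table alone, the sketch below uses generic average-linkage agglomerative clustering: the two closest clusters are merged repeatedly until one remains. This is a stand-in technique and not the Dyen, Kruskal and Black procedure; the percentages and the `build_tree` function are hypothetical.

```python
def build_tree(percentages):
    """Average-linkage clustering on a table of cognate percentages.

    percentages maps frozenset({a, b}) -> percentage for every language pair.
    Returns the tree as nested tuples.
    """
    languages = sorted({lang for pair in percentages for lang in pair})
    # Map each cluster (a name or a nested tuple) to its member languages.
    clusters = {lang: [lang] for lang in languages}

    def similarity(x, y):
        # Mean percentage over all cross-cluster language pairs.
        values = [percentages[frozenset((a, b))]
                  for a in clusters[x] for b in clusters[y]]
        return sum(values) / len(values)

    while len(clusters) > 1:
        keys = list(clusters)
        candidates = [(x, y) for i, x in enumerate(keys) for y in keys[i + 1:]]
        x, y = max(candidates, key=lambda pair: similarity(*pair))
        # Merge the most similar pair into a new cluster.
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Hypothetical percentages for four languages.
percentages = {
    frozenset(("A", "B")): 80.0,
    frozenset(("C", "D")): 70.0,
    frozenset(("A", "C")): 30.0,
    frozenset(("A", "D")): 25.0,
    frozenset(("B", "C")): 35.0,
    frozenset(("B", "D")): 20.0,
}
tree = build_tree(percentages)
```

With these numbers the result groups A with B and C with D before joining the two pairs, which matches the intuition that higher percentages indicate closer relationship.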

Applications

A leading exponent of the application of lexicostatistics was Isidore Dyen. [7] [8] [9] [10] He used lexicostatistics to classify Austronesian languages [11] as well as Indo-European ones. [6] A major study of the latter was reported by Dyen, Kruskal and Black (1992). [6] Studies have also been carried out on Amerindian and African languages.

Pama-Nyungan

The problem of internal branching within the Pama-Nyungan language family has been a long-standing issue for Australianist linguistics, and general consensus held that internal connections between the 25+ different subgroups of Pama-Nyungan were either impossible to reconstruct or that the subgroups were not in fact genetically related at all. [12] In 2012, Claire Bowern and Quentin Atkinson published the results of their application of computational phylogenetic methods to 194 doculects representing all major subgroups and isolates of Pama-Nyungan. [13] Their model "recovered" many of the branches and divisions that had previously been proposed and accepted by many other Australianists, while also providing some insight into the more problematic branches, such as Paman (which is complicated by the lack of data) and Ngumpin-Yapa (where the genetic picture is obscured by very high rates of borrowing between languages). Their dataset is the largest of its kind for a hunter-gatherer language family, and the second largest overall after Austronesian (Greenhill et al. 2008). They conclude that Pama-Nyungan languages are in fact not an exception to lexicostatistical methods, which have been applied successfully to other language families of the world.

Criticisms

Researchers such as Hoijer (1956) have shown that there are difficulties in finding equivalents for the meaning items, while many have found it necessary to modify Swadesh's lists. [14] Gudschinsky (1956) questioned whether it was possible to obtain a universal list. [15]

Factors such as borrowing, tradition and taboo can skew the results, as with other methods. Lexicostatistics has sometimes been applied using lexical similarity rather than cognacy to find resemblances. This is then equivalent to mass comparison.

The choice of meaning slots is subjective, as is the choice of synonyms.

Improved methods

Some modern computational methods of statistical hypothesis testing can be regarded as improvements on lexicostatistics in that they use similar word lists and distance measures.


References

  1. Swadesh, Morris (1955). "Towards greater accuracy in lexicostatistical dating". International Journal of American Linguistics. 21 (2): 121–137. doi:10.1086/464321. S2CID 144581963.
  2. Swadesh, Morris (1952). "Lexicostatistical dating of prehistoric ethnic contacts". Proceedings of the American Philosophical Society. 96: 452–463.
  3. Swadesh, Morris (1950). "Salish internal relationships". International Journal of American Linguistics. 16 (4): 157–167. doi:10.1086/464084. S2CID 145122561.
  4. Hymes, Dell (1960). "Lexicostatistics so far". Current Anthropology. 1 (1): 3–44. doi:10.1086/200074. S2CID 144569209.
  5. Embleton, Sheila (1986). Statistics in Historical Linguistics. Bochum.
  6. Dyen, Isidore; Kruskal, Joseph; Black, Paul (1992). "An Indoeuropean Classification, a Lexicostatistical Experiment". Transactions of the American Philosophical Society. 82 (5): iii–132. doi:10.2307/1006517. JSTOR 1006517.
  7. Dyen, Isidore (1962). "The lexicostatistically determined relationship of a language group". International Journal of American Linguistics. 28 (3): 153–161. doi:10.1086/464687. S2CID 143070513.
  8. Dyen, Isidore (1963). "Lexicostatistically determined borrowing and taboo". Language. 39 (1): 60–66. doi:10.2307/410762. JSTOR 410762.
  9. Dyen, Isidore, ed. (1973). Lexicostatistics in Genetic Linguistics. The Hague: Mouton.
  10. Dyen, Isidore (1975). Linguistic Subgrouping and Lexicostatistics. The Hague: Mouton.
  11. Dyen, Isidore (1965). "A lexicostatistical classification of the Austronesian languages". International Journal of American Linguistics. 19.
  12. Dixon, Robert M.W. (2002). Australian languages: their nature and development. Cambridge University Press. pp. 48, 53. Australia provides a prototypical instance of a linguistic area. It has considerable time-depth, fairly uniform terrain leading to ease of interaction and communication, a fair proportion of reciprocal exogamous marriages, rampant multilingualism, and an open attitude to borrowing ... There is a basic uniformity to Australian languages which is the natural result of a long period of diffusion. Although no justification had been provided for 'Pama-Nyungan', it came to be accepted. People accepted it because it was accepted—as a species of belief. ... It is clear that 'Pama-Nyungan' cannot be supported as a genetic group. Nor is it a useful typological grouping.
  13. Bowern, Claire; Atkinson, Quentin (2012). "Computational phylogenetics and the internal structure of Pama-Nyungan". Language. 88 (4): 817–845. doi:10.1353/lan.2012.0081. hdl:1885/61360. S2CID 4375648.
  14. Hoijer, Harry (1956). "Lexicostatistics: a critique". Language. 32 (1): 49–60. doi:10.2307/410652. JSTOR 410652.
  15. Gudschinsky, Sarah (1956). "The ABCs of lexicostatistics (glottochronology)". Word. 12 (2): 175–210. doi:10.1080/00437956.1956.11659599.
