Lexical similarity

In linguistics, lexical similarity is a measure of the degree to which the word sets of two given languages are similar. A lexical similarity of 1 (or 100%) would mean a total overlap between vocabularies, whereas 0 means there are no common words.

Lexical similarity can be defined in different ways, and the results vary accordingly. For example, Ethnologue's method compares a regionally standardized wordlist (comparable to the Swadesh list) across the two languages and counts the forms that are similar in both form and meaning. By this method, English has a lexical similarity of 60% with German and 27% with French.
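The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ethnologue's procedure: the judgement that two forms are "similar" is replaced here by a normalized edit-distance threshold, and the function names, the 0.6 threshold, and the four-word toy lists are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def forms_similar(a: str, b: str, threshold: float = 0.6) -> bool:
    """Assumed stand-in for a linguist's judgement of similarity in form."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - edit_distance(a, b) / longest >= threshold


def lexical_similarity(list1: dict, list2: dict, threshold: float = 0.6) -> float:
    """Share of meanings found in both wordlists whose forms count as similar.

    Each wordlist maps a meaning (the wordlist entry) to that language's form,
    so similarity in meaning is given by construction.
    """
    shared = list1.keys() & list2.keys()
    if not shared:
        raise ValueError("the wordlists have no meanings in common")
    hits = sum(forms_similar(list1[m], list2[m], threshold) for m in shared)
    return hits / len(shared)


# Toy wordlists (hypothetical, far too short for a real comparison).
ENGLISH = {"water": "water", "hand": "hand", "dog": "dog", "tree": "tree"}
GERMAN = {"water": "wasser", "hand": "hand", "dog": "hund", "tree": "baum"}

print(lexical_similarity(ENGLISH, GERMAN))
```

A score computed this way depends heavily on the wordlist and on the similarity criterion, which is one reason published figures differ between sources.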

Lexical similarity can be used to evaluate the degree of genetic relationship between two languages. Percentages higher than 85% usually indicate that the two languages being compared are likely to be related dialects. [1]

Lexical similarity is only one indicator of the mutual intelligibility of two languages, since the latter also depends on the degree of phonetic, morphological, and syntactic similarity. The choice of wordlist also affects the result. For example, the lexical similarity between French and English is considerable in lexical fields relating to culture, whereas it is smaller for basic (function) words. Unlike mutual intelligibility, lexical similarity can only be symmetrical.

Indo-European languages

The table below shows some lexical similarity values for pairs of selected Romance, Germanic, and Slavic languages, as collected and published by Ethnologue. [2]

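Lexical similarity coefficients (rows: Language 1; columns: Language 2)

| Lang. code | Language 1 | Italian (ita) | Spanish (spa) | Portuguese (por) | French (fra) | Romanian (ron) | Catalan (cat) | Romansh (roh) | Sardinian (srd) | English (eng) | German (deu) | Russian (rus) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ita | Italian | 1 | 0.82 | 0.80 | 0.89 | 0.77 | 0.87 | 0.78 | 0.85 | - | - | - |
| spa | Spanish | 0.82 | 1 | 0.89 | 0.75 | 0.71 | 0.85 | 0.74 | 0.76 | - | - | - |
| por | Portuguese | 0.80 | 0.89 | 1 | 0.75 | 0.72 | 0.85 | 0.74 | 0.76 | - | - | - |
| fra | French | 0.89 | 0.75 | 0.75 | 1 | 0.75 | - | 0.78 | 0.80 | 0.27 | 0.29 | - |
| ron | Romanian | 0.77 | 0.71 | 0.72 | 0.75 | 1 | 0.73 | 0.72 | 0.74 | - | - | - |
| cat | Catalan | 0.87 | 0.85 | 0.85 | - | 0.73 | 1 | 0.76 | 0.75 | - | - | - |
| roh | Romansh | 0.78 | 0.74 | 0.74 | 0.78 | 0.72 | 0.76 | 1 | 0.74 | - | - | - |
| srd | Sardinian | 0.85 | 0.76 | 0.76 | 0.80 | 0.74 | 0.75 | 0.74 | 1 | - | - | - |
| eng | English | - | - | - | 0.27 | - | - | - | - | 1 | 0.60 | 0.24 |
| deu | German | - | - | - | 0.29 | - | - | - | - | 0.60 | 1 | - |
| rus | Russian | - | - | - | - | - | - | - | - | 0.24 | - | 1 |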
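Because the coefficients are symmetric, each pair needs to be stored only once. The following is a minimal sketch of such a lookup, using a handful of values copied from the table above; the names `SIMILARITY` and `similarity` are hypothetical, and pairs without a listed value return `None`.

```python
from typing import Optional

# Unordered language pairs (ISO 639-3 codes) mapped to their coefficient.
# Only a few entries from the table are included here.
SIMILARITY = {
    frozenset({"ita", "spa"}): 0.82,
    frozenset({"ita", "fra"}): 0.89,
    frozenset({"spa", "por"}): 0.89,
    frozenset({"eng", "deu"}): 0.60,
    frozenset({"eng", "fra"}): 0.27,
    frozenset({"eng", "rus"}): 0.24,
}


def similarity(code1: str, code2: str) -> Optional[float]:
    """Return the stored coefficient for a pair, regardless of argument order."""
    if code1 == code2:
        return 1.0  # a language overlaps fully with itself (the table's diagonal)
    return SIMILARITY.get(frozenset({code1, code2}))


print(similarity("deu", "eng"))
```

Keying on a `frozenset` makes the order of the two codes irrelevant, which mirrors the symmetry of the measure.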

References

Notes

  1. "About the Ethnologue". Ethnologue. 2012-09-25. Retrieved 2019-02-24.
  2. See, for instance, lexical similarity data for French, German, English
  3. Bolognesi, Roberto; Heeringa, Wilbert (2005). Sardegna fra tante lingue (PDF). Condaghes. p. 123. Archived from the original (PDF) on 2014-02-11. Retrieved 2017-04-14.
  4. Finkenstaedt, Thomas; Wolff, Dieter (1973). Ordered profusion; studies in dictionaries and the English lexicon. C. Winter. ISBN 3-533-02253-6.
  5. Williams, Joseph M. Origins of the English Language. Amazon.com. Retrieved 2010-04-21.
  6. Nation, I.S.P. (2001). Learning Vocabulary in Another Language. Cambridge University Press. p. 477. ISBN 0-521-80498-1.