Lexical diversity

Lexical diversity is one aspect of 'lexical richness' and refers to the ratio of unique word stems (types) to the total number of words (tokens). The term is used in applied linguistics and is quantified using a variety of measures, including the Type-Token Ratio (TTR), vocd, [1] and the measure of textual lexical diversity (MTLD). [2]

A common problem with lexical diversity measures, especially TTR, is that text samples containing a large number of tokens yield lower TTR values, since the writer or speaker must re-use many function words. One consequence is that lexical diversity measures are best used to compare texts of equal length. [3] Newer measures of lexical diversity attempt to correct for this sensitivity to text length. A survey of such measures is provided in a 2024 article by Yves Bestgen. [4]
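
To make the length effect concrete, the following sketch (plain Python with a naive whitespace tokenizer and an invented toy text, both simplifying assumptions) computes TTR over successively longer prefixes of the same text; the value typically falls as the sample grows, because every repeated function word lowers the ratio:

    def type_token_ratio(tokens):
        """Ratio of unique word forms (types) to total words (tokens)."""
        return len(set(tokens)) / len(tokens)

    text = ("the cat sat on the mat and the dog sat on the rug "
            "while the cat watched the dog and the dog watched the cat")
    tokens = text.split()  # naive whitespace tokenization

    # TTR over longer and longer prefixes of the same text:
    for n in (6, 12, 18, 24):
        print(n, round(type_token_ratio(tokens[:n]), 2))  # 0.83, 0.58, 0.56, 0.42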

Definitions

In a 2013 article, Scott Jarvis proposed that lexical diversity, like diversity in ecology, is a perceptual phenomenon: lexical diversity is the positive counterpart of lexical redundancy, in the same way that lexical variability is the mirror image of repetition. According to Jarvis's model, lexical diversity comprises variability, volume, evenness, rarity, dispersion and disparity. [5]

According to Jarvis, the six properties of lexical diversity should be measured by the following indices.

Property     Measure
Variability  Measure of Textual Lexical Diversity (MTLD)
Volume       Total number of words in the text
Evenness     Standard deviation of tokens per type
Rarity       Mean BNC rank
Dispersion   Mean distance between tokens of the same type
Disparity    Mean number of words per sense, or Latent Semantic Analysis
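
A rough sketch of the MTLD index used for the variability property may help. Following McCarthy's description, the text is read token by token while a running TTR is maintained; each time the TTR falls to a threshold (conventionally 0.72), a 'factor' is counted and the running TTR restarts, a partial factor is credited for the leftover stretch, and the score is the token count divided by the factor count, averaged over a forward and a backward pass. This is an illustrative simplification, not a reference implementation:

    def mtld_pass(tokens, threshold=0.72):
        """One directional MTLD pass: tokens per TTR 'factor'."""
        factors, types, count = 0.0, set(), 0
        for tok in tokens:
            types.add(tok)
            count += 1
            if len(types) / count <= threshold:
                factors += 1              # a full factor is complete
                types, count = set(), 0   # restart the running TTR
        if count:                         # credit the leftover partial factor
            ttr = len(types) / count
            factors += (1 - ttr) / (1 - threshold)
        return len(tokens) / factors if factors else float("inf")

    def mtld(tokens):
        # average the forward and backward passes
        return (mtld_pass(tokens) + mtld_pass(tokens[::-1])) / 2

Unlike TTR, the score is expressed in tokens per factor, so texts of different lengths can be compared more directly.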

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language, and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically, the data are collected in text corpora and processed with rule-based, statistical or neural approaches from machine learning and deep learning.

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real-world" texts of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.

Readability is the ease with which a reader can understand a written text. The concept exists in both natural language and programming languages, though in different forms. In natural language, the readability of text depends on its content and its presentation. In programming, things such as programmer comments, choice of loop structure, and choice of names can determine the ease with which humans can read computer program code.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.
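
Concretely, perplexity is the exponentiated entropy of the distribution, so a uniform distribution over k outcomes has perplexity k. A minimal sketch, with invented toy distributions:

    import math

    def perplexity(probs):
        """Perplexity = 2 ** H(p), with the entropy H measured in bits."""
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        return 2 ** entropy

    print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0: like guessing among 4 equally likely values
    print(perplexity([0.7, 0.1, 0.1, 0.1]))      # ~2.56: the outcome is easier to guess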

Language attrition is the process of decreasing proficiency in or losing a language. For first or native language attrition, this process is generally caused by both isolation from speakers of the first language ("L1") and the acquisition and use of a second language ("L2"), which interferes with the correct production and comprehension of the first. Such interference from a second language is probably experienced to some extent by all bilinguals, but is most evident among speakers for whom a language other than their first has started to play an important, if not dominant, role in everyday life; these speakers are more likely to experience language attrition. It is common among immigrants who move to countries where languages foreign to them are used. Second language attrition can also occur through poor learning, practice, and retention of the language after time has passed since learning; this often happens with bilingual speakers who do not frequently engage with their L2.

A lexical chain is a sequence of semantically related words in a text, spanning a narrow or wide context window. A lexical chain is independent of the grammatical structure of the text; in effect, it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable disambiguation of the concepts that the term represents.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
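
As a toy illustration of the distributional hypothesis, words can be compared through the contexts in which they occur. The sketch below (with an invented three-sentence corpus and a one-word context window, both assumptions for illustration) builds co-occurrence vectors and compares two words by cosine similarity:

    import math
    from collections import Counter

    SENTENCES = [
        "the cat drinks milk",
        "the dog drinks water",
        "the cat chases the dog",
    ]

    def context_vector(word, window=1):
        """Count the words appearing within `window` positions of `word`."""
        counts = Counter()
        for sentence in SENTENCES:
            toks = sentence.split()
            for i, tok in enumerate(toks):
                if tok == word:
                    lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                    for j in range(lo, hi):
                        if j != i:
                            counts[toks[j]] += 1
        return counts

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u)
        norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
        return dot / (norm(u) * norm(v))

    # "cat" and "dog" occur in similar contexts ("the", "drinks"), so they score high:
    print(round(cosine(context_vector("cat"), context_vector("dog")), 2))  # ~0.91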

Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so that its grammar and structure are greatly simplified while the underlying meaning and information remain the same. Text simplification is an important area of research because of communication needs in an increasingly complex and interconnected world dominated by science, technology, and new media. Natural human languages, however, pose huge problems because they ordinarily contain large vocabularies and complex constructions that machines, no matter how fast and well-programmed, cannot easily process. Researchers have found that methods of semantic compression can reduce this linguistic diversity by limiting and simplifying the set of words used in given texts.

Lexical density is a concept in computational linguistics that measures the structure and complexity of human communication in a language. It estimates the linguistic complexity of a written or spoken composition from its function words and content words. One method of calculating lexical density is to compute the ratio of lexical items to the total number of words. Another is to compute the ratio of lexical items to the number of higher structural items in a composition, such as the total number of clauses in its sentences.
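
A minimal sketch of the first method, assuming a small hand-written stop list of function words in place of a real part-of-speech tagger:

    FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but",
                      "of", "in", "on", "to", "is", "was"}

    def lexical_density(tokens):
        """Share of content (lexical) words among all tokens."""
        content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
        return len(content) / len(tokens)

    print(lexical_density("the cat sat on the mat".split()))  # 0.5: cat, sat, mat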

A word list is a list of a language's lexicon within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, and gigantic analyses were still done by hand in the mid-20th century, electronic natural language processing of large corpora, such as movie subtitles, has accelerated the research field.

Paul Nation is an internationally recognized scholar in the field of linguistics and teaching methodology. As a professor of applied linguistics with a specialization in pedagogical methodology, he has created a language teaching framework that identifies key areas of language teaching focus. Nation is best known for this framework, which has been labelled The Four Strands. He has also made notable contributions through his research on language acquisition, focusing on the benefits of extensive reading and repetition as well as intensive reading. Nation's numerous contributions to the linguistics research community through his published work have allowed him to share his knowledge and experience so that others may adopt and adapt it. He is credited with bringing "legitimization to second language vocabulary research" in 1990.

Michael Hoey (1948–2021) was a British linguist and Baines Professor of English Language. He lectured on applied linguistics in over 40 countries.

In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from the other.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as in semantic parsing and the generation of new samples to expand existing corpora.

Complex dynamic systems theory in the field of linguistics is a perspective and approach to the study of second, third and additional language acquisition. The general term complex dynamic systems theory was recommended by Kees de Bot to refer to both complexity theory and dynamic systems theory.

Marjolijn Verspoor is a Dutch linguist. She is a professor of English language and English as a second language at the University of Groningen, Netherlands. She is known for her work on Complex Dynamic Systems Theory and the application of dynamical systems theory to the study of second language development. Her interests also include second language writing.

Scott Andrew Crossley is an American linguist. He is a professor of applied linguistics at Vanderbilt University, United States. His research focuses on natural language processing and the application of computational tools and machine learning algorithms in learning analytics, including second language acquisition, second language writing, and readability. His main interest area is the development and use of natural language processing tools in assessing writing quality and text difficulty.

Scott Jarvis is an American linguist. He is a Professor of Applied Linguistics at Northern Arizona University, United States. His research addresses second language acquisition broadly, with a special focus on lexical diversity.

References

  1. McCarthy, Phillip; Jarvis, Scott (2007). "vocd: A theoretical and empirical evaluation". Language Testing. 24 (4): 459–488. doi:10.1177/0265532207080767.
  2. McCarthy, Phillip (2005). "An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual lexical diversity (MTLD)". Doctoral dissertation, ProQuest Dissertations and Theses (UMI No. 3199485).
  3. Johansson, V. (2009). "Lexical diversity and lexical density in speech and writing: A developmental perspective". Working Papers in Linguistics.
  4. Bestgen, Yves (2024). "Measuring Lexical Diversity in Texts: The Twofold Length Problem". Language Learning. 74: 638–671. doi:10.1111/lang.12630.
  5. Jarvis, Scott (2013). "Capturing the Diversity in Lexical Diversity". Language Learning. 63: 87–106. doi:10.1111/j.1467-9922.2012.00739.x.