Lexical diversity is one aspect of 'lexical richness' and refers to the ratio of unique word stems (types) to the total number of words (tokens). The term is used in applied linguistics and is quantified using a variety of measures, including the Type-Token Ratio (TTR), vocd, [1] and the measure of textual lexical diversity (MTLD). [2]
A common problem with lexical diversity measures, especially TTR, is that text samples containing a large number of tokens give lower TTR values, since the writer or speaker must often re-use many words. One consequence is the frequent assumption that lexical diversity can only be used to compare texts of the same length. [3] Many measures of lexical diversity nevertheless attempt to account for this sensitivity to text length. Surveys of such measures are provided in Harald Baayen's 2001 book [4] and in more recent work. [5]
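For illustration, a minimal sketch of the TTR calculation in Python, assuming naive lowercasing and whitespace tokenization (published measures typically count stems or lemmas); the longer sample reuses words and therefore receives a lower TTR:

```python
def type_token_ratio(text: str) -> float:
    """TTR: number of unique tokens (types) divided by the total number of tokens."""
    tokens = text.lower().split()  # naive tokenization; real studies usually stem or lemmatize
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

short_sample = "the cat sat on the mat"
long_sample = short_sample + " and then the cat sat on the mat again and again"
print(type_token_ratio(short_sample))  # ~0.83: few repeated words
print(type_token_ratio(long_sample))   # ~0.47: more repetition lowers the ratio
```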
In a 2013 article, Scott Jarvis proposed that lexical diversity, like diversity in ecology, is a perceptual phenomenon: lexical redundancy is the counterpart of lexical diversity in the same way that repetition is the mirror image of variability. According to Jarvis's model, lexical diversity comprises variability, volume, evenness, rarity, dispersion, and disparity. [6]
According to Jarvis, the six properties of lexical diversity should be measured by the following indices; a simplified sketch of the first of these measures (MTLD) follows the table.
| Property | Measure |
| --- | --- |
| Variability | Measure of Textual Lexical Diversity (MTLD) |
| Volume | Total number of words in the text |
| Evenness | Standard deviation of tokens per type |
| Rarity | Mean BNC rank |
| Dispersion | Mean distance between tokens of the same type |
| Disparity | Mean number of words per sense, or Latent Semantic Analysis |
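As a rough illustration of how one of these indices works, the following simplified, forward-only sketch of MTLD segments the text into "factors", stretches over which the running TTR stays above a conventional threshold (commonly 0.72), and divides the total token count by the number of factors; the full measure also averages forward and backward passes. This is a sketch of the general idea, not a reference implementation.

```python
def mtld_forward(tokens, ttr_threshold=0.72):
    """Simplified, forward-only MTLD: token count divided by the number of
    stretches ("factors") whose running TTR stays above the threshold."""
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= ttr_threshold:
            factors += 1           # a full factor is complete; reset the running counts
            types, count = set(), 0
    if count > 0:                  # credit the leftover stretch as a partial factor
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - ttr_threshold)
    if factors == 0:
        return float(len(tokens))  # TTR never dropped below the threshold
    return len(tokens) / factors

sample = "the cat sat on the mat and the dog lay on the rug while the bird flew over the house"
print(mtld_forward(sample.lower().split()))
```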
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with giving computers the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation, and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed using rule-based, statistical, or neural approaches from machine learning and deep learning.
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
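One classical family of approaches, Lesk-style gloss overlap, chooses the sense whose dictionary definition shares the most words with the surrounding context. A toy sketch follows; the sense labels and glosses are invented purely for illustration:

```python
def lesk_style_wsd(context: str, sense_glosses: dict) -> str:
    """Pick the sense whose gloss shares the most words with the context (simplified Lesk overlap)."""
    context_words = set(context.lower().split())

    def overlap(gloss: str) -> int:
        return len(context_words & set(gloss.lower().split()))

    return max(sense_glosses, key=lambda sense: overlap(sense_glosses[sense]))

# Hypothetical glosses for two senses of "bank", for illustration only.
glosses = {
    "bank_financial": "a financial institution that accepts deposits and lends money",
    "bank_river": "the sloping land alongside a river or stream",
}
print(lesk_style_wsd("she deposited her money in the bank and withdrew cash", glosses))  # bank_financial
```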
Readability is the ease with which a reader can understand a written text. The concept exists in both natural language and programming languages though in different forms. In natural language, the readability of text depends on its content and its presentation. In programming, things such as programmer comments, choice of loop structure, and choice of names can determine the ease with which humans can read computer program code.
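As a concrete example of how readability is often quantified for English prose, the Flesch Reading Ease score combines average sentence length with average syllables per word; the syllable counter below is a crude heuristic assumed only for this sketch:

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count runs of consecutive vowels (an assumption for illustration only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat. It was a sunny day."))  # a high score: very easy text
```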
Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.
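A deliberately simplified sketch of the extractive approach, which scores each sentence by the average frequency of its words in the whole text and keeps the top-scoring sentences (real systems use far more sophisticated models):

```python
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 1):
    """Naive extractive summarization: keep the sentences whose words are most frequent overall."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_freq[w] for w in words) / max(1, len(words))

    return sorted(sentences, key=score, reverse=True)[:n_sentences]

print(summarize("The model was trained on a large corpus. "
                "The corpus contained news text. Training took two days."))
```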
In linguistics, productivity is the degree to which speakers of a language use a particular grammatical process, especially in word formation. It compares grammatical processes that are in frequent use to less frequently used ones that tend towards lexicalization. Generally the test of productivity concerns identifying which grammatical forms would be used in the coining of new words: these will tend to only be converted to other forms using productive processes.
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content, as opposed to lexicographical similarity. Such metrics are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts, or instances, through a numerical description obtained by comparing the information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness: semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but it is also related to "road" and "driving".
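Such metrics are often implemented by representing each term or document as a vector (for example, of co-occurrence counts or embedding dimensions) and comparing the vectors. A minimal cosine-similarity sketch; the three-dimensional vectors are invented purely for illustration:

```python
import math

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors, chosen only to illustrate the similarity/relatedness contrast.
car = [0.9, 0.8, 0.1]
bus = [0.85, 0.75, 0.2]
road = [0.3, 0.4, 0.9]

print(cosine_similarity(car, bus))   # high: "car" and "bus" are similar
print(cosine_similarity(car, road))  # lower: "car" and "road" are related but less similar
```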
In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.
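For a discrete distribution, perplexity can be computed as 2 raised to the entropy of the distribution measured in bits; a small sketch:

```python
import math

def perplexity(probs) -> float:
    """Perplexity of a discrete distribution: 2 ** entropy, with entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

print(perplexity([1 / 6] * 6))                          # 6.0: a fair die is hardest to guess
print(perplexity([0.9, 0.02, 0.02, 0.02, 0.02, 0.02]))  # ~1.6: a skewed distribution is easier to guess
```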
A lexical chain is a sequence of semantically related words in writing, spanning a narrow or wide context window. A lexical chain is independent of the grammatical structure of the text; in effect, it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable disambiguation of the concepts that the term represents.
Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
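The distributional hypothesis can be made concrete by counting the contexts in which each word occurs, so that words appearing in similar contexts end up with similar count vectors. A toy sketch over a tiny invented corpus:

```python
from collections import defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build co-occurrence count vectors: each word is described by the words found near it."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    vectors[word][tokens[j]] += 1
    return vectors

corpus = ["the cat drinks milk", "the dog drinks water", "the cat chases the dog"]
vectors = cooccurrence_vectors(corpus)
print(dict(vectors["cat"]))  # context counts for "cat"
print(dict(vectors["dog"]))  # "dog" shares contexts with "cat", so their vectors overlap
```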
Lexical density is a concept in computational linguistics that measures the structure and complexity of human communication in a language. It estimates the linguistic complexity of a written or spoken composition from its function words (grammatical words) and content words (lexical words). One method of calculating lexical density is to compute the ratio of lexical items to the total number of words. Another is to compute the ratio of lexical items to the number of higher structural items in the composition, such as the total number of clauses in the sentences.
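A minimal sketch of the first method, in which a short, hypothetical stop-word list stands in for the full set of function words:

```python
# Hypothetical, abbreviated set of function (grammatical) words, for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "is", "was", "on", "in", "it", "and", "of", "to"}

def lexical_density(text: str) -> float:
    """Ratio of lexical (content) words to the total number of words."""
    tokens = text.lower().split()
    lexical_items = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical_items) / len(tokens) if tokens else 0.0

print(lexical_density("The cat sat on the mat and it purred"))  # 4 content words out of 9 ≈ 0.44
```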
A word list is a list of a language's lexicon within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but it is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, with large-scale analyses still done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles has accelerated the research field.
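A frequency-sorted word list of the kind described above can be derived from a corpus with a simple count; a minimal sketch assuming naive whitespace tokenization:

```python
from collections import Counter

def frequency_list(corpus_texts):
    """Build a word list sorted by descending frequency from an iterable of texts."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(text.lower().split())
    return counts.most_common()

corpus = ["the cat sat on the mat", "the dog chased the cat"]
for word, freq in frequency_list(corpus)[:5]:
    print(word, freq)
```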
Paul Nation is an internationally recognized scholar in the field of linguistics and teaching methodology. As a professor of applied linguistics with a specialization in pedagogical methodology, he has developed a language teaching framework that identifies key areas of language teaching focus. Nation is best known for this framework, which has been labelled The Four Strands. He has also made notable contributions through his research on language acquisition, focusing on the benefits of extensive reading and repetition as well as intensive reading. His numerous contributions to the linguistics research community through his published work have allowed him to share his knowledge and experience so that others may adopt and adapt it. He is credited with bringing "legitimization to second language vocabulary research" in 1990.
SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.
Michael Hoey was a British linguist and Baines Professor of English Language. He lectured in applied linguistics in over 40 countries.
In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text fragment.
Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.
Marjolijn Verspoor is a Dutch linguist. She is a professor of English language and English as a second language at the University of Groningen, Netherlands. She is known for her work on Complex Dynamic Systems Theory and the application of dynamical systems theory to study second language development. Her interest is also in second language writing.
Scott Andrew Crossley is an American linguist. He is a professor of applied linguistics at Vanderbilt University, United States. His research focuses on natural language processing and the application of computational tools and machine learning algorithms in learning analytics including second language acquisition, second language writing, and readability. His main interest area is the development and use of natural language processing tools in assessing writing quality and text difficulty.
Scott Jarvis is an American linguist. He is a Professor of Applied Linguistics at Northern Arizona University, United States. His research addresses second language acquisition broadly, with a special focus on lexical diversity.