Lexical density

Lexical density is a concept in computational linguistics that measures the structure and complexity of human communication in a language. [1] It estimates the linguistic complexity of a written or spoken composition from the proportion of content words (lexical units, lexemes) relative to functional words (grammatical units). One method of calculating lexical density is to compute the ratio of lexical items to the total number of words; another is to compute the ratio of lexical items to the number of higher structural items in the composition, such as the total number of clauses in its sentences. [2] [3]

An individual's lexical density evolves with age, education, communication style, circumstances, injuries or medical conditions, [4] and creativity. The inherent structure of a language, and a person's first language, may influence the lexical density of that person's writing and speaking style. Further, human communication in written form is generally more lexically dense than in spoken form after the early childhood stage. [5] [6] Lexical density affects the readability of a composition and the ease with which a listener or reader can comprehend it. [7] [8] It may also affect the memorability and retention of a sentence and its message. [9]

Discussion

Lexical density is the proportion of content words (lexical items) in a given discourse. It can be measured either as the ratio of lexical items to the total number of words, or as the ratio of lexical items to the number of higher structural items in the sentences (for example, clauses). [2] [3] A lexical item typically carries the real content and includes nouns, verbs, adjectives, and adverbs. A grammatical item typically serves as the functional glue that weaves the content together and includes pronouns, conjunctions, prepositions, determiners, and certain classes of finite verbs and adverbs. [5]

Lexical density is one of the methods used in discourse analysis as a descriptive parameter that varies with register and genre. There are many proposed methods for computing the lexical density of a composition or corpus. Lexical density may be determined as:

$L_d = \frac{N_{\text{lex}}}{N} \times 100$

Where:
$L_d$ = the analysed text's lexical density
$N_{\text{lex}}$ = the number of lexical word tokens (nouns, adjectives, verbs, adverbs) in the analysed text
$N$ = the number of all tokens (total number of words) in the analysed text
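
The formula above is straightforward to implement. The following is a minimal Python sketch of the calculation; the FUNCTION_WORDS set is an illustrative, hypothetical stand-in for the closed-class inventory (or part-of-speech tagger) that an actual study would use to separate grammatical from lexical items, so its output is only approximate.

```python
import re

# Illustrative (incomplete) set of English function words; a real study
# would use a part-of-speech tagger or a fuller closed-class inventory.
FUNCTION_WORDS = {
    "a", "an", "the", "this", "that", "these", "those",
    "i", "you", "he", "she", "it", "we", "they",
    "me", "him", "her", "us", "them",
    "my", "your", "his", "its", "our", "their",
    "and", "or", "but", "so", "because", "if", "while", "although",
    "in", "on", "at", "to", "of", "for", "with", "from", "by", "over",
    "is", "am", "are", "was", "were", "be", "been", "being",
    "do", "does", "did", "have", "has", "had",
    "will", "would", "can", "could", "shall", "should", "may", "might",
    "not", "no", "there", "as", "than",
}

def lexical_density(text: str) -> float:
    """Return lexical density as a percentage: lexical items / all words * 100."""
    words = re.findall(r"[a-zA-Z']+", text.lower())  # keep word tokens, drop punctuation
    if not words:
        return 0.0
    lexical = [w for w in words if w not in FUNCTION_WORDS]
    return 100.0 * len(lexical) / len(words)

print(round(lexical_density("The quick brown fox jumps over the lazy dog"), 1))
# -> 66.7 (six lexical items out of nine words)
```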

Ure lexical density

In 1971, Ure proposed the following formula for computing the lexical density of a text:

$L_d = \frac{\text{number of lexical items}}{\text{total number of words}} \times 100$
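
For example, in the sentence "The quick brown fox jumps over the lazy dog", six of the nine words (quick, brown, fox, jumps, lazy, dog) are lexical items, so $L_d = \frac{6}{9} \times 100 \approx 67\%$.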

Biber terms this ratio the "type-token ratio". [10]

Halliday lexical density

In 1985, Halliday revised the denominator of the Ure formula and proposed the following for computing the lexical density of a text: [1]

$L_d = \frac{\text{number of lexical items}}{\text{total number of clauses}} \times 100$

In some formulations, Halliday's lexical density is computed as a simple ratio, without the "100" multiplier. [2] [1]
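
As an illustration (the clause and item counts here are our own), the sentence "Because she left early, we cancelled the meeting" contains two clauses and four lexical items (left, early, cancelled, meeting), giving a Halliday lexical density of $\frac{4}{2} = 2$ lexical items per clause in the simple-ratio form, or 200 with the multiplier.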

Characteristics

Lexical density measurements may vary for the same composition depending on how a "lexical item" is defined and which items are classified as lexical or grammatical. Any adopted methodology, when applied consistently across various compositions, yields comparable lexical densities for those compositions. Typically, the lexical density of a written composition is higher than that of a spoken composition. [2] [3] According to Ure, written forms of human communication in the English language typically have lexical densities above 40%, while spoken forms tend to have lexical densities below 40%. [2] In a survey of historical texts by Michael Stubbs, the typical lexical density of fictional literature ranged between 40% and 54%, while non-fiction ranged between 40% and 65%. [3] [11] [12]
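
The written-versus-spoken contrast can be illustrated with the lexical_density sketch above, applied to two invented samples; the crude function-word list inflates the absolute figures, but the relative ordering matches Ure's observation.

```python
# Invented samples; with the illustrative function-word list above, the
# absolute values run high, but written prose still scores above speech.
written = ("Extensive preparation and repeated revision increase "
           "the informational density of published prose.")
spoken = "Well, you know, it was kind of like, we just went there and stuff."

print(round(lexical_density(written), 1))  # higher: dense written register
print(round(lexical_density(spoken), 1))   # lower: feedback-oriented speech
```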

The relationship and intimacy between the participants in a particular communication affect lexical density, states Ure, as do the circumstances preceding the communication for the same speaker or writer. The higher lexical density of written forms of communication, she proposed, arises primarily because written communication involves greater preparation, reflection, and revision. [2] Discussions and conversations that involve or anticipate feedback tend to be sparser and have lower lexical density. In contrast, state Stubbs and Biber, instructions, law enforcement orders, news read from screen prompts within an allotted time, and literature that authors expect readers to be able to re-read tend to maximize lexical density. [2] [13] In surveys of the lexical density of spoken and written materials across different European countries and age groups, Johansson and Strömqvist report that the lexical densities of population groups were broadly similar, varying with the morphological structure of the native language and, within a country, with the age groups sampled. Lexical density was highest for adults, while variation, estimated as lexical diversity, was higher among teenagers of the same age group (13-year-olds, 17-year-olds), states Johansson. [14] [15]

See also

Corpus linguistics
Collocation
Lexical chain
Lexical diversity
Lexis
Natural language processing
Part of speech
Systemic functional grammar
Vocabulary

References

  1. Michael Halliday (1985). Spoken and Written Language. Deakin University. pp. 61–64. ISBN 978-0-7300-0309-0.
  2. Erik Castello (2008). Text Complexity and Reading Comprehension Tests. Peter Lang. pp. 49–51. ISBN 978-3-03911-717-8.
  3. Belinda Crawford Camiciottoli (2007). The Language of Business Studies Lectures: A Corpus-assisted Analysis. John Benjamins Publishing. p. 73. ISBN 978-90-272-5400-9.
  4. Paul Yoder (2006). "Predicting Lexical Density Growth Rate in Young Children With Autism Spectrum Disorders". American Journal of Speech-Language Pathology. 15 (4): 362–373.
  5. Michael Halliday (1985). Spoken and Written Language. Deakin University. pp. 61–75 (Chapter 5), 76–91 (Chapter 6). ISBN 978-0-7300-0309-0.
  6. Victoria Johansson (2009). Developmental aspects of text production in writing and speech. Department of Linguistics and Phonetics, Centre for Languages and Literature, Lund University. pp. 1–16. ISBN 978-91-974116-7-7.
  7. V To; S Fan; DP Thomas (2013). "Lexical density and Readability: A case study of English Textbooks". The International Journal of Language, Society and Culture. 37 (7): 61–71.
  8. O'Loughlin, Kieran (1995). "Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test". Language Testing. SAGE Publications. 12 (2): 217–237. doi:10.1177/026553229501200205. S2CID 145638000.
  9. Perfetti, Charles A. (1969). "Lexical density and phrase structure depth as variables in sentence retention". Journal of Verbal Learning and Verbal Behavior. Elsevier BV. 8 (6): 719–724. doi:10.1016/s0022-5371(69)80035-6. ISSN 0022-5371.
  10. Douglas Biber (2007). Discourse on the Move: Using Corpus Analysis to Describe Discourse Structure. John Benjamins Publishing. pp. 97–98 with footnote 7. ISBN 978-90-272-2302-9.
  11. Mark Warschauer; Richard Kern (2000). Network-Based Language Teaching: Concepts and Practice. Cambridge University Press. pp. 107–108. ISBN 978-0-521-66742-5.
  12. Michael Stubbs (1996). Text and Corpus Analysis: Computer Assisted Studies of Language and Culture. Wiley. pp. 71–73. ISBN 978-0-631-19512-2.
  13. Michael Stubbs (1986). "Lexical density: A technique and some findings". In Malcolm Coulthard (ed.). Talking about Text. University of Birmingham: English Language Research. pp. 27–42.
  14. Victoria Johansson (2008). "Lexical diversity and lexical density in speech and writing: a developmental perspective". Linguistics and Phonetics Working Papers. Lund University. 53: 61–79.
  15. Sven Strömqvist; Victoria Johansson; Sarah Kriz; H Ragnarsdottir; Ravid Aisenmann; Dorit Ravid (2002). "Toward a crosslinguistic comparison of lexical quanta in speech and writing". Written Language and Literacy. 5: 45–67. doi:10.1075/wll.5.1.03str.
