Lemmatization

Last updated November 15, 2024

Lemmatization (or less commonly lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.^[1]

In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighbouring sentences or even an entire document. As a result, developing efficient lemmatization algorithms is an open area of research.^[2]^[3]^[4]

Description

In many languages, words appear in several inflected forms. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. The association of the base form with a part of speech is often called a lexeme of the word.

Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster. The reduced "accuracy" may not matter for some applications. In fact, when used within information retrieval systems, stemming improves query recall accuracy, or true positive rate, when compared to lemmatization. Nonetheless, stemming reduces precision, or the proportion of positively-labeled instances that are actually positive, for such systems.^[5]

For instance:

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization.
The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.

Document indexing software like Lucene ^[6] can store the base stemmed format of the word without the knowledge of meaning, but only considering word formation grammar rules. The stemmed word itself might not be a valid word: 'lazy', as seen in the example below, is stemmed by many stemmers to 'lazi'. This is because the purpose of stemming is not to produce the appropriate lemma – that is a more challenging task that requires knowledge of context. The main purpose of stemming is to map different forms of a word to a single form.^[7] As a rule-based algorithm, dependent only upon the spelling of a word, it sacrifices accuracy to ensure that, for example, when 'laziness' is stemmed to 'lazi', it has the same stem as 'lazy'.

Algorithms

A trivial way to do lemmatization is by simple dictionary lookup. This works well for straightforward inflected forms, but a rule-based system will be needed for other cases, such as in languages with long compound words. Such rules can be either hand-crafted or learned automatically from an annotated corpus.

Use in biomedicine

Morphological analysis of published biomedical literature can yield useful results. Morphological processing of biomedical text can be more effective by a specialized lemmatization program for biomedicine, and may improve the accuracy of practical information extraction tasks.^[8]

Related Research Articles

A lexeme is a unit of lexical meaning that underlies a set of words that are related through inflection. It is a basic abstract unit of meaning, a unit of morphological analysis in linguistics that roughly corresponds to a set of forms taken by a single root word. For example, in the English language, run, runs, ran and running are forms of the same lexeme, which can be represented as RUN.

In linguistics, morphology is the study of words, including the principles by which they are formed, and how they relate to one another within a language. Most approaches to morphology investigate the structure of words in terms of morphemes, which are the smallest units in a language with some independent meaning. Morphemes include roots that can exist as words by themselves, but also categories such as affixes that can only appear as part of a larger word. For example, in English the root catch and the suffix -ing are both morphemes; catch may appear as its own word, or it may be combined with -ing to form the new word catching. Morphology also analyzes how words behave as parts of speech, and how they may be inflected to express grammatical categories including number, tense, and aspect. Concepts such as productivity are concerned with how speakers create words in specific contexts, which evolves over the history of a language.

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.

Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.

A vocabulary is a set of words, typically the set in a language or the set known to an individual. The word vocabulary originated from the Latin vocabulum, meaning "a word, name". It forms an essential component of language and communication, helping convey thoughts, ideas, emotions, and information. Vocabulary can be oral, written, or signed and can be categorized into two main types: active vocabulary and passive vocabulary. An individual's vocabulary continually evolves through various methods, including direct instruction, independent reading, and natural language exposure, but it can also shrink due to forgetting, trauma, or disease. Furthermore, vocabulary is a significant focus of study across various disciplines, like linguistics, education, psychology, and artificial intelligence. Vocabulary is not limited to single words; it also encompasses multi-word units known as collocations, idioms, and other types of phraseology. Acquiring an adequate vocabulary is one of the largest challenges in learning a second language.

A root is the core of a word that is irreducible into more meaningful elements. In morphology, a root is a morphologically simple unit which can be left bare or to which a prefix or a suffix can attach. The root word is the primary lexical unit of a word, and of a word family, which carries aspects of semantic content and cannot be reduced into smaller constituents. Content words in nearly all languages contain, and may consist only of, root morphemes. However, sometimes the term "root" is also used to describe the word without its inflectional endings, but with its lexical endings in place. For example, chatters has the inflectional root or lemma chatter, but the lexical root chat. Inflectional roots are often called stems. A root, or a root morpheme, in the stricter sense, may be thought of as a monomorphemic stem.

Machine translation can use a method based on dictionary entries, which means that the words will be translated as a dictionary does – word by word, usually without much correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis or lemmatisation. While this approach to machine translation is probably the least sophisticated, dictionary-based machine translation is ideally suitable for the translation of long lists of phrases on the subsentential level, e.g. inventories or simple catalogs of products and services.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Stop words are the words in a stop list which are filtered out before or after processing of natural language data (text) because they are deemed insignificant. There is no single universal list of stop words used by all natural language processing tools, nor any agreed upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in [information retrieval] systems over time has been from standard use of quite large stop lists to very small stop lists to no stop list whatsoever".

In linguistics, a word stem is a part of a word responsible for its lexical meaning. Typically, a stem remains unmodified during inflection with few exceptions due to apophony

<span class="mw-page-title-main">Word</span> Basic element of language

A word is a basic element of language that carries meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no consensus among linguists on its definition and numerous attempts to find specific criteria of the concept remain controversial. Different standards have been proposed, depending on the theoretical background and descriptive context; these do not converge on a single definition. Some specific definitions of the term "word" are employed to convey its different meanings at different levels of description, for example based on phonological, grammatical or orthographic basis. Others suggest that the concept is simply a convention used in everyday situations.

In morphology and lexicography, a lemma is the canonical form, dictionary form, or citation form of a set of word forms. In English, for example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lexeme, in this context, refers to the set of all the inflected or alternating forms in the paradigm of a single word, and lemma refers to the particular form that is chosen by convention to represent the lexeme. Lemmas have special significance in highly inflected languages such as Arabic, Turkish, and Russian. The process of determining the lemma for a given lexeme is called lemmatisation. The lemma can be viewed as the chief of the principal parts, although lemmatisation is at least partly arbitrary.

Document clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

Transfer-based machine translation is a type of machine translation (MT). It is currently one of the most widely used methods of machine translation. In contrast to the simpler direct model of MT, transfer MT breaks translation into three steps: analysis of the source language text to determine its grammatical structure, transfer of the resulting structure to a structure suitable for generating text in the target language, and finally generation of this text. Transfer-based MT systems are thus capable of using knowledge of the source and target languages.

In linguistic morphology, inflection is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness. The inflection of verbs is called conjugation, while the inflection of nouns, adjectives, adverbs, etc. can be called declension.

In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word. Given that the output of word-sense induction is a set of senses for the target word, this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Julie Beth Lovins was a computational linguist who published The Lovins Stemming Algorithm - a type of stemming algorithm for word matching - in 1968.

The Lovins Stemmer is a single pass, context sensitive stemmer, which removes endings based on the longest-match principle. The stemmer was the first to be published and was extremely well developed considering the date of its release, having been the main influence on a large amount of the future work in the area. -Adam G., et al

The following outline is provided as an overview of and topical guide to natural-language processing:

Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher’s keywords. Query understanding methods generally take place before the search engine retrieves and ranks results. It is related to natural language processing but specifically focused on the understanding of search queries.

References

↑ Collins English Dictionary, entry for "lemmatize"
↑ "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".
↑ Müller, Thomas; Cotterell, Ryan; Fraser, Alexander; Schütze, Hinrich (2015). Joint Lemmatization and Morphological Tagging with LEMMING (PDF). 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics. pp. 2268–2274. doi: 10.18653/v1/D15-1272 .
↑ Bergmanis, Toms; Goldwater, Sharon. "Context Sensitive Neural Lemmatization with Lematus" (PDF).
↑ Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich. "Introduction to Information Retrieval". Cambridge University Press.
↑ "Lucene Snowball". Apache project.
↑ Martin Porter. "Porter Stemmer".
↑ Liu, H.; Christiansen, T.; Baumgartner, W. A.; Verspoor, K. (2012). "BioLemmatizer: A lemmatization tool for morphological processing of biomedical text". Journal of Biomedical Semantics . 3: 3. doi: 10.1186/2041-1480-3-3 . PMC 3359276 . PMID 22464129.

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Collins English Dictionary, entry for "lemmatize"

[Semantic_Annotation_Research-2] "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".

[Muller,_University_of_Munich-3] Müller, Thomas; Cotterell, Ryan; Fraser, Alexander; Schütze, Hinrich (2015). Joint Lemmatization and Morphological Tagging with LEMMING (PDF). 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics. pp. 2268–2274. doi: 10.18653/v1/D15-1272 .

[4] Bergmanis, Toms; Goldwater, Sharon. "Context Sensitive Neural Lemmatization with Lematus" (PDF).

[Stanford_Information_Retrieval_Book-5] Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich. "Introduction to Information Retrieval". Cambridge University Press.

[Lucene_Snowball-6] "Lucene Snowball". Apache project.

[Porter_Stemmer-7] Martin Porter. "Porter Stemmer".

[8] Liu, H.; Christiansen, T.; Baumgartner, W. A.; Verspoor, K. (2012). "BioLemmatizer: A lemmatization tool for morphological processing of biomedical text". Journal of Biomedical Semantics . 3: 3. doi: 10.1186/2041-1480-3-3 . PMC 3359276 . PMID 22464129.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]