Lemma (morphology)

In morphology and lexicography, a lemma (pl.: lemmas or lemmata) is the canonical form, [1] dictionary form, or citation form of a set of word forms. [2] In English, for example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lexeme, in this context, refers to the set of all the inflected or alternating forms in the paradigm of a single word, and lemma refers to the particular form that is chosen by convention to represent the lexeme. Lemmas have special significance in highly inflected languages such as Arabic, Turkish, and Russian. The process of determining the lemma for a given lexeme is called lemmatisation. The lemma can be viewed as the chief of the principal parts, although lemmatisation is at least partly arbitrary.
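The relation between a lexeme, its word forms and its lemma can be sketched as a lookup table. The snippet below is a minimal illustration in Python, not a real lemmatiser: the names `LEXEMES`, `build_form_index` and `lemmatise` are hypothetical, and the table holds only the break example given above.

```python
# Hypothetical toy data: each lexeme is stored as its lemma plus its word forms.
LEXEMES = {
    "break": {"break", "breaks", "broke", "broken", "breaking"},
}


def build_form_index(lexemes):
    """Invert the lemma -> forms mapping into a form -> lemma mapping."""
    index = {}
    for lemma, forms in lexemes.items():
        for form in forms:
            index[form] = lemma
    return index


FORM_TO_LEMMA = build_form_index(LEXEMES)


def lemmatise(form):
    """Return the lemma of a listed form, or None if the form is unknown."""
    return FORM_TO_LEMMA.get(form.lower())


print(lemmatise("broke"))     # break
print(lemmatise("Breaking"))  # break
print(lemmatise("mice"))      # None (not in this toy table)
```

A table of this kind treats the choice of lemma as pure convention: replacing the key "break" with any other member of the set would leave the lexeme itself unchanged.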

Morphology

The form of a word that is chosen to serve as the lemma is usually the least marked form, but there are several exceptions such as the use of the infinitive for verbs in some languages.

For English, the citation form of a noun is the singular (and non-possessive) form: mouse rather than mice. For multiword lexemes that contain possessive adjectives or reflexive pronouns, the citation form uses a form of the indefinite pronoun one: do one's best, perjure oneself. In European languages with grammatical gender, the citation form of regular adjectives and nouns is usually the masculine singular.[ citation needed ] If the language also has cases, the citation form is often the masculine singular nominative.

For many languages, the citation form of a verb is the infinitive: French aller, German gehen, Hindustani जाना / جانا, Spanish ir. English verbs usually have an infinitive, which in its bare form (without the particle to) is the least marked form (for example, break is chosen over to break, breaks, broke, breaking, and broken); for defective verbs with no infinitive the present tense is used (for example, must has only one form while shall has no infinitive, and both lemmas are their lexemes' present tense forms). For Latin, Ancient Greek, Modern Greek, and Bulgarian, the first person singular present tense is traditionally used, but some modern dictionaries use the infinitive instead (except for Bulgarian, which lacks infinitives). For contracted verbs in Ancient Greek, an uncontracted first person singular present tense is used to reveal the contract vowel: φιλέω philéō for φιλῶ philō "I love" [implying affection], ἀγαπάω agapáō for ἀγαπῶ agapō "I love" [implying regard]. Finnish dictionaries list verbs not under their root, but under the first infinitive, marked with -(t)a, -(t)ä.

For Japanese, the non-past (present and future) tense is used. For Arabic, the third-person singular masculine of the past/perfect tense is the least-marked form and is used for entries in modern dictionaries. In older dictionaries, which are still commonly used, the triliteral root of the word, either a verb or a noun, is used. This is similar to Hebrew, which also uses the third-person singular masculine perfect form, e.g. ברא bara' "create", כפר kaphar "deny". Georgian uses the verbal noun. For Korean, -da is attached to the stem.

In Tamil, an agglutinative language, the verb stem (which is also the imperative form, the least marked one) is often cited, e.g., இரு.

In Irish, words are highly inflected by case (genitive, nominative, dative and vocative) and by their place within a sentence because of initial mutations. The noun cainteoir, the lemma for the noun meaning "speaker", has a variety of forms: chainteoir, gcainteoir, cainteora, chainteora, cainteoirí, chainteoirí and gcainteoirí.

Some phrases are cited in a sort of lemma: Carthago delenda est (literally, "Carthage must be destroyed") is a common way of citing Cato, but what he said was nearer to censeo Carthaginem esse delendam ("I hold Carthage to be in need of destruction").

Lexicography

In a dictionary, the lemma "go" represents the inflected forms "go", "goes", "going", "went", and "gone". The relationship between an inflected form and its lemma is usually denoted by an angle bracket, e.g., "went" < "go". The disadvantage of this simplification is that a declined or conjugated form cannot be looked up directly, although some dictionaries, like Webster's Dictionary, do list "went". Multilingual dictionaries vary in how they deal with this issue: the Langenscheidt dictionary of German does not list ging (< gehen), but the Cassell does.

Lemmas or word stems are used often in corpus linguistics for determining word frequency. In that usage, the specific definition of "lemma" is flexible depending on the task it is being used for.
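Counting word frequency by lemma rather than by surface form can be sketched as follows. This is only an illustration under stated assumptions: the table `GO_FORMS` and the function `lemma_frequencies` are hypothetical names, the sample sentence is invented, and a real corpus study would rely on a full lexicon or a trained lemmatiser instead of a five-entry dictionary.

```python
from collections import Counter

# Hypothetical form -> lemma table covering only the lexeme "go".
GO_FORMS = {"go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go"}


def lemma_frequencies(tokens, form_to_lemma):
    """Count tokens by lemma; forms missing from the table count as themselves."""
    return Counter(form_to_lemma.get(t.lower(), t.lower()) for t in tokens)


# Invented sample text, split on whitespace for simplicity.
tokens = "She went home and he goes home while they are going out".split()
freq = lemma_frequencies(tokens, GO_FORMS)

print(freq["go"])    # 3  (went, goes, going)
print(freq["home"])  # 2
```

The fallback for unknown forms is one of the places where the working definition of "lemma" depends on the task: a different study might discard such tokens or reduce them to a stem instead.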

Pronunciation

A word may have different pronunciations, depending on its phonetic environment (the neighbouring sounds) or on the degree of stress in a sentence. An example of the latter is the weak and strong forms of certain English function words like some and but (pronounced /sʌm/, /bʌt/ when stressed but /s(ə)m/, /bət/ when unstressed). Dictionaries usually give the pronunciation used when the word is pronounced alone (its isolation form) and with stress, but they may also note common weak forms of pronunciation.

Difference between stem and lemma

The stem is the part of the word that never changes even when morphologically inflected; a lemma is the least marked form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-", because there are words such as production and producing. [3] [ failed verification ] In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed.[ citation needed ] When phonology is taken into account, the definition of the stem as the unchangeable part of the word is not useful, as can be seen in the phonological forms of the words in the preceding example: "produced" /prəˈdjuːst/ vs. "production" /prəˈdʌkʃən/.

Some lexemes have several stems but one lemma. For instance the verb "to go" has the stems "go" and "went" due to suppletion: the past tense was co-opted from a different verb, "to wend".
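The contrast drawn in this section can be sketched by taking the "part that never changes" literally as the longest common prefix of the written forms. This is a deliberately naive orthographic approximation, not a linguistic stemming algorithm; the function name `orthographic_stem` and the form lists are assumptions made for illustration. `os.path.commonprefix` is a standard-library function that compares strings character by character.

```python
import os


def orthographic_stem(forms):
    """Longest common prefix of the written forms (a naive stand-in for the stem)."""
    return os.path.commonprefix(list(forms))


# Forms taken from the examples in this section.
produce_forms = ["produce", "produces", "produced", "producing", "production"]
go_forms = ["go", "goes", "going", "gone", "went"]

print(orthographic_stem(produce_forms))   # produc
print(repr(orthographic_stem(go_forms)))  # ''
```

The empty result for "go" reflects suppletion: "went" shares no initial material with the other forms, so no single unchanging stem exists, yet the lexeme still has exactly one lemma.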

Headword

A headword or catchword [4] is the lemma under which a set of related dictionary or encyclopaedia entries appears. The headword is used to locate the entry, and dictates its alphabetical position. Depending on the size and nature of the dictionary or encyclopedia, the entry may include alternative meanings of the word, its etymology, pronunciation and inflections, related lemmas such as compound words or phrases that contain the headword, and encyclopedic information about the concepts represented by the word.

For example, the headword bread may contain the following (simplified) definitions:

Bread
(noun)
  • A common food made from the combination of flour, water and yeast
  • Money (slang)
(verb)
  • To coat in breadcrumbs
to know which side your bread is buttered: to know how to act in your own best interests.
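An entry of this kind can be modelled as a small record keyed by its headword. The sketch below is one possible layout, not the data model of any actual dictionary: the class name `Entry` and its fields are hypothetical, and the extra headwords "break" and "Brave" are included only to show how the headword determines alphabetical position.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    headword: str
    senses: dict = field(default_factory=dict)   # part of speech -> list of definitions
    phrases: dict = field(default_factory=dict)  # phrase -> definition


bread = Entry(
    headword="bread",
    senses={
        "noun": [
            "A common food made from the combination of flour, water and yeast",
            "Money (slang)",
        ],
        "verb": ["To coat in breadcrumbs"],
    },
    phrases={
        "to know which side your bread is buttered":
            "to know how to act in your own best interests",
    },
)

# The headword alone dictates where the entry is filed (case-insensitively here).
entries = sorted(
    [Entry("break"), bread, Entry("Brave")],
    key=lambda e: e.headword.casefold(),
)
print([e.headword for e in entries])  # ['Brave', 'bread', 'break']
```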

The Academic Dictionary of Lithuanian contains around 500,000 headwords. The Oxford English Dictionary (OED) has around 273,000 headwords along with 220,000 other lemmas, [5] while Webster's Third New International Dictionary has about 470,000. [6] The Deutsches Wörterbuch (DWB), the largest lexicon of the German language, has around 330,000 headwords. [7] These values are cited by the dictionary makers and may not use exactly the same definition of a headword. In addition, headwords may not accurately reflect a dictionary's physical size. The OED and the DWB, for instance, include exhaustive historical reviews and exact citations from source documents not usually found in standard dictionaries.

The term 'lemma' comes from the practice in Greco-Roman antiquity of using the word to refer to the headwords of marginal glosses in scholia; for this reason, the Ancient Greek plural form is sometimes used, namely lemmata (Greek λῆμμα, pl. λήμματα).

References

  1. Zgusta, Ladislav (2006). Dolezal, Fredric F.M. (ed.). Lexicography then and now. p. 202. ISBN 3484391294. A minor... problem can arise when the canonical form of the headword, i.e. the form in which it is to be cited, is to be chosen.
  2. Francis, W.N.; Kučera, H (1982). Frequency Analysis of English Usage: Lexicon and Usage. Boston: Houghton Mifflin.
  3. "Natural Language Toolkit — NLTK 3.0 documentation". Nltk.org. 2015-09-05. Retrieved 2015-09-27.
  4. Oxford English Dictionary, 3rd edition, 2018, s.v., definition 5.
  5. "Glossary - Oxford English Dictionary". public.oed.com. Retrieved 3 October 2016.
  6. "Mwunabridged". www.merriam-webster.com. Retrieved 3 October 2016.
  7. The Deutsches Wörterbuch at the BBAW. Archived 2016-08-12 at the Wayback Machine. Retrieved 22 June 2012.