Morphological parsing

Morphological parsing, in natural language processing, is the process of determining the morphemes from which a given word is constructed. A morphological parser must be able to distinguish between orthographic rules and morphological rules. For example, the word 'foxes' can be decomposed into 'fox' (the stem) and 'es' (a suffix indicating plurality).

The generally accepted approach to morphological parsing is to use a finite state transducer (FST), which takes a word as input and outputs its stem and modifiers. The FST is initially built by algorithmically parsing a word source, such as a dictionary, that is annotated with modifier markup.
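The idea can be sketched with a toy transducer built from a two-word lexicon. This is only an illustrative sketch: the class `FST`, the helper `build_noun_fst`, the lexicon format, and the output tags `+N+SG` / `+N+PL` are assumptions made for the example, not part of any particular toolkit.

```python
from collections import defaultdict


class FST:
    """A tiny nondeterministic finite state transducer (state 0 is the start)."""

    def __init__(self):
        # state -> list of (input string, output string, next state)
        self.transitions = defaultdict(list)
        self.finals = set()
        self._next_id = 1

    def new_state(self):
        state = self._next_id
        self._next_id += 1
        return state

    def add_arc(self, src, inp, out, dst):
        self.transitions[src].append((inp, out, dst))

    def transduce(self, word):
        """Return every output produced by a path that consumes all of `word`."""
        results = []
        stack = [(0, 0, "")]  # (state, position in word, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(word) and state in self.finals:
                results.append(out)
            for inp, arc_out, dst in self.transitions[state]:
                if inp == "":  # epsilon arc: emit output without reading input
                    stack.append((dst, pos, out + arc_out))
                elif word.startswith(inp, pos):
                    stack.append((dst, pos + len(inp), out + arc_out))
        return sorted(results)


def build_noun_fst(lexicon):
    """Build an FST from a hypothetical lexicon mapping stem -> plural suffix."""
    fst = FST()
    final = fst.new_state()
    fst.finals.add(final)
    for stem, plural_suffix in lexicon.items():
        state = 0
        for ch in stem:
            # Reuse an existing arc so stems with a common prefix share states.
            for inp, out, dst in fst.transitions[state]:
                if inp == ch and out == ch:
                    state = dst
                    break
            else:
                nxt = fst.new_state()
                fst.add_arc(state, ch, ch, nxt)
                state = nxt
        fst.add_arc(state, "", "+N+SG", final)             # bare stem: singular
        fst.add_arc(state, plural_suffix, "+N+PL", final)  # stem + suffix: plural
    return fst


fst = build_noun_fst({"fox": "es", "cat": "s"})
print(fst.transduce("foxes"))  # ['fox+N+PL']
print(fst.transduce("cat"))    # ['cat+N+SG']
```

A word that no path accepts, such as 'foxs' here, yields an empty list. Production systems usually compile such transducers from much larger annotated lexicons rather than building them by hand.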

Another approach is an indexed lookup method, which uses a constructed radix tree. This route is rarely taken because it breaks down for morphologically complex languages.
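A minimal sketch of the indexed lookup idea follows. For brevity it uses a plain character trie rather than a path-compressed radix tree, and the class names and the `(stem, suffix)` analysis format are assumptions made for illustration.

```python
class TrieNode:
    __slots__ = ("children", "analysis")

    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.analysis = None  # set only on nodes that end a known surface form


class SurfaceFormIndex:
    """Maps fully inflected surface forms to a precomputed analysis."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, surface, analysis):
        node = self.root
        for ch in surface:
            node = node.children.setdefault(ch, TrieNode())
        node.analysis = analysis

    def lookup(self, surface):
        node = self.root
        for ch in surface:
            node = node.children.get(ch)
            if node is None:
                return None  # the form was never indexed
        return node.analysis


index = SurfaceFormIndex()
index.insert("fox", ("fox", ""))
index.insert("foxes", ("fox", "es"))
print(index.lookup("foxes"))  # ('fox', 'es')
print(index.lookup("foxen"))  # None
```

In this sketch every surface form has to be inserted explicitly. That is one plausible reading of why the method scales poorly when a single stem can take a very large number of inflected forms.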

With the advancement of neural networks in natural language processing, FSTs have become less common for morphological analysis, especially for languages with abundant training data. For such languages, it is possible to build character-level language models without explicit use of a morphological parser. [1]
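As one illustration of working below the word level without a parser, the subword approach of [1] represents each word by its character n-grams. The following is a minimal sketch of that extraction step only, not the authors' implementation. The function name `char_ngrams` is an assumption, and the default range of 3 to 6 characters follows the range described in that paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, with '<' and '>' marking its edges."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share n-grams even though no morphemes were identified.
shared = set(char_ngrams("foxes", 3, 3)) & set(char_ngrams("boxes", 3, 3))
print(sorted(shared))  # ['es>', 'oxe', 'xes']
```

The shared n-gram 'es>' loosely plays the role of the plural suffix, which suggests how such models can pick up morphological regularities implicitly.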

Orthographic

Orthographic rules are general rules used when breaking a word into its stem and modifiers. For example, a singular English word ending in a consonant followed by -y typically forms its plural in -ies (pony, ponies). Contrast this with morphological rules, which cover the corner cases of these general rules. Both types of rule are used to construct systems that can do morphological parsing.
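A few general spelling rules for English plurals can be sketched as pattern rewrites applied in the parsing direction. The function name, the suffix labels, and the exact rule set are assumptions chosen for illustration and are far from complete.

```python
import re


def undo_plural_spelling(surface):
    """Return candidate (stem, suffix) splits for a possibly plural English noun."""
    candidates = []
    # consonant + "ies" -> consonant + "y" (ponies -> pony)
    if re.search(r"[^aeiou]ies$", surface):
        candidates.append((surface[:-3] + "y", "-s"))
    # sibilant + "es" -> sibilant (foxes -> fox)
    if re.search(r"(s|x|z|ch|sh)es$", surface):
        candidates.append((surface[:-2], "-es"))
    # plain "s" (cats -> cat), skipping words such as "glass"
    if surface.endswith("s") and not surface.endswith("ss"):
        candidates.append((surface[:-1], "-s"))
    return candidates


print(undo_plural_spelling("ponies"))  # [('pony', '-s'), ('ponie', '-s')]
print(undo_plural_spelling("foxes"))   # [('fox', '-es'), ('foxe', '-s')]
print(undo_plural_spelling("cats"))    # [('cat', '-s')]
```

The rules deliberately over-generate ('ponie', 'foxe'). In this sketch a lexicon of known stems would be needed to discard the spurious candidates.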

Morphological

Morphological rules are exceptions to the orthographic rules used when breaking a word into its stem and modifiers. For example, while a word in English is normally pluralized by adding the suffix 's', the word 'fish' does not change when pluralized. Contrast this with orthographic rules, which are general. Both types of rule are used to construct systems that can do morphological parsing.
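One common way to combine the two kinds of rule is to consult a table of exceptions before falling back on a general rule. The sketch below assumes tiny hand-written tables and a single fallback rule; the names `INVARIANT_NOUNS`, `IRREGULAR_PLURALS`, and `analyze_noun` are hypothetical.

```python
# Exceptions are listed explicitly; everything else goes through the general rule.
INVARIANT_NOUNS = {"fish", "sheep", "deer"}
IRREGULAR_PLURALS = {"mice": "mouse", "children": "child", "feet": "foot"}


def analyze_noun(surface):
    """Return a list of (stem, number) analyses for an English noun form."""
    if surface in INVARIANT_NOUNS:
        # The same form serves as singular and plural, so both readings are kept.
        return [(surface, "SG"), (surface, "PL")]
    if surface in IRREGULAR_PLURALS:
        return [(IRREGULAR_PLURALS[surface], "PL")]
    if surface.endswith("s") and not surface.endswith("ss"):
        return [(surface[:-1], "PL")]  # general rule: strip the plural 's'
    return [(surface, "SG")]


print(analyze_noun("fish"))  # [('fish', 'SG'), ('fish', 'PL')]
print(analyze_noun("cats"))  # [('cat', 'PL')]
print(analyze_noun("mice"))  # [('mouse', 'PL')]
```

Checking the exception tables first keeps the general rule simple, at the cost of maintaining the tables by hand.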

Various models of natural morphological processing have been proposed. Some experimental studies suggest that monolingual speakers process words as wholes upon hearing them, while their late-bilingual peers break words down into their corresponding morphemes, because their lexical representations are not as specific and because lexical processing in the second language may be less frequent than in the mother tongue. [2]

Applications of morphological processing include machine translation, spell checking, and information retrieval.

Related Research Articles

In linguistics, an allomorph is a variant phonetic form of a morpheme, that is, a unit of meaning that varies in sound and spelling without changing the meaning. The term allomorph describes the realization of phonological variations for a specific morpheme. The different allomorphs that a morpheme can become are governed by morphophonemic rules. These phonological rules determine what phonetic form, or specific pronunciation, a morpheme will take based on the phonological or morphological context in which it appears.

In linguistics, an affix is a morpheme that is attached to a word stem to form a new word or word form. The two main categories are derivational and inflectional affixes. The former, such as un-, -ation, anti-, and pre-, introduce a semantic change to the word they are attached to. The latter introduce a syntactic change, such as singular into plural, or present simple tense into present continuous or past tense by adding -ing or -ed to a word. All of them are bound morphemes by definition; prefixes and suffixes may be separable affixes.

A morpheme is the smallest meaningful constituent of a linguistic expression. The field of linguistic study dedicated to morphemes is called morphology.

Morphology (linguistics): Study of words, their formation, and their relationships in a word

In linguistics, morphology is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning. Morphology differs from morphological typology, which is the classification of languages based on their use of words, and lexicology, which is the study of words and how they make up a language's vocabulary.

Agglutination: Process of word formation by combining morphemes of singular meaning

In linguistics, agglutination is a morphological process in which words are formed by stringing together morphemes, each of which corresponds to a single syntactic feature. Languages that use agglutination widely are called agglutinative languages. For example, in the agglutinative language of Turkish, the word evlerinizden consists of the morphemes ev-ler-iniz-den, literally translated morpheme-by-morpheme as house-plural-your(plural)-from. Agglutinative languages are often contrasted with isolating languages, in which words are monomorphemic, and fusional languages, in which words can be complex, but morphemes may correspond to multiple features.

Crow language: Missouri Valley Siouan language of Montana, US

Crow is a Missouri Valley Siouan language spoken primarily by the Crow Nation in present-day southeastern Montana. The name Apsáalooke translates to "children of the raven." With 2,480 speakers according to the 1990 US Census, it has one of the larger speaker populations among American Indian languages.

In linguistics, a word stem is a part of a word responsible for its lexical meaning. Typically, a stem remains unmodified during inflection, with few exceptions due to apophony.

Word: Basic element of language

A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no consensus among linguists on its definition and numerous attempts to find specific criteria of the concept remain controversial. Different standards have been proposed, depending on the theoretical background and descriptive context; these do not converge on a single definition. Some specific definitions of the term "word" are employed to convey its different meanings at different levels of description, for example based on phonological, grammatical or orthographic basis. Others suggest that the concept is simply a convention used in everyday situations.

In linguistics, especially within generative grammar, phi features are the morphological expression of a semantic process in which a word or morpheme varies with the form of another word or phrase in the same sentence. This variation can include person, number, gender, and case, as encoded in pronominal agreement with nouns and pronouns. Several other features are included in the set of phi-features, such as the categorical features ±N (nominal) and ±V (verbal), which can be used to describe lexical categories and case features.

In linguistics, apophony is any alternation within a word that indicates grammatical information.

Tübatulabal is an Uto-Aztecan language, traditionally spoken in Kern County, California, United States. It is the traditional language of the Tübatulabal, who still speak it in addition to English. The language originally had three main dialects: Bakalanchi, Pakanapul and Palegawan.

Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing.

Aguaruna language: Chicham language of Peru

Aguaruna is an indigenous American language of the Chicham family spoken by the Aguaruna people in Northern Peru. According to Ethnologue, based on the 2007 Census, 53,400 of the 55,700 members of the ethnic group speak Aguaruna, which is almost the entire population. It is used vigorously in all domains of life, both written and oral. It is written with the Latin script. The literacy rate in Aguaruna is 60-90%. However, there are few monolingual speakers today; nearly all speakers also speak Spanish. The school system begins with Aguaruna, and as the students progress, Spanish is gradually added. Bilingualism is viewed positively. Between 50 and 75% of the Aguaruna population are literate in Spanish. A modest dictionary of the language has been published.

Nanosyntax is an approach to syntax where the terminal nodes of syntactic parse trees may be reduced to units smaller than a morpheme. Each unit may stand as an irreducible element and not be required to form a further "subtree." Because the terminals are smaller than morphemes, morphemes and words cannot be itemised as a single terminal, and instead are composed of several terminals. As a result, Nanosyntax can serve as a solution to phenomena that are inadequately explained by other theories of syntax.

This article presents a brief overview of the grammar of Sesotho and provides links to more detailed articles.

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between surface forms and lexical forms of words. Surface forms of words are those found in natural language text. The corresponding lexical form of a surface form is the lemma followed by grammatical information. In English, give, gives, giving, gave and given are surface forms of the verb give. The lexical form would be "give", verb. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.

Inflection: Process of word formation

In linguistic morphology, inflection is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness. The inflection of verbs is called conjugation, and one can refer to the inflection of nouns, adjectives, adverbs, pronouns, determiners, participles, prepositions and postpositions, numerals, articles, etc., as declension.

Stemming: Process of reducing words to word stems

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Tommo So is a language spoken in the eastern part of Mali's Mopti Region. It is placed under the Dogon language family, a subfamily of the Niger-Congo language family.

Mekéns (Mekem), or Amniapé, is a nearly extinct Tupian language of the state of Rondônia, in the Amazon region of Brazil.

References

  1. Bojanowski, Piotr; Grave, Edouard; Joulin, Armand; Mikolov, Tomas (2017). "Enriching Word Vectors with Subword Information". Transactions of the Association for Computational Linguistics. 5: 135–146.
  2. Durand López, Ezequiel M. (2021). "Morphological processing and individual frequency effects in L1 and L2 Spanish". Lingua. 257: 103093. doi:10.1016/j.lingua.2021.103093.