Automatic acquisition of lexicon

Automatic acquisition of lexicon is a computational process for building a morphological lexicon of a language. Such a lexicon is essential for natural language processing (NLP) and is a prerequisite for any wide-coverage parser. [1] The process has two main inputs: a raw corpus and a morphological description of the language. Its aim is to produce lemmas that account for all the words occurring in the corpus. To obtain a high-quality lexicon, the generated lemmas must be validated manually and the whole process iterated several times. The process covers only the open word classes (e.g. nouns, adjectives, verbs); closed classes (e.g. prepositions, pronouns, numerals) are excluded. The method is applicable to languages with rich morphology, such as Slovak, Russian or Croatian.

Applied to Slovak, an inflectional language, automatic acquisition covers derivational as well as inflectional morphology. As a result, users can look up derivational relations (e.g. adjectivization, prefixation) in the lexicon. For example, the Slovak word korpusový is an adjectivization of korpus (English: corpus).

Three-step loop

According to Benoît Sagot, [1] the acquisition of lemmas involves three stages:

  1. Generation and inflection
  2. Ranking
  3. Manual validation

The more iterations are performed, the more accurate the resulting lexicon. Each iteration relies on the information supplied by the manual validator.
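The loop can be pictured as a small driver that repeats the three stages. The following is only a minimal sketch: the function name, the `generate`, `rank` and `validate` callables, and the `rounds` and `batch` parameters are all hypothetical and do not come from [1].

```python
def acquire_lexicon(corpus_forms, generate, rank, validate, rounds=3, batch=10):
    """Repeat generation, ranking and validation (illustrative skeleton only).

    generate(forms)           -> {lemma: set of inflected forms}
    rank(hypotheses, forms)   -> list of lemmas, best first
    validate(lemmas, hyps)    -> (accepted lemmas, set of forms to exclude)
    """
    lexicon = {}
    excluded = set()
    for _ in range(rounds):
        # Forms not yet explained by the lexicon and not excluded by the validator.
        covered = set().union(*lexicon.values()) if lexicon else set()
        remaining = {f for f in corpus_forms if f not in covered and f not in excluded}
        if not remaining:
            break
        hypotheses = generate(remaining)
        ranked = rank(hypotheses, remaining)
        accepted, rejected_forms = validate(ranked[:batch], hypotheses)
        for lemma in accepted:
            lexicon[lemma] = hypotheses[lemma]
        excluded |= set(rejected_forms)
    return lexicon
```

Passing the stages in as callables keeps the skeleton independent of any particular generation or ranking method; the validator's decisions from one round change what is left to explain in the next.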

Generation and inflection

First, all words belonging to closed word classes (pronouns, prepositions, numerals) are manually excluded from the given corpus. The number of their occurrences in the corpus is recorded. Automatic generation follows: hypothetical lemmas are created according to the morphological description of the language. Each generated lemma is then inflected to build all of its inflected forms, and each resulting form is associated with its lemma and a morphological tag.
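As an illustration of generation and inflection, the sketch below uses a toy morphological description. The two paradigms, their names and the tag labels are heavily simplified assumptions made for this example; they are not the description used in [1].

```python
# Hypothetical, heavily simplified paradigms: name -> list of (suffix, tag).
# The first entry of each list is taken as the lemma (citation) form.
PARADIGMS = {
    "noun-fem-a": [("a", "nom.sg"), ("y", "gen.sg"), ("e", "dat.sg"), ("u", "acc.sg")],
    "noun-masc-0": [("", "nom.sg"), ("a", "gen.sg"), ("u", "dat.sg"), ("y", "nom.pl")],
}


def inflect(stem, paradigm):
    """Build every (form, tag) pair of a stem under the given paradigm."""
    return [(stem + suffix, tag) for suffix, tag in PARADIGMS[paradigm]]


def generate_hypotheses(corpus_forms):
    """Map each hypothetical (lemma, paradigm) to its list of (form, tag) pairs."""
    hypotheses = {}
    for form in corpus_forms:
        for name, endings in PARADIGMS.items():
            for suffix, _tag in endings:
                # A form can belong to a paradigm if it ends in one of its suffixes.
                if form.endswith(suffix) and len(form) > len(suffix):
                    stem = form[: len(form) - len(suffix)]
                    lemma = stem + endings[0][0]
                    hypotheses[(lemma, name)] = inflect(stem, name)
    return hypotheses


hypotheses = generate_hypotheses({"ženu", "ženy"})
```

With these two attested forms the sketch proposes the plausible lemma žena alongside spurious candidates such as žen and ženu, which is why a ranking step is needed afterwards.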

Ranking

A probabilistic model, computed by a fixed-point algorithm, ranks the hypothetical lemmas generated in the first step. Ideally, the top-ranked lemmas are all correct, whereas the lowest-ranked ones tend to be incorrect.
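The source does not spell out the probabilistic model, so the sketch below substitutes a simple stand-in: iterative proportional re-weighting, repeated until the scores stop changing. The function name, the penalty for unattested forms and the parameters `iterations` and `tol` are assumptions made for illustration, not the model from [1].

```python
def rank_hypotheses(hypotheses, form_counts, iterations=50, tol=1e-9):
    """Rank lemma hypotheses, best first (illustrative stand-in only).

    hypotheses:  {lemma: set of forms the lemma would generate}
    form_counts: {form: number of occurrences in the corpus}
    """
    scores = {h: 1.0 for h in hypotheses}
    for _ in range(iterations):
        new_scores = {h: 0.0 for h in hypotheses}
        # Each attested form shares its count among the hypotheses that
        # explain it, in proportion to their current scores.
        for form, count in form_counts.items():
            explainers = [h for h, forms in hypotheses.items() if form in forms]
            total = sum(scores[h] for h in explainers)
            if total == 0:
                continue
            for h in explainers:
                new_scores[h] += count * scores[h] / total
        # Penalise hypotheses that predict many forms never seen in the corpus.
        for h, forms in hypotheses.items():
            attested = sum(1 for f in forms if f in form_counts)
            new_scores[h] *= attested / len(forms)
        norm = sum(new_scores.values())
        if norm > 0:
            new_scores = {h: s / norm for h, s in new_scores.items()}
        delta = max(abs(new_scores[h] - scores[h]) for h in hypotheses)
        scores = new_scores
        if delta < tol:  # scores no longer change: a fixed point is reached
            break
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_hypotheses(
    {
        "žena": {"žena", "ženy", "žene", "ženu"},
        "žen": {"žen", "žena", "ženu", "ženy"},
        "ženu": {"ženu", "ženua", "ženuu", "ženuy"},
    },
    {"žena": 3, "ženy": 2, "žene": 1, "ženu": 2},
)
```

In this toy input the hypothesis whose forms are all attested ends up first, and the one that predicts mostly unseen forms ends up last.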

Manual validation

The correctness of the best-ranked lemmas from the previous step is checked by a manual validator, who should be a native speaker. At this stage the lemmas are divided into three categories:

  1. valid lemmas, which are added to the lexicon
  2. erroneous lemmas generated from valid forms (the forms are later associated with other lemmas)
  3. erroneous lemmas generated from invalid forms (these need to be excluded)
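The three categories can be modelled as a small enumeration plus a bookkeeping step. This is a minimal sketch; the names `Verdict` and `apply_validation` and the data layout are hypothetical and only illustrate how a validator's decisions might be recorded.

```python
from enum import Enum


class Verdict(Enum):
    VALID = 1                # category 1: lemma goes into the lexicon
    WRONG_VALID_FORMS = 2    # category 2: forms await another lemma
    WRONG_INVALID_FORMS = 3  # category 3: forms are excluded


def apply_validation(verdicts, hypotheses, corpus_forms):
    """Sort validated hypotheses into lexicon, pending forms and excluded forms.

    verdicts:     {lemma: Verdict} as decided by the human validator
    hypotheses:   {lemma: set of forms the lemma would generate}
    corpus_forms: set of forms attested in the corpus
    """
    lexicon, pending, excluded = {}, set(), set()
    for lemma, verdict in verdicts.items():
        attested = hypotheses[lemma] & corpus_forms
        if verdict is Verdict.VALID:
            lexicon[lemma] = hypotheses[lemma]
        elif verdict is Verdict.WRONG_VALID_FORMS:
            pending |= attested   # real words that still need a correct lemma
        else:
            excluded |= attested  # e.g. typos: drop from further iterations
    return lexicon, pending, excluded
```

The pending and excluded sets are what the next iteration would consume: pending forms still need a lemma, while excluded forms should no longer trigger new hypotheses.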

Future development

Compared with purely manual lexicon development, automatic acquisition appears promising for future work, because it requires little validation time and relatively little human labor.

References

  1. Sagot, Benoît. Automatic acquisition of a Slovak Lexicon from a Raw Corpus.