Lexical choice

Lexical choice is the subtask of natural language generation (NLG) that involves choosing the content words (nouns, verbs, adjectives, and adverbs) of a generated text. Function words (determiners, for example) are usually chosen during realisation.

Examples

The simplest type of lexical choice involves mapping a domain concept (perhaps represented in an ontology) to a word. For example, the concept Finger might be mapped to the word finger.
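
As code, this simplest case amounts to a lookup table. Below is a minimal sketch in Python; the lexicon entries and the lexicalise helper are illustrative, not taken from any particular system:

    # Minimal concept-to-word lexicon; entries are illustrative.
    LEXICON = {
        "Finger": "finger",
        "Hand": "hand",
    }

    def lexicalise(concept: str) -> str:
        """Map an ontology concept directly to a single word."""
        return LEXICON[concept]

    print(lexicalise("Finger"))  # prints: finger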

A more complex situation arises when a domain concept is expressed using different words in different situations. For example, the domain concept Value-Change can be expressed in many ways, such as rise or fall and increase or decrease, depending on the direction of the change.

Sometimes words communicate additional contextual information. For example, saying that a value plummeted not only reports a fall, but also suggests that the speaker regards the fall as large or unusually rapid.
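
A minimal sketch of such context-sensitive choice, combining the two points above: the direction of a Value-Change selects between rose and fell, and an unusually large fall selects the more loaded plummeted. The choose_change_verb helper and the 30% cutoff are invented for illustration:

    def choose_change_verb(old_value: float, new_value: float) -> str:
        """Choose a verb lexicalising the Value-Change concept.

        Illustrative sketch: direction selects rose/fell, and a large
        relative fall selects "plummeted"; the 30% cutoff is arbitrary.
        """
        if new_value > old_value:
            return "rose"
        if new_value < old_value:
            # A big relative fall carries the extra connotation of "plummet".
            if old_value != 0 and (old_value - new_value) / abs(old_value) > 0.3:
                return "plummeted"
            return "fell"
        return "remained steady"

    print(f"The temperature {choose_change_verb(20.0, 21.5)}.")   # ... rose.
    print(f"The share price {choose_change_verb(100.0, 60.0)}.")  # ... plummeted.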

Contextual information is especially significant for vague terms such as tall. For example, a 2-metre man is tall, but a 2-metre horse is small.
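
One simple way to model this reference-class dependence is a per-class threshold table, sketched below; the reference classes and cutoffs are invented for illustration and are not from any published model:

    # Illustrative "tall" thresholds in metres; the reference classes
    # and cutoffs are invented for this sketch.
    TALL_THRESHOLDS = {
        "man": 1.85,
        "horse": 2.2,
        "building": 100.0,
    }

    def describe_height(entity_type: str, height_m: float) -> str:
        """Say whether an entity counts as tall relative to its reference class."""
        return "tall" if height_m >= TALL_THRESHOLDS[entity_type] else "small"

    print(describe_height("man", 2.0))    # tall
    print(describe_height("horse", 2.0))  # small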

Linguistic perspective

Lexical choice modules must be informed by linguistic knowledge of how the system's input data maps onto words. This is a question of semantics, but it is also influenced by syntactic factors (such as collocation effects) and pragmatic factors (such as context).

Hence NLG systems need linguistic models of how meaning is mapped to words in the target domain (genre) of the system. Genre tends to be very important: for example, the verb veer has a very specific meaning in weather forecasts (the wind direction is changing in a clockwise direction) which it does not have in general English, and a weather-forecast generator must be aware of this genre-specific meaning.
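
A toy illustration of such a genre-sensitive lexicon, keyed on both concept and genre. The concept names and the general-English fallback wordings are invented for this sketch (veer and back are the genuine forecast terms for clockwise and anticlockwise wind shifts):

    # Hypothetical (concept, genre) -> wording lexicon.
    GENRE_LEXICON = {
        ("WindDirectionChange-Clockwise", "weather-forecast"): "veer",
        ("WindDirectionChange-Clockwise", "general"): "shift clockwise",
        ("WindDirectionChange-Anticlockwise", "weather-forecast"): "back",
        ("WindDirectionChange-Anticlockwise", "general"): "shift anticlockwise",
    }

    def lexicalise_for_genre(concept: str, genre: str) -> str:
        """Prefer the genre-specific wording, falling back to general English."""
        return GENRE_LEXICON.get((concept, genre),
                                 GENRE_LEXICON[(concept, "general")])

    print(lexicalise_for_genre("WindDirectionChange-Clockwise", "weather-forecast"))  # veer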

In some cases there are major differences in how different people use the same word; [1] for example, some people use by evening to mean 6PM and others use it to mean midnight. Psycholinguists have shown that when people speak to each other, they agree on a common interpretation via lexical alignment; [2] this is not something which NLG systems can yet do.

Ultimately, lexical choice must deal with the fundamental issue of how language relates to the non-linguistic world. [3] For example, a system which chose colour terms such as red to describe objects in a digital image would need to know which RGB pixel values could generally be described as red; how this was influenced by visual (lighting, other objects in the scene) and linguistic (other objects being discussed) context; what pragmatic connotations were associated with red (for example, when an apple is called red, it is assumed to be ripe as well as have the colour red); and so forth.
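
A bare-bones sketch of only the first of those requirements (deciding which pixel values can be called red), with invented thresholds and none of the visual or discourse context the paragraph describes:

    def is_describable_as_red(r: int, g: int, b: int) -> bool:
        """Crude, context-free test of whether an RGB value reads as "red".

        The thresholds are invented for illustration; a real system would
        also model lighting, surrounding objects, and discourse context.
        """
        return r > 150 and g < 100 and b < 100

    print(is_describable_as_red(200, 40, 30))    # True: a typical ripe-apple red
    print(is_describable_as_red(120, 120, 120))  # False: grey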

Algorithms and models

A number of algorithms and models have been developed for lexical choice in the research community. [4] For example, Edmonds and Hirst developed a model for choosing between near-synonyms (words with similar core meanings but different connotations). [5] However, such algorithms and models have not been widely used in applied NLG systems, which have instead tended to use quite simple computational models and to invest development effort in linguistic analysis rather than algorithm development.
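
As a rough illustration of the near-synonym problem, one might score candidate words by how closely their connotations match a desired profile. This is a toy scoring scheme, not Edmonds and Hirst's actual model, and the connotation values below are invented:

    # Toy connotation vectors for near-synonyms of "thin"; values invented.
    NEAR_SYNONYMS = {
        "slim":    {"positive": 0.8, "formal": 0.5},
        "slender": {"positive": 0.9, "formal": 0.7},
        "skinny":  {"positive": 0.1, "formal": 0.2},
    }

    def choose_near_synonym(desired: dict) -> str:
        """Pick the candidate whose connotations best match the desired profile."""
        def score(word: str) -> float:
            conns = NEAR_SYNONYMS[word]
            return -sum(abs(conns[k] - v) for k, v in desired.items())
        return max(NEAR_SYNONYMS, key=score)

    print(choose_near_synonym({"positive": 0.9, "formal": 0.6}))  # slender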

Related Research Articles

Ambiguity is a type of uncertainty of meaning in which a phrase, statement, or resolution is not explicitly defined, making several interpretations plausible. It is thus an attribute of any idea or statement whose intended meaning cannot be definitively resolved according to a rule or process with a finite number of steps.

A lexicon is the vocabulary of a language or branch of knowledge. In linguistics, a lexicon is a language's inventory of lexemes. The word lexicon derives from the Greek λεξικόν, neuter of λεξικός, meaning 'of or for words'.

Semantics is the study of meaning, reference, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and computer science.

In logic, fuzzy logic is a form of many-valued logic in which the truth value of variables may be any real number between 0 and 1. It is employed to handle the concept of partial truth, where the truth value may range between completely true and completely false. By contrast, in Boolean logic, the truth values of variables may only be the integer values 0 or 1.

Word-sense disambiguation (WSD) is an open problem in computational linguistics concerned with identifying which sense of a word is used in a sentence. Solutions to this problem affect other areas of language processing, such as discourse analysis, search-engine relevance, anaphora resolution, coherence, and inference.

Lexical semantics, as a subfield of linguistic semantics, is the study of word meanings. It includes the study of how words structure their meaning, how they act in grammar and compositionality, and the relationships between the distinct senses and uses of a word.

Natural-language generation (NLG) is a software process that produces natural language output. While it is widely agreed that the output of any NLG process is text, there is some disagreement on whether the inputs of an NLG system need to be non-linguistic. Common applications of NLG methods include the production of various reports, for example weather and patient reports; image captions; and chatbots.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning 'part (of speech)'.

The language of thought hypothesis (LOTH), sometimes known as thought ordered mental expression (TOME), is a view in linguistics, philosophy of mind and cognitive science, forwarded by American philosopher Jerry Fodor. It describes the nature of thought as possessing "language-like" or compositional structure. On this view, simple concepts combine in systematic ways to build thoughts. In its most basic form, the theory states that thought, like language, has syntax.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In sociolinguistics, a sociolect is a form of language or a set of lexical items used by a socioeconomic class, a profession, an age group or other social group.

In linguistics, a stem is a part of a word responsible for its lexical meaning. The term is used with slightly different meanings depending on the morphology of the language in question. In Athabaskan linguistics, for example, a verb stem is a root that cannot appear on its own, and that carries the tone of the word. Athabaskan verbs typically have two stems in this analysis, each preceded by prefixes.

Cognitive semantics is part of the cognitive linguistics movement. Semantics is the study of linguistic meaning. Cognitive semantics holds that language is part of a more general human cognitive ability, and can therefore only describe the world as people conceive of it. The implication is that different linguistic communities conceive of simple things and processes in the world differently, not necessarily that there is some difference between a person's conceptual world and the real world.

In linguistics, a semantic field is a lexical set of words grouped semantically that refers to a specific subject. The term is also used in anthropology, computational semiotics, and technical exegesis.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called Distributional hypothesis: linguistic items with similar distributions have similar meanings.

In linguistics, realization is the process by which some kind of surface representation is derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis comes to be produced in actual language. Phonemes are often said to be realized by speech sounds. The different sounds that can realize a particular phoneme are called its allophones.

Referring expression generation (REG) is the subtask of natural language generation (NLG) that has received the most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the creation of referring expressions that identify specific entities, called targets.

Linguistics is the scientific study of language. It encompasses the analysis of every aspect of language, as well as the methods for studying and modelling them.

Document structuring is a subtask of natural language generation that involves deciding the order and grouping of sentences in a generated text. It is closely related to the content determination task.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

References

  1. E Reiter and S Sripada (2002). Human Variation and Lexical Choice. Computational Linguistics 28:545-553.
  2. S Brennan and H Clark (1996). Conceptual Pacts and Lexical Choice in Conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition 22:1482-1493.
  3. D Roy and E Reiter (2005). Connecting Language to the World. Artificial Intelligence 167:1-12.
  4. R Perera and P Nand (2015). A Multi-Strategy Approach for Lexicalizing Linked Open Data.
  5. P Edmonds and G Hirst (2002). Near-Synonymy and Lexical Choice. Computational Linguistics 28:105-144.