Lexical choice is the subtask of Natural language generation that involves choosing the content words (nouns, non-auxiliary verbs, adjectives, and adverbs) in a generated text. Function words (determiners, for example) are usually chosen during realisation.
The simplest type of lexical choice involves mapping a domain concept (perhaps represented in an ontology) to a word. For example, the concept Finger might be mapped to the word finger.
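A minimal sketch of this one-to-one case might look as follows; the concept names and lexicon entries are purely illustrative and are not taken from any particular system:

```python
# Minimal sketch of one-to-one lexical choice: each domain concept
# (here just a string naming an ontology class) maps to a single word.
# The concept names and lexicon entries are illustrative only.
LEXICON = {
    "Finger": "finger",
    "Temperature": "temperature",
    "Wind-Speed": "wind speed",
}

def choose_word(concept: str) -> str:
    """Return the content word that realises the given domain concept."""
    return LEXICON[concept]

print(choose_word("Finger"))  # -> "finger"
```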
A more complex situation arises when a domain concept is expressed using different words in different situations. For example, the domain concept Value-Change can be expressed in many ways, such as rose, increased, fell, or dropped, depending on the direction of the change.
Sometimes words also communicate additional contextual information; for example, saying that a value skyrocketed conveys not only that it increased but also that the increase was large and rapid.
Contextual information is especially significant for vague terms such as tall. For example, a 2m tall man is tall, but a 2m tall horse is small.
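The sketch below illustrates how such context-dependent choices might be made; the verbs, thresholds, and entity norms are invented for the example and are not taken from any real system:

```python
# Hypothetical sketch: lexical choice for a Value-Change concept depends on
# the direction and size of the change, and the choice of a vague adjective
# such as "tall" depends on what is typical for the kind of entity described.
# All thresholds and norms below are invented for illustration.

def value_change_verb(old: float, new: float) -> str:
    """Pick a verb expressing a Value-Change, conveying direction and size."""
    change = new - old
    if change > 0:
        return "soared" if change > 10 else "rose"
    if change < 0:
        return "plummeted" if change < -10 else "fell"
    return "remained steady"

# Assumed typical heights (in metres) for each entity class; a 2 m man is
# tall, while a 2 m horse is small relative to the norm for horses.
TYPICAL_HEIGHT = {"man": 1.75, "horse": 2.4}

def size_adjective(entity: str, height: float) -> str:
    """Choose a vague size adjective relative to the norm for the entity."""
    return "tall" if height > TYPICAL_HEIGHT[entity] else "small"

print(value_change_verb(20.0, 8.0))   # -> "plummeted"
print(size_adjective("man", 2.0))     # -> "tall"
print(size_adjective("horse", 2.0))   # -> "small"
```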
Lexical choice modules must be informed by linguistic knowledge of how the system's input data maps onto words. This is a question of semantics, but it is also influenced by syntactic factors (such as collocation effects) and pragmatic factors (such as context).
Hence NLG systems need linguistic models of how meaning is mapped to words in the target domain (genre) of the NLG system. Genre tends to be very important; for example, the verb veer has a very specific meaning in weather forecasts (the wind direction is changing clockwise) which it does not have in general English, and a weather-forecast generator must be aware of this genre-specific meaning.
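As a rough illustration, a genre-aware lexicon can be pictured as a mapping keyed by both concept and genre; the entries below are illustrative assumptions rather than an actual weather-forecast lexicon:

```python
# Hypothetical sketch of a genre-aware lexicon: the same concept can be
# realised differently depending on the genre the system is generating for.
# In a weather-forecast genre "veer" expresses a clockwise change of wind
# direction, a sense it lacks in general English.
GENRE_LEXICON = {
    ("Wind-Direction-Change-Clockwise", "weather-forecast"): "veer",
    ("Wind-Direction-Change-Clockwise", "general"): "shift clockwise",
}

def choose_word(concept: str, genre: str) -> str:
    """Prefer a genre-specific word; fall back to the general-English entry."""
    return GENRE_LEXICON.get((concept, genre),
                             GENRE_LEXICON[(concept, "general")])

print(choose_word("Wind-Direction-Change-Clockwise", "weather-forecast"))  # -> "veer"
print(choose_word("Wind-Direction-Change-Clockwise", "news"))              # -> "shift clockwise"
```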
In some cases there are major differences in how different people use the same word; [1] for example, some people use by evening to mean 6PM and others use it to mean midnight. Psycholinguists have shown that when people speak to each other, they agree on a common interpretation via lexical alignment; [2] this is not something which NLG systems can yet do.
Ultimately, lexical choice must deal with the fundamental issue of how language relates to the non-linguistic world. [3] For example, a system which chose colour terms such as red to describe objects in a digital image would need to know which RGB pixel values could generally be described as red; how this was influenced by visual context (lighting, other objects in the scene) and linguistic context (other objects being discussed); what pragmatic connotations were associated with red (for example, when an apple is called red, it is assumed to be ripe as well as red in colour); and so forth.
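A deliberately naive sketch of the first part of this problem, a context-free test of whether an RGB value can be called red, might look as follows; the thresholds are invented for illustration:

```python
# Very rough sketch of the grounding problem for colour terms: deciding
# whether an RGB pixel value "counts as" red. The thresholds are invented;
# a real system would also need to model lighting, surrounding objects,
# and the pragmatic connotations mentioned above.
def is_red(r: int, g: int, b: int) -> bool:
    """Naive, context-free test of whether an RGB value can be called red."""
    return r > 150 and g < 100 and b < 100

print(is_red(200, 40, 30))    # True: a typical ripe-apple red
print(is_red(120, 120, 120))  # False: grey
```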
A number of algorithms and models have been developed for lexical choice in the research community, [4] for example Edmonds developed a model for choosing between near-synonyms (words with similar core meanings but different connotations). [5] However, such algorithms and models have not been widely used in applied NLG systems; such systems have instead often used quite simple computational models and invested development effort in linguistic analysis rather than algorithm development.
Ambiguity is a type of meaning in which a phrase, statement, or resolution is not explicitly defined, making several interpretations plausible; others describe it as a concept or statement that has no real reference. A common aspect of ambiguity is uncertainty. It is thus an attribute of any idea or statement whose intended meaning cannot be definitively resolved according to a rule or process with a finite number of steps.
A lexicon is the vocabulary of a language or branch of knowledge. In linguistics, a lexicon is a language's inventory of lexemes. The word lexicon derives from the Greek word λεξικόν, neuter of λεξικός meaning 'of or for words'.
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
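A toy sketch in the spirit of the Lesk algorithm, which picks the sense whose dictionary gloss shares the most words with the surrounding context, is shown below; the two-sense inventory for bank is invented for the example:

```python
# Toy word-sense disambiguation in the spirit of the Lesk algorithm: choose
# the sense whose gloss overlaps most with the context words. The two-sense
# inventory for "bank" is invented for illustration.
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or lake",
}

def disambiguate(word_senses: dict, context: str) -> str:
    """Return the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    return max(word_senses,
               key=lambda s: len(context_words & set(word_senses[s].split())))

print(disambiguate(SENSES, "she sat on the bank of the river and watched the water"))
# -> "bank/river"
```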
Lexical semantics, as a subfield of linguistic semantics, is the study of word meanings. It includes the study of how words structure their meaning, how they act in grammar and compositionality, and the relationships between the distinct senses and uses of a word.
Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information".
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.
In linguistics, the minimalist program is a major line of inquiry that has been developing inside generative grammar since the early 1990s, starting with a 1993 paper by Noam Chomsky.
The language of thought hypothesis (LOTH), sometimes known as thought ordered mental expression (TOME), is a view in linguistics, philosophy of mind and cognitive science, forwarded by American philosopher Jerry Fodor. It describes the nature of thought as possessing "language-like" or compositional structure. On this view, simple concepts combine in systematic ways to build thoughts. In its most basic form, the theory states that thought, like language, has syntax.
A paraphrase or rephrase is the rendering of the same text in different words without losing the meaning of the text itself. A paraphrased text can sometimes convey its meaning more clearly than the original wording. In other words, it is a copy of the text in meaning, expressed in different words from the original. For example, when someone tells a story they heard, in their own words, they paraphrase, with the meaning being the same. The term itself is derived via Latin paraphrasis, from Ancient Greek παράφρασις (paráphrasis) 'additional manner of expression'. The act of paraphrasing is also called paraphrasis.
In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
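For example, part-of-speech tagging can be performed with the NLTK toolkit, assuming the package and its tokenizer and tagger resources have been downloaded beforehand:

```python
# Example of part-of-speech tagging with NLTK. Requires the nltk package and
# its tokenizer and tagger resources, e.g. nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger"); resource names vary slightly
# across NLTK versions.
import nltk

tokens = nltk.word_tokenize("John loves the small dog")
print(nltk.pos_tag(tokens))
# Expected output (tags may vary slightly with the tagger model):
# [('John', 'NNP'), ('loves', 'VBZ'), ('the', 'DT'), ('small', 'JJ'), ('dog', 'NN')]
```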
Cognitive semantics is part of the cognitive linguistics movement. Semantics is the study of linguistic meaning. Cognitive semantics holds that language is part of a more general human cognitive ability, and can therefore only describe the world as people conceive of it. Implicit in this is the claim that different linguistic communities conceive of simple things and processes in the world differently, not necessarily that there is a difference between a person's conceptual world and the real world.
In linguistics, a semantic field is a lexical set of words grouped semantically that refers to a specific subject. The term is also used in anthropology, computational semiotics, and technical exegesis.
Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
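As a toy illustration of the distributional hypothesis, words can be represented by counts of the contexts they occur in, with semantic similarity estimated as the cosine of the angle between the count vectors; the tiny co-occurrence counts below are invented for the example:

```python
# Toy distributional semantics: words are represented by co-occurrence count
# vectors, and similarity of meaning is estimated with cosine similarity.
# The counts below are invented for illustration.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Context features: counts of co-occurrence with ("drink", "engine", "road")
vectors = {
    "coffee": [10, 0, 1],
    "tea":    [8, 1, 0],
    "car":    [0, 9, 12],
}

print(cosine(vectors["coffee"], vectors["tea"]))  # high: similar distributions
print(cosine(vectors["coffee"], vectors["car"]))  # low: dissimilar distributions
```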
In linguistics, realization is the process by which some kind of surface representation is derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis comes to be produced in actual language. Phonemes are often said to be realized by speech sounds. The different sounds that can realize a particular phoneme are called its allophones.
Referring expression generation (REG) is the subtask of natural language generation (NLG) that has received the most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the creation of referring expressions that identify specific entities called targets.
Linguistics is the scientific study of language. The areas of linguistic analysis are syntax, semantics (meaning), morphology, phonetics, phonology, and pragmatics. Subdisciplines such as biolinguistics and psycholinguistics bridge many of these divisions.
Document Structuring is a subtask of Natural language generation that involves deciding the order and grouping of sentences in a generated text. It is closely related to the Content determination NLG task.
Rule-based machine translation (RBMT) refers to machine translation systems based on linguistic information about source and target languages, basically retrieved from dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language. Given input sentences, an RBMT system transforms them into output sentences on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in a concrete translation task. RBMT has been progressively superseded by more efficient methods, particularly neural machine translation.
In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
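For example, the Porter stemmer as implemented in NLTK conflates related inflected forms to a common stem, assuming the nltk package is installed:

```python
# Example of stemming with the Porter stemmer from NLTK (requires the nltk
# package); related inflected forms are conflated to the same stem, which
# need not be a valid word itself.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connections", "connected", "connecting"]:
    print(word, "->", stemmer.stem(word))
# All four forms conflate to the stem "connect".
```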
The outline of natural-language processing provides an overview of and topical guide to the field.