Referring expression generation (REG) is the subtask of natural language generation (NLG) that received most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the creation of referring expressions (noun phrases) that identify specific entities called targets.
This task can be split into two sections. The content selection part determines which set of properties distinguish the intended target and the linguistic realization part defines how these properties are translated into natural language. A variety of algorithms have been developed in the NLG community to generate different types of referring expressions.
A referring expression (RE), in linguistics, is any noun phrase, or surrogate for a noun phrase, whose function in discourse is to identify some individual object (thing, being, event...) The technical terminology for identify differs a great deal from one school of linguistics to another. The most widespread term is probably refer, and a thing identified is a referent, as for example in the work of John Lyons. In linguistics, the study of reference relations belongs to pragmatics, the study of language use, though it is also a matter of great interest to philosophers, especially those wishing to understand the nature of knowledge, perception and cognition more generally.
Various devices can be used for reference: determiners, pronouns, proper names... Reference relations can be of different kinds; referents can be in a "real" or imaginary world, in discourse itself, and they may be singular, plural, or collective.
The simplest type of referring expressions are pronoun such as he and it. The linguistics and natural language processing communities have developed various models for predicting anaphor referents, such as centering theory, [1] and ideally referring-expression generation would be based on such models. However most NLG systems use much simpler algorithms, for example using a pronoun if the referent was mentioned in the previous sentence (or sentential clause), and no other entity of the same gender was mentioned in this sentence.
There has been a considerable amount of research on generating definite noun phrases, such as the big red book. Much of this builds on the model proposed by Dale and Reiter. [2] This has been extended in various ways, for example Krahmer et al. [3] present a graph-theoretic model of definite NP generation with many nice properties. In recent years a shared-task event has compared different algorithms for definite NP generation, using the TUNA [4] corpus.
Recently there has been more research on generating referring expressions for time and space. Such references tend to be imprecise (what is the exact meaning of tonight?), and also to be interpreted in different ways by different people. [5] Hence it may be necessary to explicitly reason about false positive vs false negative tradeoffs, and even calculate the utility of different possible referring expressions in a particular task context. [6]
Ideally, a good referring expression should satisfy a number of criteria:
REG goes back to the early days of NLG. One of the first approaches was done by Winograd [7] in 1972 who developed an "incremental" REG algorithm for his SHRDLU program. Afterwards researchers started to model the human abilities to create referring expressions in the 1980s. This new approach to the topic was influenced by the researchers Appelt and Kronfeld who created the programs KAMP and BERTRAND [8] [9] [10] and considered referring expressions as parts of bigger speech acts.
Some of their most interesting findings were the fact that referring expressions can be used to add information beyond the identification of the referent [9] as well as the influence of communicative context and the Gricean maxims on referring expressions. [8] Furthermore, its skepticism concerning the naturalness of minimal descriptions made Appelt and Kronfeld's research a foundation of later work on REG.
The search for simple, well-defined problems changed the direction of research in the early 1990s. This new approach was led by Dale and Reiter who stressed the identification of the referent as the central goal. [11] [12] [13] [14] Like Appelt [8] they discuss the connection between the Gricean maxims and referring expressions in their culminant paper [2] in which they also propose a formal problem definition. Furthermore, Reiter and Dale discuss the Full Brevity and Greedy Heuristics algorithms as well as their Incremental Algorithm(IA) which became one of the most important algorithms in REG. [note 1]
After 2000 the research began to lift some of the simplifying assumptions, that had been made in early REG research in order to create more simple algorithms. Different research groups concentrated on different limitations creating several expanded algorithms. Often these extend the IA in a single perspective for example in relation to:
Many simplifying assumptions are still in place or have just begun to be worked on. Also a combination of the different extensions has yet to be done and is called a "non-trivial enterprise" by Krahmer and van Deemter. [33]
Another important change after 2000 was the increasing use of empirical studies in order to evaluate algorithms. This development took place due to the emergence of transparent corpora. Although there are still discussions about what the best evaluation metrics are, the use of experimental evaluation has already led to a better comparability of algorithms, a discussion about the goals of REG and more task-oriented research.
Furthermore, research has extended its range to related topics such as the choice of Knowledge Representation(KR) Frameworks. In this area the main question, which KR framework is most suitable for the use in REG remains open. The answer to this question depends on how well descriptions can be expressed or found. A lot of the potential of KR frameworks has been left unused so far.
Some of the different approaches are the usage of:
Dale and Reiter (1995) think about referring expressions as distinguishing descriptions.
They define:
Each entity in the domain can be characterised as a set of attribute–value pairs for example type, dog, gender, female or age, 10 years.
The problem then is defined as follows:
Let be the intended referent, and be the contrast set. Then, a set of attribute–value pairs will represent a distinguishing description if the following two conditions hold:
In other words, to generate a referring expression one is looking for a set of properties that apply to the referent but not to the distractors. [2]
The problem could be easily solved by conjoining all the properties of the referent which often leads to long descriptions violating the second Gricean Maxim of Quantity. Another approach would be to find the shortest distinguishing description like the Full Brevity algorithm does. Yet in practice it is most common to instead include the condition that referring expressions produced by an algorithm should be as similar to human-produced ones as possible although this is often not explicitly mentioned. [note 1]
The Full Brevity algorithm always finds a minimal distinguishing description meaning there is no shorter distinguishing description in regard to properties used.
Therefore, it iterates over and checks every description of a length of properties until a distinguishing description is found.
Two problems arise from this way of creating referring expressions. Firstly the algorithm has a high complexity meaning it is NP-hard which makes it impractical to use. [40] Secondly human speakers produce descriptions that are not minimal in many situations. [41] [42] [43] [44] [note 1]
The Greedy Heuristics algorithm [11] [12] approximates the Full Brevity algorithm by iteratively adding the most distinguishing property to the description. The most distinguishing property means the property that rules out most of the remaining distractors. The Greedy Heuristics algorithm is more efficient than the Full Brevity algorithm. [note 1]
Dale and Reiter(1995) [2] present the following algorithm for the Greedy Heuristic:
Let be the set of properties to be realised in our description; let be the set of properties known to be true of our intended referent (we assume that is non-empty); and let be the set of distractors (the contrast set). The initial conditions are thus as follows:
all distractors; all properties true of ;
In order to describe the intended referent with respect to the contrast set , we do the following:
1. Check Success: ifthen return as a distinguishing description elseifthen fail elsegoto Step 2. 2. Choose Property: for eachdo: Chosen property is , where is the smallest set. goto Step 3. 3. Extend Description (wrt the chosen ): goto Step 1.
The Incremental Algorithm (IA) by Dale and Reiter [2] was the most influential algorithm before 2000. It is based on the idea of a preferential order of attributes or properties that speakers go by. So in order to run the Incremental Algorithm, first a preference order of attributes has to be given. Now the algorithm follows that order and adds those properties to the description which rule out any remaining distractors. Furthermore, Dale and Reiter [2] stress the attribute type which is always included in their descriptions even if it does not rule out any distractors.
Also the type values are part of a subsumption hierarchy including some basic level values. For example, in the pet domain chihuahua is subsumed by dog and dog by animal. Because dog is defined as a basic level dog would be preferred by the algorithms, if chihuahua does not rule out any distractors.
The Incremental Algorithm is easy to implement and also computationally efficient running in polynomial time. The description generated by the IA can contain redundant properties that are superfluous because of later added properties. The creators do not consider this as a weakness, but rather as making the expressions less "psycholinguistically implausible". [2]
The following algorithm is a simplified version of Dale and Reiter's Incremental Algorithm [2] by Krahmer and van Deemter [33] that takes as input the referent r, the D containing a collection of domain objects and a domain-specific ordered list Pref of preferred attributes. In the notation L is the description, C the context set of distractors and the function RulesOut(⟨Ai, V⟩) returns the set of objects which have a value different to V for attribute Ai.
IncrementalAlgorithm ({r}, D, Pref) L ← ∅ C ← D - {r} for each Ai in list Pref do V = Value(r, Ai) if C ∩ RulesOut(⟨Ai, V⟩) ≠ ∅ then L ← L ∪ {⟨Ai, V⟩} C ← C - RulesOut(⟨Ai, V⟩) endifif C = ∅ thenreturn L endifreturn failure [note 1]
Before 2000 evaluation of REG systems has been of theoretical nature like the one done by Dale and Reiter. [2] More recently, empirical studies have become popular which are mostly based on the assumption that the generated expressions should be similar to human-produced ones. Corpus-based evaluation began quite late in REG due to a lack of suitable data sets. Still corpus-based evaluation is the most dominant method at the moment though there is also evaluation by human judgement. [note 1]
First the distinction between text corpora and experimental corpora has to be made. Text corpora like the GNOME corpus [1] can contain texts from all kind of domains. In REG they are used to evaluate the realization part of algorithms. The content selection part of REG on the other hand requires a corpus that contains the properties of all domain objects as well as the properties used in references. Typically those fully "semantically transparent" [45] created in experiments using simple and controlled settings.
These experimental corpora once again can be separated into General-Purpose Corpora that were collected for another purpose but have been analysed for referring expressions and Dedicated Corpora that focus specifically on referring expressions. Examples of General-Purpose Corpora are the Pear Stories, [46] the Map Task corpus [47] or the Coconut corpus [48] while the Bishop corpus, [49] the Drawer corpus [50] and the TUNA corpus [51] count to the Dedicated Corpora. The TUNA corpus which contains web-collected data on the two domains furniture and people has been used in three shared REG challenges already. [note 1]
To measure the correspondence between corpora and the results of REG algorithms several Metrics have been developed.
To measure the content selection part the Dice coefficient [52] or the MASI (Measuring Agreement on Set-valued Items) [53] metric are used. These measure the overlap of properties in two descriptions. In an evaluation the scores are usually averaged over references made by different human participants in the corpus. Also sometimes a measure called Perfect Recall Percentage (PRP) [51] or Accuracy [54] is used which calculates the percentage of perfect matches between an algorithm-produced and a human-produced reference.
For the linguistic realization part of REG the overlap between strings has been measured using metrics like BLEU [55] or NIST. [56] A problem that occurs with string-based metrics is that for example "The small monkey" is measured closer to "The small donkey" than to "The little monkey".
A more time consuming way to evaluate REG algorithms is by letting humans judge the Adequacy (How clear is the description?) and Fluency (Is the description given in good and clear English?) of the generated expression. Also Belz and Gatt [57] evaluated referring expressions using an experimental setup. The participants get a generated description and then have to click on the target. Here the extrinsic metrics reading time, identification time and error rate could be evaluated. [note 1]
Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The large collections of text allow linguistics to run quantitative analyses on linguistic concepts, otherwise harder to quantify.
In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.
Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.
Natural language generation (NLG) is a software process that produces natural language output. A widely-cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems than can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information".
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.
Dr. Hermann Moisl is a retired senior lecturer and visiting fellow in Linguistics at Newcastle University. He was educated at various institutes, including Trinity College Dublin and the University of Oxford.
The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
Statistical machine translation (SMT) was a machine translation approach, that superseded the previous, rule-based approach because it required explicit description of each and every linguistic rule, which was costly, and which often did not generalize to other languages. Since 2003, the statistical approach itself has been gradually superseded by the deep learning-based neural network approach.
In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.
Linguistic categories include
Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
The knowledge acquisition bottleneck is perhaps the major impediment to solving the word-sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises.
In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text.
Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.
Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word and their usage in context. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.
Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.