Aggregation (linguistics)

In linguistics, aggregation is a subtask of natural language generation that involves merging syntactic constituents (such as sentences and phrases). Aggregation can sometimes also be performed at a conceptual level.

Examples

A simple example of syntactic aggregation is merging the two sentences John went to the shop and John bought an apple into the single sentence John went to the shop and bought an apple.
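As a purely illustrative sketch, this merge can be mimicked with naive string manipulation. The method below is a hypothetical toy, not a real NLG algorithm; actual systems operate on syntactic structures rather than raw strings.

    // Toy sketch of syntactic aggregation: if two sentences begin with the
    // same subject, drop the repeated subject and join the predicates with "and".
    // Hypothetical illustration only; real systems work on syntax trees.
    static String aggregate(String s1, String s2, String subject) {
        if (s1.startsWith(subject + " ") && s2.startsWith(subject + " ")) {
            String secondPredicate = s2.substring(subject.length() + 1);
            return s1 + " and " + secondPredicate;
        }
        return s1 + ". " + s2; // not safely aggregable: keep the sentences separate
    }

Calling aggregate("John went to the shop", "John bought an apple", "John") returns John went to the shop and bought an apple.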

Syntactic aggregation can be much more complex than this. For example, aggregation can embed one of the constituents in the other; e.g., we can aggregate John went to the shop and The shop was closed into the sentence John went to the shop, which was closed.
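The embedding case can be sketched in the same deliberately naive style (again, a hypothetical illustration): when the first sentence ends with the noun phrase that the second sentence is about, the second sentence's predicate can be attached as a relative clause.

    // Toy sketch of embedding: "John went to the shop" + "The shop was closed"
    // becomes "John went to the shop, which was closed".
    static String embed(String s1, String s2, String sharedNounPhrase) {
        String np = sharedNounPhrase.toLowerCase();
        if (s1.toLowerCase().endsWith(np) && s2.toLowerCase().startsWith(np + " ")) {
            String secondPredicate = s2.substring(sharedNounPhrase.length() + 1);
            return s1 + ", which " + secondPredicate;
        }
        return s1 + ". " + s2;
    }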

From a pragmatic perspective, aggregating sentences together often suggests to the reader that these sentences are related to each other. If this is not the case, the reader may be confused. For example, someone who reads John went to the shop and bought an apple may infer that the apple was bought in the shop; if this is not the case, then these sentences should not be aggregated.

Algorithms and issues

Aggregation algorithms must do two things: decide when constituents should be aggregated, and decide how the aggregation should be carried out.

The first issue, deciding when to aggregate, is poorly understood. Aggregation decisions certainly depend on the semantic relations between the constituents, as mentioned above; they also depend on the genre (e.g., bureaucratic texts tend to be more aggregated than instruction manuals). They probably should also depend on rhetorical and discourse structure.[1] The literacy level of the reader is probably important as well, since poor readers need shorter sentences.[2] However, there is no integrated model that brings all these factors together into a single algorithm.
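None of this amounts to an algorithm, but the shape of the decision can be sketched. The following heuristic is entirely hypothetical: the Sentence record, the genre flag, and the literacy thresholds are invented for illustration and do not come from any published system.

    // Hypothetical "when to aggregate" heuristic combining the factors above:
    // semantic relatedness, genre, and the reader's literacy level.
    record Sentence(String subject, String topic) {}

    class AggregationPolicy {
        private final boolean bureaucraticGenre; // genre that tolerates heavy aggregation
        private final int readerLevel;           // invented scale: 1 = low literacy, 3 = skilled

        AggregationPolicy(boolean bureaucraticGenre, int readerLevel) {
            this.bureaucraticGenre = bureaucraticGenre;
            this.readerLevel = readerLevel;
        }

        boolean shouldAggregate(Sentence a, Sentence b) {
            // Low-literacy readers get short, unaggregated sentences.
            if (readerLevel <= 1) return false;
            // Only aggregate semantically related material (crudely approximated
            // here by a shared subject and topic), so readers are not misled.
            boolean related = a.subject().equals(b.subject()) && a.topic().equals(b.topic());
            if (!related) return false;
            // Genre shifts the threshold: bureaucratic prose aggregates freely;
            // otherwise require a skilled reader.
            return bureaucraticGenre || readerLevel >= 3;
        }
    }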

With regard to the second issue, there have been some studies of different types of aggregation and how they should be carried out. Harbusch and Kempen describe several syntactic aggregation strategies; in their terminology, John went to the shop and bought an apple is an example of forward conjunction reduction, in which material repeated from the first conjunct (here the subject John) is elided from the second.[3] Much less is known about conceptual aggregation. Di Eugenio et al. show how conceptual aggregation can be done in an intelligent tutoring system, and demonstrate that performing such aggregation makes the system more effective (and that conceptual aggregation makes a bigger impact than syntactic aggregation).[4]

Software

Unfortunately, not much software is available for performing aggregation. However, the SimpleNLG system[5] includes limited support for basic aggregation. For example, the following code causes SimpleNLG to print The man is hungry and buys an apple.

    // Build two clauses that share a subject ("the man").
    SPhraseSpec s1 = nlgFactory.createClause("the man", "be", "hungry");
    SPhraseSpec s2 = nlgFactory.createClause("the man", "buy", "an apple");
    // Aggregate them: the shared subject is realised only once.
    NLGElement result = new ClauseCoordinationRule().apply(s1, s2);
    System.out.println(realiser.realiseSentence(result));
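The snippet assumes that a lexicon, an NLGFactory, and a Realiser have already been created. The setup below follows the standard SimpleNLG v4 tutorial pattern; package names may differ in other versions.

    import simplenlg.aggregation.ClauseCoordinationRule;
    import simplenlg.framework.NLGElement;
    import simplenlg.framework.NLGFactory;
    import simplenlg.lexicon.Lexicon;
    import simplenlg.phrasespec.SPhraseSpec;
    import simplenlg.realiser.english.Realiser;

    // Standard SimpleNLG setup: a default English lexicon, a factory for building
    // phrase specifications, and a realiser that turns them into surface text.
    Lexicon lexicon = Lexicon.getDefaultLexicon();
    NLGFactory nlgFactory = new NLGFactory(lexicon);
    Realiser realiser = new Realiser(lexicon);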


References

  1. D. Scott and C. de Souza (1990). Getting the Message Across in RST-based Text Generation. In Dale et al. (eds), Current Research in Natural Language Generation. Academic Press.
  2. S. Williams and E. Reiter (2008). Generating basic skills reports for low-skilled readers. Natural Language Engineering 14:495-535.
  3. K. Harbusch and G. Kempen (2009). Generating clausal coordinate ellipsis multilingually: A uniform approach based on postediting. In Proc. of ENLG-2009, 28:105-144.
  4. B. Di Eugenio, D. Fossati, and D. Yu (2005). Aggregation improves learning: experiments in natural language generation for intelligent tutoring systems. In Proc. of ACL-2005, pp. 50-57.
  5. A. Gatt and E. Reiter (2009). SimpleNLG: A realisation engine for practical applications. In Proc. of ENLG-2009.