Content determination

Content determination is the subtask of natural language generation (NLG) that involves deciding on the information to be communicated in a generated text. It is closely related to the task of document structuring.

Example

Consider an NLG system which summarises information about sick babies.[1] Suppose this system has four pieces of information it can communicate:

  1. The baby is being given morphine via an IV drip
  2. The baby's heart rate shows bradycardias (temporary drops)
  3. The baby's temperature is normal
  4. The baby is crying

Which of these bits of information should be included in the generated texts?
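
In an NLG pipeline, candidate information like this is typically represented as structured data before content determination runs. The following is a minimal sketch of such a representation in Python; the class name, fields, and scores are invented for illustration and are not taken from the system described above.

    from dataclasses import dataclass

    @dataclass
    class Message:
        """A candidate piece of information the system could express."""
        content: str        # what the message would say
        importance: float   # hypothetical priority score
        unexpected: bool    # does it deviate from what the reader expects?

    # The four candidate messages from the neonatal example above.
    candidates = [
        Message("baby receiving morphine via an IV drip", 0.7, True),
        Message("heart rate shows bradycardias",          0.9, True),
        Message("temperature is normal",                  0.2, False),
        Message("baby is crying",                         0.4, True),
    ]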

Issues

There are three general issues which almost always impact the content determination task, and can be illustrated with the above example.

Perhaps the most fundamental issue is the communicative goal of the text, i.e. its purpose and reader. In the above example, for instance, a doctor who wants to make a decision about medical treatment would probably be most interested in the heart rate bradycardias, while a parent who wants to know how her child is doing would probably be more interested in the fact that the baby is being given morphine and is crying.

The second issue is the size and level of detail of the generated text. For instance, a short summary sent to a doctor as a 160-character SMS message might mention only the heart rate bradycardias, while a longer summary printed out as a multipage document might also mention that the baby is on a morphine IV drip.

The final issue is how unusual and unexpected the information is. For example, neither doctors nor parents would place a high priority on being told that the baby's temperature was normal, if they expected this to be the case.

Regardless of these issues, content determination is very important to users; indeed, in many cases the quality of content determination is the most important factor (from the user's perspective) in the overall quality of the generated text.
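
These three issues can be combined into a simple selection procedure. The sketch below is purely illustrative (the relevance scores, the down-weighting of expected information, and the length limits are all invented) but shows how reader, length, and unexpectedness might jointly determine content.

    candidates = [
        # content, relevance to doctor, relevance to parent, unexpected?
        ("heart rate shows bradycardias", 0.9, 0.3, True),
        ("baby on a morphine IV drip",    0.5, 0.8, True),
        ("baby is crying",                0.2, 0.7, True),
        ("temperature is normal",         0.1, 0.1, False),
    ]

    def select_content(reader: str, max_items: int) -> list[str]:
        """Pick the messages most relevant to this reader, up to a length limit."""
        scored = []
        for content, doctor_rel, parent_rel, unexpected in candidates:
            relevance = doctor_rel if reader == "doctor" else parent_rel
            # Expected information (e.g. a normal temperature) is down-weighted.
            score = relevance if unexpected else relevance * 0.3
            scored.append((score, content))
        scored.sort(reverse=True)
        return [content for _, content in scored[:max_items]]

    # A short SMS to a doctor mentions only the bradycardias;
    # a longer message to a parent can include several items.
    print(select_content("doctor", max_items=1))
    print(select_content("parent", max_items=3))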

Techniques

There are three basic approaches to content determination: schemas (content templates), statistical approaches, and explicit reasoning.

Schemas[2] are templates which explicitly specify the content of a generated text (as well as document structuring information). Typically they are constructed by manually analysing a corpus of human-written texts in the target genre, and extracting a content template from these texts. Schemas work well in practice in domains where content is somewhat standardised, but work less well in domains where content is more fluid (such as the medical example above).
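
As an illustration, a schema can be thought of as a fixed, ordered list of content slots that the generator fills in the same way for every text in the genre. The sketch below uses an invented weather-report schema rather than McKeown's original formalism.

    # A hypothetical schema: an ordered list of content slots. Content
    # determination simply fills whichever slots the input data can fill.
    WEATHER_REPORT_SCHEMA = [
        "overall_conditions",   # e.g. "It will be cloudy."
        "temperature_range",    # e.g. "Temperatures between 10 and 15 C."
        "precipitation",        # e.g. "Rain is expected in the evening."
    ]

    def determine_content(data: dict) -> list[str]:
        return [data[slot] for slot in WEATHER_REPORT_SCHEMA if slot in data]

    report_data = {
        "overall_conditions": "It will be cloudy.",
        "temperature_range": "Temperatures between 10 and 15 C.",
        # no precipitation message today, so that slot is simply skipped
    }
    print(determine_content(report_data))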

Statistical techniques use statistical corpus analysis to automatically determine the content of the generated texts. Such work is in its infancy, and has mostly been applied to contexts where the communicative goal, reader, size, and level of detail are fixed, for example the generation of newswire summaries of sporting events.[3][4]
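
A minimal sketch of this idea, with invented corpus counts: each type of fact is scored by how often human-written summaries in a training corpus included it, and only fact types above a threshold are selected.

    # Hypothetical counts: how often each fact type appeared in 100
    # human-written match summaries used as training data.
    inclusion_counts = {
        "final_score": 98,
        "scorer_of_winning_goal": 71,
        "attendance": 12,
        "weather_at_kickoff": 3,
    }
    TOTAL_SUMMARIES = 100
    THRESHOLD = 0.5   # include fact types chosen in at least half the summaries

    selected = [fact for fact, count in inclusion_counts.items()
                if count / TOTAL_SUMMARIES >= THRESHOLD]
    print(selected)   # ['final_score', 'scorer_of_winning_goal']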

Explicit reasoning approaches have probably attracted the most attention from researchers. The basic idea is to use AI reasoning techniques (such as knowledge-based rules,[1] planning,[5] pattern detection,[6] case-based reasoning,[7] etc.) to examine the information available to be communicated (including how unusual/unexpected it is), the communicative goal and reader, and the characteristics of the generated text (including target size), and decide on the optimal content for the generated text. A very wide range of techniques has been explored, but there is no consensus as to which is most effective.
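
For instance, a knowledge-based-rule approach encodes domain expertise as explicit conditions on the input data and the reader. The rules below are invented for illustration and are far simpler than those used in a real system such as the one described in [1].

    def content_rules(observations: dict, reader: str) -> list[str]:
        """Hand-written rules deciding which messages must be included."""
        selected = []
        # Rule 1: clinically significant events are always reported to doctors.
        if observations.get("bradycardia_count", 0) > 0 and reader == "doctor":
            selected.append("report the bradycardias, with count and times")
        # Rule 2: interventions such as morphine are reported to parents.
        if observations.get("on_morphine") and reader == "parent":
            selected.append("report that the baby is receiving morphine")
        # Rule 3: normal readings are included only if the text has room for them.
        if observations.get("temperature_normal") and observations.get("long_report"):
            selected.append("report that the temperature is normal")
        return selected

    obs = {"bradycardia_count": 3, "on_morphine": True, "temperature_normal": True}
    print(content_rules(obs, reader="doctor"))
    print(content_rules(obs, reader="parent"))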

References

  1. Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, Sykes C (2009). "Automatic Generation of Textual Summaries from Neonatal Intensive Care Data". Artificial Intelligence. 173 (7–8): 789–816. doi:10.1016/j.artint.2008.12.002.
  2. McKeown K (1985). Text Generation. Cambridge University Press.
  3. Barzilay R, Lapata M (2005). "Collective content selection for concept-to-text generation". Proceedings of EMNLP-2005.
  4. Perera R, Nand P (2014). "The Role of Linked Data in Content Selection". Proceedings of PRICAI-2014.
  5. Moore J, Paris C (1993). "Planning Text for Advisory Dialogues: Capturing Intentional and Rhetorical Information". Computational Linguistics. 19: 651–694.
  6. Yu J, Reiter E, Hunter J, Mellish C (2007). "Choosing the content of textual summaries of large time-series data sets". Natural Language Engineering. 13: 25–49.
  7. Gervás P, Díaz-Agudo B, Peinado F, Hervás R (2005). "Story plot generation based on CBR". Knowledge-Based Systems. 18: 235–242.