Document structuring

Document structuring is a subtask of natural language generation that involves deciding the order and grouping (for example, into paragraphs) of sentences in a generated text. It is closely related to the content determination NLG task.

Example

Assume we have four sentences which we want to include in a generated text:

  1. It will rain on Saturday
  2. It will be sunny on Sunday
  3. Max temperature will be 10 °C on Saturday
  4. Max temperature will be 15 °C on Sunday

There are 24 (4!) orderings of these messages, including

  (1234) It will rain on Saturday. It will be sunny on Sunday. Max temperature will be 10 °C on Saturday. Max temperature will be 15 °C on Sunday.

  (2314) It will be sunny on Sunday. Max temperature will be 10 °C on Saturday. It will rain on Saturday. Max temperature will be 15 °C on Sunday.

  (4321) Max temperature will be 15 °C on Sunday. Max temperature will be 10 °C on Saturday. It will be sunny on Sunday. It will rain on Saturday.

Some of these orderings are better than others. For example, of the texts shown above, human readers prefer (1234) over (2314) and (4321).

For any ordering, there are also many ways in which sentences can be grouped into paragraphs and higher-level structures such as sections. For example, there are 8 (2³) ways in which the sentences in (1234) can be grouped into paragraphs, including

(12)(34):

  It will rain on Saturday. It will be sunny on Sunday.

  Max temperature will be 10 °C on Saturday. Max temperature will be 15 °C on Sunday.

(1)(23)(4):

  It will rain on Saturday.

  It will be sunny on Sunday. Max temperature will be 10 °C on Saturday.

  Max temperature will be 15 °C on Sunday.

As with ordering, human readers prefer some groupings over others; for example, (12)(34) is preferred over (1)(23)(4).

The document structuring task is to choose an ordering and grouping of sentences which results in a coherent and well-organised text from the reader's perspective.
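The combinatorics above are easy to check programmatically. The following Python sketch is illustrative only (the groupings helper is invented for this example); it enumerates the 24 orderings and, for a fixed ordering, the 8 paragraph groupings:

    from itertools import permutations, product

    sentences = [
        "It will rain on Saturday.",
        "It will be sunny on Sunday.",
        "Max temperature will be 10 °C on Saturday.",
        "Max temperature will be 15 °C on Sunday.",
    ]

    # All 4! = 24 possible orderings of the four sentences.
    print(len(list(permutations(sentences))))  # 24

    # For a fixed ordering there are 3 gaps between consecutive sentences;
    # each gap either starts a new paragraph or does not, giving 2**3 = 8
    # possible paragraph groupings.
    def groupings(ordered):
        for breaks in product([False, True], repeat=len(ordered) - 1):
            paragraphs, current = [], [ordered[0]]
            for sentence, brk in zip(ordered[1:], breaks):
                if brk:
                    paragraphs.append(current)
                    current = []
                current.append(sentence)
            paragraphs.append(current)
            yield paragraphs

    print(len(list(groupings(sentences))))  # 8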

Algorithms and models

There are three basic approaches to document structuring: schemas, corpus-based, and heuristic.

Schemas [1] are templates which explicitly specify the ordering and grouping of sentences in a document (as well as content determination information). Typically they are constructed by manually analysing a corpus of human-written texts in the target genre and extracting a document template from them. Schemas work well in practice for texts which are short (five sentences or fewer) and/or have a standardised structure, but they struggle with longer texts that lack a fixed structure.
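As a rough illustration, a schema can be thought of as a hard-coded ordering and grouping over message types. The Python sketch below is a deliberately minimal toy; the schema contents, message types, and the apply_schema helper are invented for this example, and real schema systems are far more expressive.

    # A minimal, hypothetical schema for a weather forecast: it fixes
    # both the order of message types and their grouping into paragraphs.
    WEATHER_SCHEMA = [
        ["precipitation", "sky_condition"],   # paragraph 1
        ["max_temperature"],                  # paragraph 2
    ]

    def apply_schema(schema, messages):
        """Order and group messages by looking up each message's type."""
        by_type = {m["type"]: m["text"] for m in messages}
        return [[by_type[t] for t in paragraph if t in by_type]
                for paragraph in schema]

    messages = [
        {"type": "max_temperature",
         "text": "Max temperature will be 10 °C on Saturday."},
        {"type": "precipitation", "text": "It will rain on Saturday."},
    ]
    print(apply_schema(WEATHER_SCHEMA, messages))
    # [['It will rain on Saturday.'], ['Max temperature will be 10 °C on Saturday.']]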

Corpus-based structuring techniques use statistical corpus analysis to automatically build ordering and/or grouping models. Such techniques are common in automatic summarisation, where a computer program automatically generates a summary of a textual document. [2] In principle they could be applied to text generated from non-linguistic data, but this work is in its infancy; part of the challenge is that texts generated by natural language generation systems are usually expected to be of fairly high quality, which is not always the case for texts generated by automatic summarisation systems.
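As a simplified illustration of the corpus-based idea, the sketch below learns pairwise precedence counts from a toy corpus of message-type sequences and scores candidate orderings against them. The data, names, and scoring function are invented for this example, and the approach is much cruder than published models such as Lapata's. [2]

    from collections import Counter
    from itertools import combinations, permutations

    # Toy corpus: each "document" is the sequence of message types a
    # human author used. (Invented data for illustration.)
    corpus = [
        ["rain", "sun", "temp_sat", "temp_sun"],
        ["rain", "temp_sat", "sun", "temp_sun"],
        ["rain", "sun", "temp_sun", "temp_sat"],
    ]

    # Count how often each message type precedes another in the corpus.
    precedes = Counter()
    for doc in corpus:
        for a, b in combinations(doc, 2):  # pairs in document order
            precedes[(a, b)] += 1

    def score(ordering):
        """Number of corpus-attested precedence pairs the ordering respects."""
        return sum(precedes[(a, b)] for a, b in combinations(ordering, 2))

    # Choose the candidate ordering that best agrees with the corpus.
    best = max(permutations(["temp_sun", "sun", "rain", "temp_sat"]), key=score)
    print(best)  # ('rain', 'sun', 'temp_sat', 'temp_sun')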

The final approach is heuristic-based structuring. Such algorithms perform the structuring task using heuristic rules, which can come from theories of rhetoric, [3] psycholinguistic models, [4] and/or a combination of intuition and feedback from pilot experiments with potential users. [5] Heuristic-based structuring is intellectually appealing, but it can be difficult to make it work well in practice, in part because heuristics often depend on semantic information (how sentences relate to each other) which is not always available. [6] On the other hand, heuristic rules can focus on what is best for readers, whereas the other approaches focus on imitating authors (and many human-authored texts are not well structured).
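As a toy illustration of heuristic structuring, the sketch below orders weather messages chronologically and starts a new paragraph whenever the day changes. Both rules, and all names in the code, are invented stand-ins for the richer rhetorical and psycholinguistic heuristics used in practice.

    # Invented message records for the running weather example.
    messages = [
        {"day": 2, "topic": "temperature",
         "text": "Max temperature will be 15 °C on Sunday."},
        {"day": 1, "topic": "precipitation", "text": "It will rain on Saturday."},
        {"day": 1, "topic": "temperature",
         "text": "Max temperature will be 10 °C on Saturday."},
        {"day": 2, "topic": "sky", "text": "It will be sunny on Sunday."},
    ]

    def structure(messages):
        # Heuristic 1: order messages chronologically (then by topic).
        ordered = sorted(messages, key=lambda m: (m["day"], m["topic"]))
        # Heuristic 2: start a new paragraph when the day changes.
        paragraphs = []
        for m in ordered:
            if not paragraphs or paragraphs[-1][-1]["day"] != m["day"]:
                paragraphs.append([])
            paragraphs[-1].append(m)
        return [[m["text"] for m in para] for para in paragraphs]

    for para in structure(messages):
        print(" ".join(para))
    # It will rain on Saturday. Max temperature will be 10 °C on Saturday.
    # It will be sunny on Sunday. Max temperature will be 15 °C on Sunday.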

Narrative

Perhaps the ultimate document structuring challenge is to generate a good narrative—in other words, a text which starts by setting the scene and giving an introduction/overview; then describes a set of events in a clear fashion, so readers can easily see how the individual events are related and link together; and concludes with a summary/ending. Note that narrative in this sense applies to factual texts as well as stories. Current NLG systems do not do a good job of generating narratives, and this is a major source of user criticism. [7]

Generating good narratives is a challenge for all aspects of NLG, but the most fundamental challenge is probably in document structuring.


References

  1. K McKeown (1985). Text Generation. Cambridge University Press
  2. M Lapata (2003). Probabilistic Text Structuring: Experiments with Sentence Ordering. Proceedings of ACL-2003
  3. D Scott and C de Souza (1990). Getting the message across in RST-based text generation. In Dale, Mellish, Zock (eds) Current research in natural language generation, pages 47-73
  4. N Karamanis, M Poesio, C Mellish, J Oberlander (2004). Evaluating Centering-based metrics of coherence for text structuring using a reliably annotated corpus. Proceedings of ACL-2004
  5. S Williams and E Reiter (2008). Generating basic skills reports for low-skilled readers. Natural Language Engineering 14:495-535
  6. M Raue and S Scholl (2018). The Use of Heuristics in Decision Making Under Risk and Uncertainty. In Raue, Lermer, Streicher (eds) Psychological Perspectives on Risk and Risk Analysis: Theory, Models, and Applications, pages 153-179. Springer. doi:10.1007/978-3-319-92478-6_7
  7. E Reiter, A Gatt, F Portet, M van der Meulen (2008). The Importance of Narrative and Other Lessons from an Evaluation of an NLG System that Summarises Clinical Data. In Proceedings of INLG-2008