Natural language generation

Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information". [1]

While it is widely agreed that the output of any NLG process is text, there is some disagreement about whether the inputs of an NLG system need to be non-linguistic. [2] Common applications of NLG methods include the production of various reports, for example weather [3] and patient reports; [4] image captions; [5] and chatbots.

Automated NLG can be compared to the process humans use when they turn ideas into writing or speech. Psycholinguists prefer the term language production for this process, which can also be described in mathematical terms, or modeled in a computer for psychological research. NLG systems can also be compared to translators of artificial computer languages, such as decompilers or transpilers, which also produce human-readable code generated from an intermediate representation. Human languages tend to be considerably more complex and allow for much more ambiguity and variety of expression than programming languages, which makes NLG more challenging.

NLG may be viewed as complementary to natural-language understanding (NLU): whereas in natural-language understanding, the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a representation into words. The practical considerations in building NLU vs. NLG systems are not symmetrical. NLU needs to deal with ambiguous or erroneous user input, whereas the ideas the system wants to express through NLG are generally known precisely. NLG needs to choose a specific, self-consistent textual representation from many potential representations, whereas NLU generally tries to produce a single, normalized representation of the idea expressed. [6]

NLG has existed since ELIZA was developed in the mid-1960s, but the methods were first used commercially in the 1990s. [7] NLG techniques range from simple template-based systems like a mail merge that generates form letters, to systems that have a complex understanding of human grammar. NLG can also be accomplished by training a statistical model using machine learning, typically on a large corpus of human-written texts. [8]

Example

The Pollen Forecast for Scotland system [9] is a simple example of an NLG system that is essentially a template. This system takes as input six numbers, which give predicted pollen levels in different parts of Scotland. From these numbers, the system generates a short textual summary of pollen levels as its output.

For example, using the historical data for July 1, 2005, the software produces:

Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country. However, in Northern areas, pollen levels will be moderate with values of 4.

In contrast, the actual forecast (written by a human meteorologist) from this data was:

Pollen counts are expected to remain high at level 6 over most of Scotland, and even level 7 in the south east. The only relief is in the Northern Isles and far northeast of mainland Scotland with medium levels of pollen count.

Comparing these two illustrates some of the choices that NLG systems must make; these are further discussed below.
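To make the template-based approach concrete, the following minimal Python sketch generates a summary in the style of the system output above. The thresholds, region handling, and wording are illustrative assumptions, not the actual rules of the Pollen Forecast for Scotland system.

```python
# Sketch of a template-based pollen-summary generator. Thresholds,
# region names, and phrasing are hypothetical, for illustration only.

def level_word(value):
    """Map a numeric pollen level onto a verbal category."""
    if value >= 6:
        return "high"
    if value >= 4:
        return "moderate"
    return "low"

def pollen_summary(levels):
    """levels: dict mapping region name -> predicted pollen level."""
    overall = round(sum(levels.values()) / len(levels))
    text = (f"Grass pollen levels will be {level_word(overall)} with values "
            f"of around {overall} across most parts of the country.")
    # Mention regions whose verbal category differs from the overall one.
    for region, value in levels.items():
        if level_word(value) != level_word(overall):
            text += (f" However, in {region} areas, pollen levels will be "
                     f"{level_word(value)} with values of {value}.")
    return text

print(pollen_summary({"Central": 6, "Southern": 7, "Northern": 4}))
```

Even this toy version must make the kinds of choices discussed under Stages below, such as choosing the word for a given level and deciding which regions deserve a separate sentence.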

Stages

The process to generate text can be as simple as keeping a list of canned text that is copied and pasted, possibly linked with some glue text. The results may be satisfactory in simple domains such as horoscope machines or generators of personalised business letters. However, a sophisticated NLG system needs to include stages of planning and merging of information to enable the generation of text that looks natural and does not become repetitive. The typical stages of natural-language generation, as proposed by Dale and Reiter, [6] are the following (a schematic sketch of the full pipeline appears after the list):

Content determination : Deciding what information to mention in the text. For instance, in the pollen example above, deciding whether to explicitly mention that pollen level is 7 in the south east.

Document structuring : Overall organisation of the information to convey. For example, deciding to describe the areas with high pollen levels first, instead of the areas with low pollen levels.

Aggregation : Merging of similar sentences to improve readability and naturalness. For instance, merging the two following sentences:

Grass pollen levels for Friday have increased from the moderate to high levels of yesterday. Grass pollen levels will be around 6 to 7 across most parts of the country.

into the following single sentence:

Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country.

Lexical choice : Putting words to the concepts. For example, deciding whether medium or moderate should be used when describing a pollen level of 4.

Referring expression generation : Creating referring expressions that identify objects and regions. For example, deciding to use in the Northern Isles and far northeast of mainland Scotland to refer to a certain region in Scotland. This task also includes making decisions about pronouns and other types of anaphora.

Realization : Creating the actual text, which should be correct according to the rules of syntax, morphology, and orthography. For example, using will be for the future tense of to be.
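To show how these stages fit together, here is a schematic Python sketch of the pipeline. Every function body is a deliberately simplistic placeholder (an assumption for illustration); real systems implement each stage with substantial domain knowledge.

```python
# Skeleton of a Dale-and-Reiter-style staged NLG pipeline with stub logic.

def content_determination(data):
    # Decide which facts are worth mentioning at all.
    return [fact for fact in data if fact["salient"]]

def document_structuring(facts):
    # Order the information, e.g. highest pollen levels first.
    return sorted(facts, key=lambda f: f["value"], reverse=True)

def aggregation(facts):
    # Group facts that can be merged into one sentence (trivially: all).
    return [facts]

def lexical_choice(value):
    # Put words to concepts, e.g. "moderate" vs "high".
    return "high" if value >= 6 else "moderate"

def referring_expression(region):
    # Decide how to name a region for the reader.
    return f"the {region} of the country"

def realization(sentence_groups):
    # Produce grammatical text from the abstract plan.
    sentences = []
    for group in sentence_groups:
        parts = [f"{lexical_choice(f['value'])} levels in "
                 f"{referring_expression(f['region'])}" for f in group]
        sentences.append("There will be " + " and ".join(parts) + ".")
    return " ".join(sentences)

data = [{"region": "south east", "value": 7, "salient": True},
        {"region": "north", "value": 4, "salient": True}]
plan = aggregation(document_structuring(content_determination(data)))
print(realization(plan))
# -> "There will be high levels in the south east of the country
#     and moderate levels in the north of the country."
```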

An alternative approach to NLG is to use "end-to-end" machine learning to build a system, without having separate stages as above. [10] In other words, we build an NLG system by training a machine learning algorithm (often an LSTM) on a large data set of input data and corresponding (human-written) output texts. The end-to-end approach has perhaps been most successful in image captioning, [11] that is, automatically generating a textual caption for an image.
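For contrast with the staged pipeline, the sketch below outlines an end-to-end model in PyTorch (one possible framework, assumed here): an LSTM encoder reads a linearized input record and an LSTM decoder learns to emit the corresponding human-written text. The vocabulary sizes and random tensors are placeholders for a real training corpus.

```python
import torch
import torch.nn as nn

# Skeleton of an end-to-end data-to-text model: no explicit NLG stages,
# just a learned mapping from input sequences to output text.
class Seq2SeqNLG(nn.Module):
    def __init__(self, in_vocab=1000, out_vocab=5000, hidden=256):
        super().__init__()
        self.embed_in = nn.Embedding(in_vocab, hidden)
        self.embed_out = nn.Embedding(out_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.project = nn.Linear(hidden, out_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed_in(src_ids))        # summarize the input record
        out, _ = self.decoder(self.embed_out(tgt_ids), state)  # condition the decoder on it
        return self.project(out)                               # scores over output words

model = Seq2SeqNLG()
src = torch.randint(0, 1000, (8, 12))   # batch of linearized data records (toy)
tgt = torch.randint(0, 5000, (8, 20))   # corresponding reference texts (toy)
logits = model(src, tgt[:, :-1])        # predict each next word from the previous ones
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 5000), tgt[:, 1:].reshape(-1))
loss.backward()  # a real system would loop this over a large corpus
```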

Applications

Automatic report generation

From a commercial perspective, the most successful NLG applications have been data-to-text systems which generate textual summaries of databases and data sets; these systems usually perform data analysis as well as text generation. Research has shown that textual summaries can be more effective than graphs and other visuals for decision support, [12] [13] [14] and that computer-generated texts can be superior (from the reader's perspective) to human-written texts. [15]

The first commercial data-to-text systems produced weather forecasts from weather data. The earliest such system to be deployed was FoG, [3] which was used by Environment Canada to generate weather forecasts in French and English in the early 1990s. The success of FoG triggered other work, both research and commercial. Recent applications include the UK Met Office's text-enhanced forecast. [16]

Data-to-text systems have since been applied in a range of settings. Following the minor earthquake near Beverly Hills, California, on March 17, 2014, The Los Angeles Times reported details about the time, location and strength of the quake within 3 minutes of the event. This report was automatically generated by a 'robo-journalist', which converted the incoming data into text via a preset template. [17] [18] Currently there is considerable commercial interest in using NLG to summarise financial and business data. Indeed, Gartner has said that NLG will become a standard feature of 90% of modern BI and analytics platforms. [19] NLG is also being used commercially in automated journalism, chatbots, generating product descriptions for e-commerce sites, summarising medical records, [20] [4] and enhancing accessibility (for example by describing graphs and data sets to blind people [21]).

An example of an interactive use of NLG is the WYSIWYM framework. It stands for What you see is what you meant and allows users to see and manipulate the continuously rendered view (NLG output) of an underlying formal language document (NLG input), thereby editing the formal language without learning it.

Looking ahead, the current progress in data-to-text generation paves the way for tailoring texts to specific audiences. For example, data from babies in neonatal care can be converted into text differently in a clinical setting, with different levels of technical detail and explanatory language, depending on the intended recipient of the text (doctor, nurse, patient). The same idea can be applied in a sports setting, with different reports generated for fans of specific teams. [22]

Image captioning

Over the past few years, there has been an increased interest in automatically generating captions for images, as part of a broader endeavor to investigate the interface between vision and language. A case of data-to-text generation, image captioning (or automatic image description) involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most prominent aspects of the image.

An image captioning system involves two sub-tasks. In Image Analysis, features and attributes of an image are detected and labelled, before mapping these outputs to linguistic structures. Recent research utilizes deep learning approaches through features from a pre-trained convolutional neural network such as AlexNet or VGG, or from models distributed with the Caffe framework, where caption generators use an activation layer from the pre-trained network as their input features. Text Generation, the second task, is performed using a wide range of techniques. For example, in the Midge system, input images are represented as triples consisting of object/stuff detections, action/pose detections and spatial relations. These are subsequently mapped to <noun, verb, preposition> triples and realized using a tree substitution grammar. [22]
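The image-analysis step described above can be sketched as follows, with torchvision's VGG-16 as one plausible choice of pre-trained network; the final projection stands in for whatever caption generator consumes the features.

```python
import torch
import torch.nn as nn
from torchvision import models

# Use an activation layer of a pre-trained CNN as input features for a
# caption generator (a sketch; downloads VGG-16 weights on first run).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
feature_extractor = nn.Sequential(
    vgg.features, vgg.avgpool, nn.Flatten(),
    *list(vgg.classifier[:-1]))  # keep the 4096-d penultimate activations

image = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed photograph
with torch.no_grad():
    features = feature_extractor(image)  # shape: (1, 4096)

# A caption decoder (e.g. the LSTM sketched earlier) would be initialized
# from these features and emit the description word by word.
decoder_state = nn.Linear(4096, 256)(features)
print(features.shape, decoder_state.shape)
```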

Despite advancements, challenges and opportunities remain in image captioning research. While the recent introduction of large datasets such as Flickr30K and MS COCO has enabled the training of more complex models such as neural networks, it has been argued that research in image captioning could benefit from larger and more diversified datasets. Designing automatic measures that can mimic human judgments in evaluating the suitability of image descriptions is another need in the area. Other open challenges include visual question-answering (VQA), [23] as well as the construction and evaluation of multilingual repositories for image description. [22]

Chatbots

Another area where NLG has been widely applied is automated dialogue systems, frequently in the form of chatbots. A chatbot or chatterbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent. While natural language processing (NLP) techniques are applied in deciphering human input, NLG informs the output side of chatbot algorithms, facilitating real-time dialogue.

Early chatbot systems, including Cleverbot created by Rollo Carpenter in 1988 and published in 1997, reply to questions by identifying how a human has responded to the same question in a conversation database using information retrieval (IR) techniques. Modern chatbot systems predominantly rely on machine learning (ML) models, such as sequence-to-sequence learning and reinforcement learning, to generate natural language output. Hybrid models have also been explored. For example, the Alibaba shopping assistant first uses an IR approach to retrieve the best candidates from the knowledge base, then uses an ML-driven seq2seq model to re-rank the candidate responses and generate the answer. [24]
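The retrieval stage of such a hybrid system can be as simple as nearest-neighbour search over past conversations. The sketch below uses TF-IDF cosine similarity (scikit-learn is assumed); the tiny database is hypothetical, and the seq2seq re-ranking step is left as a comment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# IR stage of a hybrid chatbot: retrieve answers whose recorded questions
# most resemble the user's input. Database contents are illustrative.
qa_database = [
    ("what time do you open", "We open at 9am every day."),
    ("do you ship overseas", "Yes, we ship to most countries."),
    ("how do I return an item", "Returns are free within 30 days."),
]

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(q for q, _ in qa_database)

def candidate_responses(user_input, k=2):
    query = vectorizer.transform([user_input])
    scores = cosine_similarity(query, question_matrix)[0]
    ranked = sorted(zip(scores, qa_database), key=lambda p: p[0], reverse=True)
    return [answer for _, (_, answer) in ranked[:k]]

# A seq2seq model would then re-rank these candidates (or generate a new
# reply conditioned on them), as in the Alibaba assistant described above.
print(candidate_responses("can you ship to other countries"))
```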

Creative writing and computational humor

Creative language generation by NLG has been hypothesized since the field's origins. A recent pioneer in the area is Phillip Parker, who has developed an arsenal of algorithms capable of automatically generating textbooks, crossword puzzles, poems and books on topics ranging from bookbinding to cataracts. [25] The advent of large pretrained transformer-based language models such as GPT-3 has also enabled breakthroughs, with such models demonstrating recognizable ability for creative-writing tasks. [26]

A related area of NLG application is computational humor production.  JAPE (Joke Analysis and Production Engine) is one of the earliest large, automated humor production systems that uses a hand-coded template-based approach to create punning riddles for children. HAHAcronym creates humorous reinterpretations of any given acronym, as well as proposing new fitting acronyms given some keywords. [27]

Despite progress, many challenges remain in producing automated creative and humorous content that rivals human output. In an experiment on generating satirical headlines, outputs of the best BERT-based model were perceived as funny 9.4% of the time (while real headlines from The Onion were 38.4%), and a GPT-2 model fine-tuned on satirical headlines achieved 6.9%. [28] It has been pointed out that two main issues with humor-generation systems are the lack of annotated data sets and the lack of formal evaluation methods, [27] which could be applicable to other creative content generation. Some have argued that, relative to other applications, there has been a lack of attention to creative aspects of language production within NLG. NLG researchers stand to benefit from insights into what constitutes creative language production, as well as from structural features of narrative that have the potential to improve NLG output even in data-to-text systems. [22]

Evaluation

As in other scientific fields, NLG researchers need to test how well their systems, modules, and algorithms work. This is called evaluation. There are three basic techniques for evaluating NLG systems:

Task-based (extrinsic) evaluation : Giving the generated text to a person and assessing how well it helps them perform a task (or otherwise achieves its communicative goal). For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors and assessing whether they help doctors make better decisions.

Human ratings : Giving the generated text to a person and asking them to rate the quality and usefulness of the text.

Metrics : Comparing generated texts to texts written by people from the same input data, using an automatic metric such as BLEU.

The ultimate measure is how useful NLG systems are at helping people, which is what task-based evaluation assesses. However, task-based evaluations are time-consuming and expensive, and can be difficult to carry out (especially if they require subjects with specialised expertise, such as doctors). Hence (as in other areas of NLP) task-based evaluations are the exception, not the norm.

Researchers have recently begun assessing how well human ratings and metrics correlate with (predict) task-based evaluations. Work is being conducted in the context of Generation Challenges [29] shared-task events. Initial results suggest that human ratings are much better than metrics in this regard. In other words, human ratings usually do predict task-effectiveness at least to some degree (although there are exceptions), while ratings produced by metrics often do not predict task-effectiveness well. These results are preliminary. In any case, human ratings are the most popular evaluation technique in NLG; this is in contrast to machine translation, where metrics are widely used.
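As an illustration of the metrics technique, the sketch below scores the machine-generated pollen forecast from the example above against the human-written one, using the BLEU implementation in NLTK (one of several available). The low score, despite both texts being reasonable forecasts, shows why such metrics can fail to track quality and task-effectiveness.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Score a generated text against a human reference with BLEU.
reference = ("pollen counts are expected to remain high at level 6 "
             "over most of scotland").split()
generated = ("grass pollen levels for friday have increased from the "
             "moderate to high levels of yesterday").split()

score = sentence_bleu([reference], generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # little n-gram overlap, hence a score near zero
```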

An AI can be graded on faithfulness to its training data or, alternatively, on factuality. A response that reflects the training data but not reality is faithful but not factual. A confident but unfaithful response is a hallucination. In natural language processing, a hallucination is often defined as "generated content that is nonsensical or unfaithful to the provided source content". [30]

See also

Natural language processing
Chatbot
Question answering
Automatic summarization
Multimodal interaction
Dialogue system
Realization (linguistics)
Referring expression generation
Lexical choice
Document structuring
Content determination
Outline of natural language processing
Feature learning
Arria NLG
Paraphrasing (computational linguistics)
Seq2seq
GPT-3
Prompt engineering
Text-to-image model
Hallucination (artificial intelligence)

References

  1. Reiter, Ehud; Dale, Robert (March 1997). "Building applied natural language generation systems". Natural Language Engineering. 3 (1): 57–87. doi:10.1017/S1351324997001502. ISSN 1469-8110. S2CID 8460470.
  2. Gatt A, Krahmer E (2018). "Survey of the state of the art in natural language generation: Core tasks, applications and evaluation". Journal of Artificial Intelligence Research. 61: 65–170. arXiv:1703.09902. doi:10.1613/jair.5477. S2CID 16946362.
  3. Goldberg E, Driedger N, Kittredge R (1994). "Using Natural-Language Processing to Produce Weather Forecasts". IEEE Expert. 9 (2): 45–53. doi:10.1109/64.294135. S2CID 9709337.
  4. Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, Sykes C (2009). "Automatic Generation of Textual Summaries from Neonatal Intensive Care Data" (PDF). Artificial Intelligence. 173 (7–8): 789–816. doi:10.1016/j.artint.2008.12.002.
  5. Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D (2010-09-05). Every picture tells a story: Generating sentences from images (PDF). European Conference on Computer Vision. Berlin, Heidelberg: Springer. pp. 15–29. doi:10.1007/978-3-642-15561-1_2.
  6. Dale, Robert; Reiter, Ehud (2000). Building Natural Language Generation Systems. Cambridge, U.K.: Cambridge University Press. ISBN 978-0-521-02451-8.
  7. Reiter, Ehud (2021-03-21). History of NLG. Archived from the original on 2021-12-12.
  8. Perera R, Nand P (2017). "Recent Advances in Natural Language Generation: A Survey and Classification of the Empirical Literature". Computing and Informatics. 36 (1): 1–32. doi:10.4149/cai_2017_1_1. hdl:10292/10691.
  9. Turner R, Sripada S, Reiter E, Davy I (2006). Generating Spatio-Temporal Descriptions in Pollen Forecasts. Proceedings of EACL 2006.
  10. "E2E NLG Challenge".
  11. "DataLabCup: Image Caption".
  12. Law A, Freer Y, Hunter J, Logie R, McIntosh N, Quinn J (2005). "A Comparison of Graphical and Textual Presentations of Time Series Data to Support Medical Decision Making in the Neonatal Intensive Care Unit". Journal of Clinical Monitoring and Computing. 19 (3): 183–94. doi:10.1007/s10877-005-0879-3. PMID 16244840. S2CID 5569544.
  13. Gkatzia D, Lemon O, Rieser V (2017). "Data-to-Text Generation Improves Decision-Making Under Uncertainty" (PDF). IEEE Computational Intelligence Magazine. 12 (3): 10–17. doi:10.1109/MCI.2017.2708998. S2CID 9544295.
  14. "Text or Graphics?". 2016-12-26.
  15. Reiter E, Sripada S, Hunter J, Yu J, Davy I (2005). "Choosing Words in Computer-Generated Weather Forecasts". Artificial Intelligence. 167 (1–2): 137–69. doi:10.1016/j.artint.2005.06.006.
  16. Sripada S, Burnett N, Turner R, Mastin J, Evans D (2014). Generating A Case Study: NLG meeting Weather Industry Demand for Quality and Quantity of Textual Weather Forecasts. Proceedings of INLG 2014.
  17. Schwencke, Ken (2014-03-17). "Earthquake aftershock: 2.7 quake strikes near Westwood". Los Angeles Times. Retrieved 2022-06-03.
  18. Levenson, Eric (2014-03-17). "L.A. Times Journalist Explains How a Bot Wrote His Earthquake Story for Him". The Atlantic. Retrieved 2022-06-03.
  19. "Neural Networks and Modern BI Platforms Will Evolve Data and Analytics".
  20. Harris MD (2008). "Building a Large-Scale Commercial NLG System for an EMR" (PDF). Proceedings of the Fifth International Natural Language Generation Conference. pp. 157–60.
  21. "Welcome to the iGraph-Lite page". www.inf.udec.cl. Archived from the original on 2010-03-16.
  22. Gatt, Albert; Krahmer, Emiel (2018-01-29). "Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation". arXiv:1703.09902 [cs.CL].
  23. Kodali, Venkat; Berleant, Daniel (2022). "Recent, Rapid Advancement in Visual Question Answering Architecture: a Review". Proceedings of the 22nd IEEE International Conference on EIT. pp. 133–146. arXiv:2203.01322.
  24. Mnasri, Maali (2019-03-21). "Recent advances in conversational NLP: Towards the standardization of Chatbot building". arXiv:1903.09025 [cs.CL].
  25. "How To Author Over 1 Million Books". HuffPost. 2013-02-11. Retrieved 2022-06-03.
  26. "Exploring GPT-3: A New Breakthrough in Language Generation". KDnuggets. Retrieved 2022-06-03.
  27. Winters, Thomas (2021-04-30). "Computers Learning Humor Is No Joke". Harvard Data Science Review. 3 (2). doi:10.1162/99608f92.f13a2337. S2CID 235589737.
  28. Horvitz, Zachary; Do, Nam; Littman, Michael L. (July 2020). "Context-Driven Satirical News Generation". Proceedings of the Second Workshop on Figurative Language Processing. Association for Computational Linguistics. pp. 40–50. doi:10.18653/v1/2020.figlang-1.5. S2CID 220330989.
  29. Generation Challenges.
  30. Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Madotto, Andrea; Fung, Pascale (17 November 2022). "Survey of Hallucination in Natural Language Generation". ACM Computing Surveys. 55 (12). arXiv:2202.03629. doi:10.1145/3571730. S2CID 246652372.

Further reading