Outline of natural language processing

The following outline is provided as an overview of and topical guide to natural-language processing:

Natural-language processing – computer activity in which computers are used to analyze, understand, alter, or generate natural language. This includes the automation of any or all linguistic forms, activities, or methods of communication, such as conversation, correspondence, reading, written composition, dictation, publishing, translation, and lip reading. Natural-language processing is also the name of the branch of computer science, artificial intelligence, and linguistics concerned with enabling computers to engage in communication using natural language(s) in all forms, including but not limited to speech, print, writing, and signing.

Natural-language processing

Natural-language processing can be described as all of the following:

Prerequisite technologies

The following technologies make natural-language processing possible:

Subfields of natural-language processing

Natural-language processing contributes to, and makes use of (the theories, tools, and methodologies from), the following fields:

Structures used in natural-language processing

Processes of NLP

Applications

Component processes

Component processes of natural-language understanding

  • Automatic document classification (text categorization)
  • Compound term processing – category of techniques that identify compound terms and match them to their definitions. Compound terms are built by combining two or more simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term.
  • Automatic taxonomy induction
  • Corpus processing
  • Deep linguistic processing
  • Discourse analysis – includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes–no questions, content questions, statements, assertions, orders, suggestions).
  • Information extraction
    • Text mining – process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning.
      • Biomedical text mining (also known as BioNLP) – text mining applied to texts and literature of the biomedical and molecular biology domain. It is a relatively recent research field drawing on natural-language processing, bioinformatics, medical informatics, and computational linguistics. Interest in text mining and information extraction strategies for the biomedical and molecular biology literature is increasing because of the growing number of electronically available publications stored in databases such as PubMed.
      • Decision tree learning
      • Sentence extraction
    • Terminology extraction
  • Latent semantic indexing
  • Lemmatisation – groups together all terms that share the same lemma so that they are classified as a single item.
  • Morphological segmentation – separates words into individual morphemes and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
  • Named-entity recognition (NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many languages written in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
  • Ontology learning – automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural-language text, and encoding them with an ontology language for easy retrieval. Also called "ontology extraction", "ontology generation", and "ontology acquisition".
  • Parsing – determines the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous, and typical sentences have multiple possible analyses. Perhaps surprisingly, for a typical sentence there may be thousands of potential parses, most of which will seem completely nonsensical to a human.
  • Part-of-speech tagging – given a sentence, determines the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb, or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is also prone to such ambiguity because it is a tonal language when spoken, and the tonal distinctions that convey the intended meaning are not readily conveyed by its orthography.
  • Query expansion
  • Relationship extraction – given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom).
  • Semantic analysis (computational) – formal analysis of meaning; "computational" refers to approaches that in principle support effective implementation.
  • Sentence breaking (also known as sentence boundary disambiguation and sentence detection) – given a chunk of text, finds the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
  • Speech segmentation – given a sound clip of a person or people speaking, separates it into words. A subtask of speech recognition and typically grouped with it.
  • Stemming – reduces an inflected or derived word to its word stem, base, or root form.
  • Text chunking
  • Tokenization – given a chunk of text, separates it into distinct words, symbols, sentences, or other units.
  • Topic segmentation and recognition – given a chunk of text, separates it into segments, each of which is devoted to a topic, and identifies the topic of each segment.
  • Truecasing
  • Word segmentation – separates a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages, such as Chinese, Japanese, and Thai, do not mark word boundaries in this way, and in those languages word segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
  • Word-sense disambiguation (WSD) – because many words have more than one meaning, word-sense disambiguation is used to select the meaning that makes the most sense in context. For this problem, a list of words and associated word senses is typically given, e.g. from a dictionary or from an online resource such as WordNet.
    • Word-sense induction – open problem of natural-language processing that concerns the automatic identification of the senses (i.e. meanings) of a word. Given that the output of word-sense induction is a set of senses for the target word (a sense inventory), this task is closely related to word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to resolve the ambiguity of words in context.
    • Automatic acquisition of sense-tagged corpora
  • W-shingling – set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.
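As a rough illustration of the last item, w-shingling can be sketched in a few lines of Python. The sketch below is not taken from any particular toolkit: the function names (tokenize, shingles, resemblance), the naive regular-expression tokenizer, and the choice of Jaccard similarity as the comparison measure are all assumptions made for the example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and pull out runs of letters and digits.

    This is a deliberately naive tokenizer, assumed here for illustration only.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens: list[str], w: int) -> set[tuple[str, ...]]:
    """Return the set of unique contiguous w-token subsequences (shingles)."""
    if w < 1:
        raise ValueError("w must be at least 1")
    # A document shorter than w tokens yields an empty set.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 2) -> float:
    """Jaccard similarity of the two documents' w-shingle sets.

    Jaccard similarity is one common choice of measure, assumed here.
    """
    sa = shingles(tokenize(a), w)
    sb = shingles(tokenize(b), w)
    if not sa and not sb:
        return 1.0  # two empty shingle sets are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(sorted(shingles(tokenize("a rose is a rose is a rose"), 4)))
    print(resemblance("the cat sat on the mat", "the cat sat on the rug"))
```

With w = 2, the two six-word sentences in the usage example differ only in their final word, so they share four of six distinct shingles and resemblance returns about 0.67. Larger values of w make the comparison stricter, since longer runs of tokens must match exactly.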

Component processes of natural-language generation

Natural-language generation – task of converting information from computer databases into readable human language.

  • Automatic taxonomy induction (ATI) – automated building of tree structures from a corpus. ATI is used to construct the core of ontologies, which makes it a component process of natural-language understanding. When the ontologies being constructed are readable by end users (such as a subject outline) and are used to construct further documentation (such as using an outline as the basis for a report or treatise), ATI also becomes a component process of natural-language generation.
  • Document structuring

History of natural-language processing

Timeline of NLP software

| Software | Year | Creator | Description | Reference |
|---|---|---|---|---|
| Georgetown experiment | 1954 | Georgetown University and IBM | Involved fully automatic translation of more than sixty Russian sentences into English. | |
| STUDENT | 1964 | Daniel Bobrow | Could solve high school algebra word problems. | [10] |
| ELIZA | 1964 | Joseph Weizenbaum | A simulation of a Rogerian psychotherapist (referred to as "her", not "it") that rephrased the user's statements as responses using a few grammar rules. | [11] |
| SHRDLU | 1970 | Terry Winograd | A natural-language system working in restricted "blocks worlds" with restricted vocabularies; worked extremely well. | |
| PARRY | 1972 | Kenneth Colby | A chatterbot. | |
| KL-ONE | 1974 | Sondheimer et al. | A knowledge representation system in the tradition of semantic networks and frames; it is a frame language. | |
| MARGIE | 1975 | Roger Schank | | |
| TaleSpin (software) | 1976 | Meehan | | |
| QUALM | | Lehnert | | |
| LIFER/LADDER | 1978 | Hendrix | A natural-language interface to a database of information about US Navy ships. | |
| SAM (software) | 1978 | Cullingford | | |
| PAM (software) | 1978 | Robert Wilensky | | |
| Politics (software) | 1979 | Carbonell | | |
| Plot Units (software) | 1981 | Lehnert | | |
| Jabberwacky | 1982 | Rollo Carpenter | Chatterbot with the stated aim to "simulate natural human chat in an interesting, entertaining and humorous manner". | |
| MUMBLE (software) | 1982 | McDonald | | |
| Racter | 1983 | William Chamberlain and Thomas Etter | Chatterbot that generated English-language prose at random. | |
| MOPTRANS | 1984 | Lytinen | | |
| KODIAK (software) | 1986 | Wilensky | | |
| Absity (software) | 1987 | Hirst | | |
| AeroText | 1999 | Lockheed Martin | Originally developed for the U.S. intelligence community (Department of Defense) for information extraction and relational link analysis. | |
| Watson | 2006 | IBM | A question answering system that won the Jeopardy! contest, defeating the best human players in February 2011. | |
| MeTA | 2014 | Sean Massung, Chase Geigle, ChengXiang Zhai | A modern C++ data sciences toolkit featuring: text tokenization, including deep semantic features like parse trees; inverted and forward indexes with compression and various caching strategies; a collection of ranking functions for searching the indexes; topic models; classification algorithms; graph algorithms; language models; CRF implementation (POS-tagging, shallow parsing); wrappers for liblinear and libsvm (including libsvm dataset parsers); UTF8 support for analysis on various languages; multithreaded algorithms. | |
| Tay | 2016 | Microsoft | An artificial intelligence chatterbot that caused controversy on Twitter by releasing inflammatory tweets and was taken offline shortly after. | |

General natural-language processing concepts

Natural-language processing tools

Corpora

Natural-language processing toolkits

The following natural-language processing toolkits are notable collections of natural-language processing software. They are suites of libraries, frameworks, and applications for symbolic and statistical natural-language and speech processing.

| Name | Language | License | Creators |
|---|---|---|---|
| Apertium | C++, Java | GPL | (various) |
| ChatScript | C++ | GPL | Bruce Wilcox |
| Deeplearning4j | Java, Scala | Apache 2.0 | Adam Gibson, Skymind |
| DELPH-IN | LISP, C++ | LGPL, MIT, ... | Deep Linguistic Processing with HPSG Initiative |
| Distinguo | C++ | Commercial | Ultralingua Inc. |
| DKPro Core | Java | Apache 2.0 / Varying for individual modules | Technische Universität Darmstadt / Online community |
| General Architecture for Text Engineering (GATE) | Java | LGPL | GATE open source community |
| Gensim | Python | LGPL | Radim Řehůřek |
| LinguaStream | Java | Free for research | University of Caen, France |
| Mallet | Java | Common Public License | University of Massachusetts Amherst |
| Modular Audio Recognition Framework | Java | BSD | The MARF Research and Development Group, Concordia University |
| MontyLingua | Python, Java | Free for research | MIT |
| Natural Language Toolkit (NLTK) | Python | Apache 2.0 | |
| Apache OpenNLP | Java | Apache License 2.0 | Online community |
| spaCy | Python, Cython | MIT | Matthew Honnibal, Explosion AI |
| UIMA | Java / C++ | Apache 2.0 | Apache |

Named-entity recognizers

Translation software

Other software

Chatterbots

Chatterbot – a text-based conversation agent that can interact with human users through some medium, such as an instant message service. Some chatterbots are designed for specific purposes, while others converse with human users on a wide range of topics.

Classic chatterbots

General chatterbots

Instant messenger chatterbots

Natural-language processing organizations

Companies involved in natural-language processing

Natural-language processing publications

Books

Book series

Journals

People influential in natural-language processing

See also


References

  1. "... modern science is a discovery as well as an invention. It was a discovery that nature generally acts regularly enough to be described by laws and even by mathematics; and required invention to devise the techniques, abstractions, apparatus, and organization for exhibiting the regularities and securing their law-like descriptions." J. L. Heilbron (editor-in-chief, 2003). The Oxford Companion to the History of Modern Science, p. vii. New York: Oxford University Press. ISBN 0-19-511229-6.
    • "science". Merriam-Webster Online Dictionary. Merriam-Webster, Inc. Retrieved 2011-10-16. 3 a: knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method b: such knowledge or such a system of knowledge concerned with the physical world and its phenomena
  2. Pierre Bourque; Robert Dupuis, eds. (2004). Guide to the Software Engineering Body of Knowledge (SWEBOK) - 2004 Version. Executive editors: Alain Abran, James W. Moore. IEEE Computer Society. p. 1. ISBN 0-7695-2330-7.
  3. ACM (2006). "Computing Degrees & Careers". ACM. Archived from the original on 2011-06-17. Retrieved 2010-11-23.
  4. Laplante, Phillip (2007). What Every Engineer Should Know about Software Engineering. Boca Raton: CRC. ISBN 978-0-8493-7228-5. Retrieved 2011-01-21.
  5. Input device Computer Hope
  6. McQuail, Denis. (2005). Mcquail's Mass Communication Theory. 5th ed. London: SAGE Publications.
  7. Yucong Duan, Christophe Cruz (2011). "Formalizing Semantic of Natural Language through Conceptualization from Existence" (http://www.ijimt.org/abstract/100-E00187.htm). International Journal of Innovation, Management and Technology 2 (1), pp. 37–42.
  8. "Tool Module: Chomsky's Universal Grammar". thebrain.mcgill.ca.
  9. Roger Schank, 1969, A conceptual dependency parser for natural language Proceedings of the 1969 conference on Computational linguistics, Sång-Säby, Sweden pages 1-3
  10. McCorduck 2004, p. 286; Crevier 1993, pp. 76–79; Russell & Norvig 2003, p. 19.
  11. McCorduck 2004, pp. 291–296; Crevier 1993, pp. 134–139.
  12. "МНОГОЦЕЛЕВОЙ ЛИНГВИСТИЧЕСКИЙ ПРОЦЕССОР ЭТАП-3". Iitp.ru. Retrieved 2012-02-14.
  13. "Aiming to Learn as We Do, a Machine Teaches Itself". New York Times . October 4, 2010. Retrieved 2010-10-05. Since the start of the year, a team of researchers at Carnegie Mellon University — supported by grants from the Defense Advanced Research Projects Agency and Google, and tapping into a research supercomputing cluster provided by Yahoo — has been fine-tuning a computer system that is trying to master semantics by learning more like a human.
  14. Project Overview, Carnegie Mellon University. Accessed October 5, 2010.
  15. "Loebner Prize Contest 2013". People.exeter.ac.uk. 2013-09-14. Retrieved 2013-12-02.
  16. Gibes, Al (2002-03-25). "Circle of buddies grows ever wider". Las Vegas Review-Journal (Nevada).
  17. "ActiveBuddy Introduces Software to Create and Deploy Interactive Agents for Text Messaging; ActiveBuddy Developer Site Now Open: www.BuddyScript.com". Business Wire. 2002-07-15. Retrieved 2014-01-16.
  18. Lenzo, Kevin (Summer 1998). "Infobots and Purl". The Perl Journal. 3 (2). Retrieved 2010-07-26.
  19. Laorden, Carlos; Galan-Garcia, Patxi; Santos, Igor; Sanz, Borja; Hidalgo, Jose Maria Gomez; Bringas, Pablo G. (23 August 2012). Negobot: A conversational agent based on game theory for the detection of paedophile behaviour (PDF). ISBN 978-3-642-33018-6. Archived from the original (PDF) on 2013-09-17.
  20. Wermter, Stephan; Ellen Riloff; Gabriele Scheler (1996). Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Springer.
  21. Jurafsky, Dan; James H. Martin (2008). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Upper Saddle River (N.J.): Prentice Hall. p. 2.
  22. "SEM1A5 - Part 1 - A brief history of NLP". Retrieved 2010-06-25.
  23. Roger Schank, 1969, A conceptual dependency parser for natural language Proceedings of the 1969 conference on Computational linguistics, Sång-Säby, Sweden, pages 1-3
  24. Ibrahim, Amr Helmy. 2002. "Maurice Gross (1934-2001). À la mémoire de Maurice Gross". Hermès 34.
  25. Dougherty, Ray. 2001. Maurice Gross Memorial Letter.
  26. "Programming with Natural Language Is Actually Going to Work—Wolfram Blog". 16 November 2010.

Bibliography