The following outline is provided as an overview of and topical guide to natural-language processing:
natural-language processing – computer activity in which computers analyze, understand, alter, or generate natural language. This includes the automation of any or all linguistic forms, activities, or methods of communication, such as conversation, correspondence, reading, written composition, dictation, publishing, translation, lip reading, and so on. Natural-language processing is also the name of the branch of computer science, artificial intelligence, and linguistics concerned with enabling computers to engage in communication using natural language(s) in all forms, including but not limited to speech, print, writing, and signing.
Natural-language processing can be described as all of the following:
The following technologies make natural-language processing possible:
Natural-language processing contributes to, and makes use of (the theories, tools, and methodologies from), the following fields:
Natural-language generation – task of converting information from computer databases into readable human language.
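A minimal sketch of one common approach to natural-language generation, template filling: fields from a structured record are slotted into a sentence template. The record fields and template below are hypothetical illustrations, not part of any real system.

```python
# Template-based natural-language generation: render a database-style
# record as a readable English sentence. Record schema is a toy example.

def generate_weather_report(record: dict) -> str:
    """Convert a structured weather record into a readable sentence."""
    template = "On {date}, {city} will be {condition} with a high of {high} degrees."
    return template.format(**record)

record = {"date": "Monday", "city": "Oslo", "condition": "cloudy", "high": 12}
print(generate_weather_report(record))
```

Real NLG systems add content selection, aggregation, and surface realization on top of this basic record-to-text step.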
History of natural-language processing
| Software | Year | Creator | Description |
|---|---|---|---|
| Georgetown experiment | 1954 | Georgetown University and IBM | Involved fully automatic translation of more than sixty Russian sentences into English. |
| STUDENT | 1964 | Daniel Bobrow | Could solve high school algebra word problems. [10] |
| ELIZA | 1964 | Joseph Weizenbaum | A simulation of a Rogerian psychotherapist that rephrased the user's responses using a few grammar rules (ELIZA was commonly referred to as "her" rather than "it"). [11] |
| SHRDLU | 1970 | Terry Winograd | A natural-language system working in restricted "blocks worlds" with restricted vocabularies; it worked extremely well. |
| PARRY | 1972 | Kenneth Colby | A chatterbot. |
| KL-ONE | 1974 | Sondheimer et al. | A knowledge representation system in the tradition of semantic networks and frames; it is a frame language. |
| MARGIE | 1975 | Roger Schank | |
| TaleSpin (software) | 1976 | Meehan | |
| QUALM | | Lehnert | |
| LIFER/LADDER | 1978 | Hendrix | A natural-language interface to a database of information about US Navy ships. |
| SAM (software) | 1978 | Cullingford | |
| PAM (software) | 1978 | Robert Wilensky | |
| Politics (software) | 1979 | Carbonell | |
| Plot Units (software) | 1981 | Lehnert | |
| Jabberwacky | 1982 | Rollo Carpenter | Chatterbot with the stated aim to "simulate natural human chat in an interesting, entertaining and humorous manner". |
| MUMBLE (software) | 1982 | McDonald | |
| Racter | 1983 | William Chamberlain and Thomas Etter | Chatterbot that generated English-language prose at random. |
| MOPTRANS | 1984 | Lytinen | |
| KODIAK (software) | 1986 | Wilensky | |
| Absity (software) | 1987 | Hirst | |
| AeroText | 1999 | Lockheed Martin | Originally developed for the U.S. intelligence community (Department of Defense) for information extraction and relational link analysis. |
| Watson | 2006 | IBM | A question-answering system that won the Jeopardy! contest, defeating the best human players in February 2011. |
| MeTA | 2014 | Sean Massung, Chase Geigle, ChengXiang Zhai | A modern C++ data sciences toolkit featuring text tokenization, including deep semantic features like parse trees; inverted and forward indexes with compression and various caching strategies; a collection of ranking functions for searching the indexes; topic models; classification algorithms; graph algorithms; language models; a CRF implementation (POS tagging, shallow parsing); wrappers for liblinear and libsvm (including libsvm dataset parsers); UTF-8 support for analysis in various languages; and multithreaded algorithms. |
| Tay | 2016 | Microsoft | An artificial-intelligence chatterbot that caused controversy on Twitter by releasing inflammatory tweets; it was taken offline shortly afterwards. |
The following natural-language processing toolkits are notable collections of natural-language processing software. They are suites of libraries, frameworks, and applications for symbolic, statistical natural-language and speech processing.
| Name | Language | License | Creators |
|---|---|---|---|
| Apertium | C++, Java | GPL | (various) |
| ChatScript | C++ | GPL | Bruce Wilcox |
| Deeplearning4j | Java, Scala | Apache 2.0 | Adam Gibson, Skymind |
| DELPH-IN | LISP, C++ | LGPL, MIT, ... | Deep Linguistic Processing with HPSG Initiative |
| Distinguo | C++ | Commercial | Ultralingua Inc. |
| DKPro Core | Java | Apache 2.0 / varying for individual modules | Technische Universität Darmstadt / online community |
| General Architecture for Text Engineering (GATE) | Java | LGPL | GATE open source community |
| Gensim | Python | LGPL | Radim Řehůřek |
| LinguaStream | Java | Free for research | University of Caen, France |
| Mallet | Java | Common Public License | University of Massachusetts Amherst |
| Modular Audio Recognition Framework | Java | BSD | The MARF Research and Development Group, Concordia University |
| MontyLingua | Python, Java | Free for research | MIT |
| Natural Language Toolkit (NLTK) | Python | Apache 2.0 | |
| Apache OpenNLP | Java | Apache License 2.0 | Online community |
| spaCy | Python, Cython | MIT | Matthew Honnibal, Explosion AI |
| UIMA | Java / C++ | Apache 2.0 | Apache |
Chatterbot – a text-based conversation agent that can interact with human users through some medium, such as an instant message service. Some chatterbots are designed for specific purposes, while others converse with human users on a wide range of topics.
Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.
Machine translation is the use of either rule-based or probabilistic machine learning approaches to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages.
Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.
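A classic baseline for WSD is the simplified Lesk algorithm: choose the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below uses a toy, hypothetical sense inventory rather than a real lexicon.

```python
# Simplified Lesk word-sense disambiguation: score each candidate sense
# by the overlap between its gloss and the context, pick the best.

def lesk(word: str, context: str, senses: dict) -> str:
    """Return the sense whose gloss overlaps most with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy sense inventory for "bank" (hypothetical glosses)
senses = {
    "financial_institution": "an institution that accepts deposits and lends money",
    "river_edge": "sloping land beside a body of water such as a river",
}
print(lesk("bank", "he sat on the sloping land by the river", senses))
```

Modern systems replace the bag-of-words overlap with supervised classifiers or contextual embeddings, but the sense-inventory framing is the same.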
Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subset of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an AI-hard problem.
Linguistics is the scientific study of human language. Someone who engages in this study is called a linguist. See also the Outline of linguistics, the List of phonetics topics, the List of linguists, and the List of cognitive science topics. Articles related to linguistics include:
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.
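A recursive-descent parser makes the formal-grammar framing concrete. The sketch below parses sentences against a toy context-free grammar (S -> NP VP, NP -> Det N, VP -> V NP) with a hypothetical six-word lexicon; real parsers use vastly larger grammars and handle ambiguity.

```python
# Recursive-descent parsing of a toy context-free grammar.
# Grammar: S -> NP VP, NP -> Det N, VP -> V NP.

LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "saw": "V", "chased": "V"}

def parse(tokens):
    """Return a nested (label, children...) parse tree, or raise ValueError."""
    pos = 0

    def expect(category):
        nonlocal pos
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == category:
            leaf = (category, tokens[pos])
            pos += 1
            return leaf
        raise ValueError(f"expected {category} at position {pos}")

    def np():
        return ("NP", expect("Det"), expect("N"))

    def vp():
        return ("VP", expect("V"), np())

    tree = ("S", np(), vp())
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree

print(parse("the dog chased a cat".split()))
```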
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
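One common way to compute such a numerical description is distributional: represent each word by counts of the context words it co-occurs with, and compare the vectors with cosine similarity. The co-occurrence counts below are hypothetical illustration data, not from a real corpus.

```python
# Distributional similarity: words as context-count vectors, compared
# by cosine similarity. Counts here are toy illustration data.
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

car = {"road": 4, "drive": 3, "wheel": 2}
bus = {"road": 3, "drive": 2, "passenger": 4}
banana = {"fruit": 5, "yellow": 3}

# "car" shares contexts with "bus" but none with "banana"
print(cosine(car, bus), cosine(car, banana))
```

Note that this measures relatedness rather than strict "is a" similarity: "car" would also score high against "road" in a richer model, which is exactly the confusion the paragraph above describes.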
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.
Terminology extraction is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
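A simple baseline for terminology extraction compares word frequencies in the domain corpus against a general reference corpus: words that are disproportionately frequent in the domain are proposed as candidate terms. The corpora, threshold, and smoothing below are toy, hypothetical choices.

```python
# Frequency-ratio terminology extraction: propose words that are much
# more frequent in the domain corpus than in a reference corpus.
from collections import Counter

def extract_terms(domain_text: str, reference_text: str, threshold: float = 2.0):
    """Return words whose domain/reference frequency ratio exceeds threshold."""
    domain = Counter(domain_text.lower().split())
    reference = Counter(reference_text.lower().split())
    total_d, total_r = sum(domain.values()), sum(reference.values())
    terms = []
    for word, count in domain.items():
        d_freq = count / total_d
        r_freq = (reference.get(word, 0) + 1) / (total_r + 1)  # add-one smoothing
        if d_freq / r_freq >= threshold:
            terms.append(word)
    return terms

domain = "the parser builds a parse tree the parser uses a grammar"
reference = "the cat sat on the mat and the dog sat too"
print(extract_terms(domain, reference))
```

Production systems add part-of-speech filters for multi-word terms and statistically stronger measures (e.g. log-likelihood ratios), but the contrastive-frequency idea is the same.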
Yorick Alexander Wilks FBCS was a British computer scientist. He was an emeritus professor of artificial intelligence at the University of Sheffield, visiting professor of artificial intelligence at Gresham College, senior research fellow at the Oxford Internet Institute, senior scientist at the Florida Institute for Human and Machine Cognition, and a member of the Epiphany Philosophers.
In natural language processing, semantic role labeling is the process that assigns labels to words or phrases in a sentence that indicates their semantic role in the sentence, such as that of an agent, goal, or result.
In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word. Given that the output of word-sense induction is a set of senses for the target word, this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context.
SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.
Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.
Deep Linguistic Processing with HPSG Initiative (DELPH-IN) is a collaboration in which computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.
In natural language processing (NLP), a text graph is a graph representation of a text item. It is typically created as a preprocessing step to support NLP tasks such as text condensation, term disambiguation, topic-based text summarization, relation extraction and textual entailment.
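A common way to build such a graph, as in TextRank-style preprocessing, is to make each word a node and link words that co-occur within a sliding window. The window size and sentence below are toy choices for illustration.

```python
# Build a co-occurrence text graph: nodes are words, edges connect
# words appearing within `window` tokens of each other.
from collections import defaultdict

def build_text_graph(tokens, window=2):
    """Return an undirected adjacency map over the token sequence."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if word != tokens[j]:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph

g = build_text_graph("cats chase mice and dogs chase cats".split())
print(sorted(g["chase"]))
```

Graph algorithms such as PageRank can then be run over the adjacency map to rank words or sentences for summarization.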
NooJ is a linguistic development environment software as well as a corpus processor constructed by Max Silberztein. NooJ allows linguists to construct the four classes of the Chomsky-Schützenberger hierarchy of generative grammars: Finite-State Grammars, Context-Free Grammars, Context-Sensitive Grammars as well as Unrestricted Grammars, using either a text editor, or a Graph editor.
Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations. Semantic parsing is one of the important tasks in computational linguistics and natural language processing.
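In its simplest rule-based form, semantic parsing maps an utterance onto a logical form via patterns. The two patterns and predicate names below are hypothetical illustrations, not a real formalism; modern systems learn this mapping with statistical or neural models.

```python
# Rule-based semantic parsing: match an utterance against patterns and
# emit a logical form. Patterns and predicates are toy examples.
import re

PATTERNS = [
    (re.compile(r"what is the capital of (\w+)"), "capital_of({0})"),
    (re.compile(r"who wrote (\w+)"), "author_of({0})"),
]

def semantic_parse(utterance: str) -> str:
    """Map a question to a machine-readable logical form."""
    text = utterance.lower().rstrip("?")
    for pattern, logical_form in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return logical_form.format(*match.groups())
    raise ValueError("no pattern matched")

print(semantic_parse("What is the capital of France?"))
```

The resulting logical form can then be executed against a knowledge base, which is how question-answering applications of semantic parsing typically work.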