ThoughtTreasure

ThoughtTreasure is a commonsense knowledge base and architecture for natural language processing. It contains both declarative and procedural knowledge.

Declarative knowledge

ThoughtTreasure's knowledge base consists of concepts, which are linked to one another by assertions. An assertion is represented in the form

@timestamp:timestamp|[concept ...]

Some examples of assertions in ThoughtTreasure are:

[isa soda drink] (A soda is a drink.)
[part-of phone-ringer phone] (A phone ringer is part of a phone.)
[green green-pea] (A green pea is green.)
[diameter-of green-pea .25in] (The diameter of a green pea is .25 inches.)
[duration attend-play NUMBER:second:10800] (The duration of a play is 10,800 seconds.)
[product-of Intel-8080 Intel] (An Intel 8080 is a product of Intel.)
@19770120:19810120|[President-of country-USA Jimmy-Carter] (Jimmy Carter was the President of the USA from January 20, 1977 to January 20, 1981.)
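
The assertion notation lends itself to straightforward machine parsing. As an illustration only (this is not ThoughtTreasure's own API), a minimal Python sketch that parses flat assertions like those above into a simple structure might look as follows; the class and function names are hypothetical:

import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Assertion:
    concepts: List[str]            # relation name followed by its arguments
    start: Optional[str] = None    # optional start timestamp, e.g. "19770120"
    end: Optional[str] = None      # optional end timestamp, e.g. "19810120"

def parse_assertion(text: str) -> Assertion:
    """Parse strings such as '@19770120:19810120|[President-of country-USA Jimmy-Carter]'."""
    start = end = None
    m = re.match(r'@([^:]*):([^|]*)\|(.*)', text)
    if m:                                   # strip an optional @start:end| prefix
        start, end, text = m.group(1) or None, m.group(2) or None, m.group(3)
    inner = text.strip()
    if not (inner.startswith('[') and inner.endswith(']')):
        raise ValueError('expected a bracketed concept list')
    return Assertion(concepts=inner[1:-1].split(), start=start, end=end)

print(parse_assertion('[isa soda drink]'))
print(parse_assertion('@19770120:19810120|[President-of country-USA Jimmy-Carter]'))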

ThoughtTreasure contains a total of 27,000 concepts and 51,000 assertions. It has an upper ontology and several domain-specific lower ontologies, for areas such as clothing, food, and music.

Each concept is associated with zero or more lexical entries (words and phrases). Two languages are supported: English and French. ThoughtTreasure has 35,000 English lexical entries and 21,000 French lexical entries. In addition to open-class lexical entries such as nouns, verbs, adjectives, and adverbs, ThoughtTreasure also contains closed-class lexical entries such as conjunctions, determiners, interjections, prepositions, and pronouns. It also contains a dictionary of names.

Zero or more features are attached to each lexical entry. There are 118 features. Examples are ZEROART (zero article taker), SING (singular), FML (formal), CAN (Canadian), ENG (English), and N (noun). Argument structure is provided for verbs. For example, the argument structure for the concept walk-into is

*> S ---- (from IO[2]) into IO
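
For illustration, a program reading ThoughtTreasure's lexicon might represent a lexical entry, its features, and its argument structure with a simple record such as the one below; the class, the field names, and the French gloss are hypothetical and only indicative of the kind of information attached to entries:

from dataclasses import dataclass, field
from typing import List

@dataclass
class LexicalEntry:
    text: str                       # the word or phrase
    language: str                   # "English" or "French"
    part_of_speech: str             # e.g. "N" (noun), "V" (verb)
    features: List[str] = field(default_factory=list)   # e.g. ["SING", "FML"]
    argument_structure: str = ""    # verbs only, e.g. "*> S ---- (from IO[2]) into IO"

# A concept maps to zero or more lexical entries in each language.
walk_into = [
    LexicalEntry("walk into", "English", "V",
                 argument_structure="*> S ---- (from IO[2]) into IO"),
    LexicalEntry("entrer dans", "French", "V"),
]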

ThoughtTreasure contains 93 scripts, or representations of typical activities.

ThoughtTreasure contains 29 grids, which represent the arrangement of objects in typical locations such as hotel rooms, kitchens, and theaters. Grids are connected to one another by wormholes.
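
A grid can be pictured as a small two-dimensional character map in which each character marks an object at a cell, and a wormhole links a cell of one grid to a cell of another. The toy layout and names below are invented for this sketch and are not taken from ThoughtTreasure's data files:

# A toy 2-D grid: each character marks an object at that cell ('.' = empty floor).
kitchen = [
    "WWWWWW",   # W = wall
    "W.s..W",   # s = sink
    "W.t..W",   # t = table
    "WWDWWW",   # D = doorway
]

# A "wormhole" connects a cell in one grid to a cell in another grid.
wormholes = {
    ("kitchen", (3, 2)): ("hallway", (0, 4)),   # the kitchen doorway leads into the hallway
}

def manhattan_distance(a, b):
    """Distance between two cells of the same grid, given as (row, column) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

print(manhattan_distance((1, 2), (2, 2)))   # sink to table: 1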

Procedural knowledge

ThoughtTreasure includes a planning agency for achieving goals in a simulated world and an understanding agency for understanding stories and asking and answering questions.

ThoughtTreasure contains procedures for natural language processing, such as parsing and generating English and French text.

ThoughtTreasure also contains procedures that deal with space: operations on parts and wholes of objects, on grids (distance, subspace), on large space (planetary distance, polity containment), and on nested space (room, floor, building, city, planet).
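
As a rough illustration of the nested-space idea, containment can be decided by walking a chain of enclosing-space facts; the relation, names, and data below are hypothetical rather than ThoughtTreasure's actual assertions:

# Hypothetical containment facts: each space maps to the space that encloses it.
enclosing = {
    "room-101": "second-floor",
    "second-floor": "hotel-luxor",
    "hotel-luxor": "Las-Vegas",
    "Las-Vegas": "planet-Earth",
}

def is_within(inner, outer):
    """True if 'inner' is nested, directly or indirectly, inside 'outer'."""
    while inner in enclosing:
        inner = enclosing[inner]
        if inner == outer:
            return True
    return False

print(is_within("room-101", "Las-Vegas"))   # True
print(is_within("Las-Vegas", "room-101"))   # False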

ThoughtTreasure includes a number of other procedures as well.

Use

ThoughtTreasure can be used to add common sense to applications by using its knowledge base or by communicating with a ThoughtTreasure server.
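
For the server route, a client exchanges requests and replies with a running ThoughtTreasure server over a TCP socket. The sketch below is deliberately generic: the port number and the query string are placeholders assumed for illustration, and the actual command syntax is defined by the ThoughtTreasure server protocol documentation:

import socket

# Minimal sketch of talking to a running ThoughtTreasure server over TCP.
# The port number and the query string are placeholders for illustration;
# consult the ThoughtTreasure server protocol documentation for real values.
HOST, PORT = "localhost", 1832          # assumed host and port
QUERY = "example query\n"               # placeholder newline-terminated command

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    sock.sendall(QUERY.encode("ascii"))
    reply = sock.recv(4096).decode("ascii", errors="replace")
    print(reply)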

ThoughtTreasure has been used to build various applications such as a DJ's assistant, a movie review question answering program, and a smart calendar.

History

ThoughtTreasure was begun by Erik Mueller in December 1993, and the first version was released on April 28, 1996. Mueller founded Signiform in 1997 to pursue commercial applications of ThoughtTreasure, but the company was unsuccessful and closed in 2000. Mueller then moved to IBM Research, where he was a member of the team that developed Watson. On July 31, 2015, ThoughtTreasure was made available on GitHub.

