ThoughtTreasure

Last updated

ThoughtTreasure is a commonsense knowledge base and architecture for natural language processing. It contains both declarative and procedural knowledge.

Contents

Declarative knowledge

ThoughtTreasure's knowledge base consists of concepts, which are linked to one another by assertions. An assertion is represented in the form

@timestamp:timestamp|[concept ...]

Some examples of assertions in ThoughtTreasure are:

[isa soda drink] (A soda is a drink.)  [part-of phone-ringer phone] (A phone ringer is part of a phone.)  [green green-pea] (A green pea is green.)  [diameter-of green-pea .25in] (The diameter of a green pea is .25 inches.)  [duration attend-play NUMBER:second:10800] (The duration of a play is 10,800 seconds.)  [product-of Intel-8080 Intel] (An Intel 8080 is a product of Intel.)  @19770120:19810120|[President-of country-USA Jimmy-Carter] (Jimmy Carter was the President of the USA from January 20, 1977 to January 20, 1981.) 

ThoughtTreasure contains a total of 27,000 concepts and 51,000 assertions. It has an upper ontology and several domain-specific lower ontologies such as for clothing, food, and music.

Each concept is associated with zero or more lexical entries (words and phrases). Two languages are supported: English and French. ThoughtTreasure has 35,000 English lexical entries and 21,000 French lexical entries. In addition to open-class lexical entries such as nouns, verbs, adjectives, and adverbs, ThoughtTreasure also contains closed-class lexical entries such as conjunctions, determiners, interjections, prepositions, and pronouns. It also contains a dictionary of names.

Zero or more features are attached to each lexical entry. There are 118 features. Examples are ZEROART (zero article taker), SING (singular), FML (formal), CAN (Canadian), ENG (English), and N (noun). Argument structure is provided for verbs. For example, the argument structure for the concept walk-into is

*> S ---- (from IO[2]) into IO

ThoughtTreasure contains 93 scripts, or representations of typical activities.

ThoughtTreasure contains 29 grids, which represent the arrangement of objects in typical locations such as hotel rooms, kitchens, and theaters. Grids are connected together by wormholes.

Procedural knowledge

ThoughtTreasure includes a planning agency for achieving goals in a simulated world and an understanding agency for understanding stories and asking and answering questions.

ThoughtTreasure contains the following procedures for natural language processing:

ThoughtTreasure contains the following procedures that deal with space:

It contains operations dealing with parts and wholes of objects, grids (distance, subspace), large space (planetary distance, polity containment), and nested space (room, floor, building, city, planet).

Other procedures in ThoughtTreasure include:

Use

ThoughtTreasure can be used to add common sense to applications by using its knowledge base or by communicating with a ThoughtTreasure server.

ThoughtTreasure has been used to build various applications such as a DJ's assistant, a movie review question answering program, and a smart calendar.

History

ThoughtTreasure was begun by Erik Mueller in December 1993. The first version was released on April 28, 1996. Mueller established the company Signiform in 1997 to pursue commercial applications of ThoughtTreasure. However, the company was unsuccessful and Signiform closed its doors in 2000. In 2000, Erik Mueller moved to IBM Research, where he was a member of the team that developed Watson (computer). On July 31, 2015, ThoughtTreasure was made available on GitHub.

See also

Related Research Articles

In computing, a compiler is a computer program that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower level language to create an executable program.

Cyc

Cyc is a long-term artificial intelligence project that aims to assemble a comprehensive ontology and knowledge base that spans the basic concepts and rules about how the world works. Hoping to capture common sense knowledge, Cyc focuses on implicit knowledge that other AI platforms may take for granted. This is contrasted with facts one might find somewhere on the internet or retrieve via a search engine or Wikipedia. Cyc enables semantic reasoners to perform human-like reasoning and be less "brittle" when confronted with novel situations.

Intel 8080 8-bit microprocessor

The Intel 8080 ("eighty-eighty") is the second 8-bit microprocessor designed and manufactured by Intel. It first appeared in April 1974 and is an extended and enhanced variant of the earlier 8008 design, although without binary compatibility. The initial specified clock rate or frequency limit was 2 MHz, and with common instructions using 4, 5, 7, 10, or 11 cycles this meant that it operated at a typical speed of a few hundred thousand instructions per second. A faster variant 8080A-1 became available later with clock frequency limit up to 3.125 MHz.

Knowledge representation and reasoning is the field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can utilize to solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language. Knowledge representation incorporates findings from psychology about how humans solve problems and represent knowledge in order to design formalisms that will make complex systems easier to design and build. Knowledge representation and reasoning also incorporates findings from logic to automate various kinds of reasoning, such as the application of rules or the relations of sets and subsets.

Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain. Major logic programming language families include Prolog, answer set programming (ASP) and Datalog. In all of these languages, rules are written in the form of clauses:

In computer science, LR parsers are a type of bottom-up parser that analyses deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical LR(1) parsers, Minimal LR(1) parsers, GLR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

Semantic network

A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map.

WordNet Computational lexicon of English

WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. WordNet can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. WordNet was first created in the English language and the English WordNet database and software tools have been released under a BSD style license and are freely available for download from that WordNet website.

Zilog Z80 8-bit microprocessor

The Z80 is an 8-bit microprocessor introduced by Zilog as the startup company's first product. The Z80 was conceived by Federico Faggin in late 1974 and developed by him and his 11 employees starting in early 1975. The first working samples were delivered in March 1976, and it was officially introduced on the market in July 1976. With the revenue from the Z80, the company built its own chip factories and grew to over a thousand employees over the following two years.

In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject.

Wiktionary Multilingual, web-based project to create a free content dictionary

Wiktionary is a multilingual, web-based project to create a free content dictionary of terms in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustrations, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of words into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words wiki and dictionary. It is available in 171 languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.

Description logics (DL) are a family of formal knowledge representation languages. Many DLs are more expressive than propositional logic but less expressive than first-order logic. In contrast to the latter, the core reasoning problems for DLs are (usually) decidable, and efficient decision procedures have been designed and implemented for these problems. There are general, spatial, temporal, spatiotemporal, and fuzzy description logics, and each description logic features a different balance between expressive power and reasoning complexity by supporting different sets of mathematical constructors.

Open Mind Common Sense (OMCS) is an artificial intelligence project based at the Massachusetts Institute of Technology (MIT) Media Lab whose goal is to build and utilize a large commonsense knowledge base from the contributions of many thousands of people across the Web.

Intel iAPX 432

The iAPX 432 is a discontinued computer architecture introduced in 1981. It was Intel's first 32-bit processor design. The main processor of the architecture, the general data processor, is implemented as a set of two separate integrated circuits, due to technical limitations at the time. Although some early 8086, 80186 and 80286-based systems and manuals also used the iAPX prefix for marketing reasons, the iAPX 432 and the 8086 processor lines are completely separate designs with completely different instruction sets.

The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences. It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts. UMLS further provides facilities for natural language processing. It is intended to be used mainly by developers of systems in medical informatics.

Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

Natural-language programming (NLP) is an ontology-assisted way of programming in terms of natural-language sentences, e.g. English. A structured document with Content, sections and subsections for explanations of sentences forms a NLP document, which is actually a computer program. Natural languages and natural-language user interfaces include Inform 7, a natural programming language for making interactive fiction, Shakespeare, an esoteric natural programming language in the style of the plays of William Shakespeare, and Wolfram Alpha, a computational knowledge engine, using natural-language input. Some methods for program synthesis are based on natural-language programming.

Ontology engineering field which studies the methods and methodologies for building ontologies

In computer science, information science and systems engineering, ontology engineering is a field which studies the methods and methodologies for building ontologies, which encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities. In a broader sense, this field also includes a knowledge construction of the domain using formal ontology representations such as OWL/RDF. A large-scale representation of abstract concepts such as actions, time, physical objects and beliefs would be an example of ontological engineering. Ontology engineering is one of the areas of applied ontology, and can be seen as an application of philosophical ontology. Core ideas and objectives of ontology engineering are also central in conceptual modeling.

Rule-based machine translation is machine translation systems based on linguistic information about source and target languages basically retrieved from dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language respectively. Having input sentences, an RBMT system generates them to output sentences on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in a concrete translation task.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criteria is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

References