IBM SystemT

Developer(s): IBM
Written in: AQL, Java
Operating system: Linux, macOS, Windows
Type: Information extraction, text mining
Website: SystemT

IBM SystemT is a declarative information extraction system. It was first built in 2005 as a research project at IBM's Almaden Research Center. Its name is partially inspired by System R, a seminal project from the same research center.

SystemT [1] comprises three main components: (1) AQL, a declarative rule language with a syntax similar to that of SQL; (2) an optimizer, which accepts AQL statements as input and generates high-performance algebraic execution plans; and (3) an execution engine, which executes the plan produced by the optimizer and performs information extraction over input documents.
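
For illustration, here is a minimal AQL sketch of a declarative extraction rule, loosely modeled on examples from the SystemT papers; the view name and regular expression are illustrative, and exact syntax may vary across SystemT versions. The optimizer compiles statements like these into an algebraic plan, which the execution engine then evaluates over each input document.

    -- Extract US-style phone numbers from the input document.
    -- "Document" is AQL's built-in view over the document text.
    create view PhoneNumber as
      extract regex /\(\d{3}\)\s?\d{3}-\d{4}/
        on D.text as number
      from Document D;

    -- Expose the view as extractor output.
    output view PhoneNumber;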

SystemT is available as part of IBM BigInsights [2] and has also been taught at multiple universities around the globe. A version of SystemT has been available since September 2016 as a companion to a sequence of online courses in text analytics. [3] [4]

Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

In computing, a compiler is a computer program that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language to create an executable program.

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Computerized batch processing is a method of running software programs called jobs in batches automatically. While users are required to submit the jobs, no other interaction by the user is required to process the batch. Batches may automatically be run at scheduled times as well as being run contingent on the availability of computer resources.

Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an AI-hard problem.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends through means such as statistical pattern learning. According to Hotho et al. (2005), three different perspectives of text mining can be distinguished: information extraction, data mining, and the knowledge discovery in databases (KDD) process. Text mining usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions that are posed by humans in a natural language.

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
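
Rule-based NER of this kind is a primary use case for SystemT. As a rough sketch, assuming an illustrative inline dictionary and regular expression (not a production-quality extractor), candidate person mentions can be found in AQL by joining a salutation dictionary with a capitalized-word pattern:

    -- Illustrative inline dictionary of salutations.
    create dictionary SalutationDict as
      ('Mr.', 'Ms.', 'Mrs.', 'Dr.');

    create view Salutation as
      extract dictionary 'SalutationDict'
        on D.text as salutation
      from Document D;

    create view CapsWord as
      extract regex /[A-Z][a-z]+/
        on D.text as word
      from Document D;

    -- A salutation immediately followed (zero tokens apart) by a
    -- capitalized word is a candidate person mention.
    create view Person as
      select CombineSpans(S.salutation, C.word) as person
      from Salutation S, CapsWord C
      where FollowsTok(S.salutation, C.word, 0, 0);

    output view Person;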

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models such as RoBERTa, more difficult data domains can also be analyzed, e.g., news texts, where authors typically express their opinions less explicitly.

Algebraic modeling languages (AML) are high-level computer programming languages for describing and solving highly complex problems in large-scale mathematical computation. One particular advantage of some algebraic modeling languages like AIMMS, AMPL, GAMS, Gekko, MathProg, Mosel, and OPL is the similarity of their syntax to the mathematical notation of optimization problems. This allows for a very concise and readable definition of problems in the domain of optimization, supported by language elements such as sets, indices, algebraic expressions, powerful sparse index and data handling, variables, and constraints with arbitrary names. The algebraic formulation of a model does not contain any hints on how to process it.

General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications that require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

The following outline is provided as an overview of and topical guide to natural-language processing:

Marti Hearst is a professor in the School of Information at the University of California, Berkeley. She did early work in corpus-based computational linguistics, including some of the first work on automating sentiment analysis and word sense disambiguation. She invented an algorithm that became known as "Hearst patterns", which applies lexico-syntactic patterns to recognize hyponymy (is-a) relations with high accuracy in large text collections, including an early application of it to WordNet; this algorithm is widely used in commercial text mining applications, including ontology learning. Hearst also did early work on the automatic segmentation of text into topical discourse boundaries, inventing a now well-known approach called TextTiling.
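
To illustrate the idea in this article's setting (this is not Hearst's original implementation, which matched parsed noun phrases rather than single words), one such pattern, "X such as Y", can be approximated with a single AQL regex rule:

    -- "<hypernym> such as <hyponym>", e.g. "fruit such as apples".
    -- Matching single words here is a simplification; real Hearst
    -- patterns operate over noun phrases.
    create view HyponymCandidate as
      extract regex /([A-Za-z]+)\s+such\s+as\s+([A-Za-z]+)/
        on D.text
        return group 1 as hypernym
           and group 2 as hyponym
      from Document D;

    output view HyponymCandidate;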

In natural language processing (NLP), a word embedding is a representation of a word used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
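
Closeness in the vector space is commonly measured with cosine similarity: for embedding vectors u and v,

    sim(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}

which ranges from -1 to 1, with values near 1 indicating that the two words occur in similar contexts and are therefore expected to be similar in meaning.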

Apache SystemDS is an open source ML system for the end-to-end data science lifecycle.

This glossary of artificial intelligence is a list of definitions of terms and concepts relevant to the study of artificial intelligence, its sub-disciplines, and related fields. Related glossaries include Glossary of computer science, Glossary of robotics, and Glossary of machine vision.

Heng Ji is a computer scientist who works on information extraction and natural language processing. She is well known for her work on joint named entity recognition and relation extraction, as well as for her work on cross-document event extraction. She has coordinated the popular NIST TAC Knowledge Base Population task since 2010. In 2013 she was recognized as one of "AI's 10 to Watch" by IEEE Intelligent Systems, and she has won multiple awards, including an NSF CAREER Award in 2009, Google Research Awards in 2009 and 2014, and an IBM Watson Faculty Award in 2012.

Ellen Riloff is an American computer scientist currently serving as a professor at the School of Computing at the University of Utah. Her research focuses on Natural Language Processing and Computational Linguistics, specifically information extraction, sentiment analysis, semantic class induction, and bootstrapping methods that learn from unannotated texts.

References

  1. Chiticariu, Laura; Krishnamurthy, Rajasekar; Li, Yunyao; Raghavan, Sriram; Reiss, Frederick R.; Vaithyanathan, Shivakumar (2010-01-01). "SystemT: An Algebraic Approach to Declarative Information Extraction". Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. ACL '10. Uppsala, Sweden: Association for Computational Linguistics: 128–137.
  2. IBM BigInsights
  3. Text Analytics: Getting Results with SystemT
  4. Chiticariu, Laura; Danilevsky, Marina; Li, Yunyao; Reiss, Frederick; Zhu, Huaiyu (2018). "SystemT: Declarative Text Understanding for Enterprise". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). Association for Computational Linguistics: 76–83.