General Architecture for Text Engineering

GATE
	GATE Developer v5 main window
Developer(s)	GATE research team, Dept. Computer Science, University of Sheffield
Initial release	1995;28 years ago
Stable release	8.6.1 (January 17, 2020;3 years ago) [±]
Preview release	9.0-SNAPSHOT (March 25, 2023 (Nightly builds released every day)) [±]
Repository	github.com/GateNLP ;
Written in	Java
Operating system	Cross-platform
Available in	English
Type	Text mining Information Extraction
License	LGPL
Website	gate.ac.uk

Last updated March 26, 2023

General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.^[1]

As of May 28, 2011, 881 people are on the gate-users mailing list at SourceForge.net, and 111,932 downloads from SourceForge are recorded since the project moved to SourceForge in 2005.^[2] The paper "GATE: A framework and graphical development environment for robust NLP tools and applications"^[3] has received over 2000 citations since publication (according to Google Scholar). Books covering the use of GATE, in addition to the GATE User Guide,^[4] include "Building Search Applications: Lucene, LingPipe, and Gate", by Manu Konchady,^[5] and "Introduction to Linguistic Annotation and Text Analytics", by Graham Wilcock.^[6]

GATE community and research has been involved in several European research projects including: Transitioning Applications to Ontologies, SEKT, NeOn, Media-Campaign, Musing, Service-Finder, LIRICS and KnowledgeWeb.

Features

GATE includes an information extraction system called ANNIE (A Nearly-New Information Extraction System) which is a set of modules comprising a tokenizer, a gazetteer, a sentence splitter, a part of speech tagger, a named entities transducer and a coreference tagger. ANNIE can be used as-is to provide basic information extraction functionality, or provide a starting point for more specific tasks.

Languages currently handled in GATE include English, Chinese, Arabic, Bulgarian, French, German, Hindi, Italian, Cebuano, Romanian, Russian, Danish.

Plugins are included for machine learning with Weka, RASP, MAXENT, SVM Light, as well as a LIBSVM integration and an in-house perceptron implementation, for managing ontologies like WordNet, for querying search engines like Google or Yahoo, for part of speech tagging with Brill or TreeTagger, and many more. Many external plugins are also available, for handling e.g. tweets.^[7]

GATE accepts input in various formats, such as TXT, HTML, XML, Doc, PDF documents, and Java Serial, PostgreSQL, Lucene, Oracle Databases with help of RDBMS storage over JDBC.

JAPE transducers are used within GATE to manipulate annotations on text. Documentation is provided in the GATE User Guide.^[8] A tutorial has also been written by Press Association Images.^[9]

GATE Developer

The screenshot shows the document viewer used to display a document and its annotations. In pink are <a> hyperlink annotations from an HTML file. The right list is the annotation sets list, and the bottom table is the annotation list. In the center is the annotation editor window.

GATE Mímir

GATE generates vast quantities of information including; natural language text, semantic annotations, and ontological information. Sometimes the data itself is the end product of an application but often the information would be more useful if it could be efficiently searched. GATE Mimir provides support for indexing and searching the linguistic and semantic information generated by such applications and allows for querying the information using arbitrary combinations of text, structural information, and SPARQL.

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction

Web annotation refers to

online annotations of web resources such as web pages or parts of them, and
a set of W3C standards developed for this purpose.

An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Ontotext is a software company with offices in Europe and USA. It is the semantic technology branch of Sirma Group. Its main domain of activity is the development of software based on the Semantic Web languages and standards, in particular RDF, OWL and SPARQL. Ontotext is best known for the Ontotext GraphDB semantic graph database engine. Another major business line is the development of enterprise knowledge management and analytics systems that involve big knowledge graphs. Those systems are developed on top of the Ontotext Platform that builds on top of GraphDB capabilities for text mining using big knowledge graphs.

Linguistic categories include

Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

LanguageWare is a natural language processing (NLP) technology developed by IBM, which allows applications to process natural language text. It comprises a set of Java libraries which provide a range of NLP functions: language identification, text segmentation/tokenization, normalization, entity and relationship extraction, and semantic analysis and disambiguation. The analysis engine uses Finite State Machine approach at multiple levels, which aids its performance characteristics, while maintaining a reasonably small footprint.

In computational linguistics, JAPE is the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions. Thus, it is useful for pattern-matching, semantic extraction, and many other operations over syntactic trees such as those produced by natural language parsers.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context, and negated/not negated.

The following outline is provided as an overview of and topical guide to natural-language processing:

In natural language processing (NLP), a text graph is a graph representation of a text item. It is typically created as a preprocessing step to support NLP tasks such as text condensation term disambiguation (topic-based) text summarization, relation extraction and textual entailment.

The Open Semantic Framework (OSF) is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components developed specifically to provide a complete Web application framework. OSF is made available under the Apache 2 license.

In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner. A document with overlapping markup cannot be represented as a tree. This is also known as concurrent markup. Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages and editorial annotations.

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has been a point of focal activity for several W3C community groups, research projects, and infrastructure efforts since then.

References

↑ Languages mentioned on https://gate.ac.uk/gate/plugins/ include Arabic, Bulgarian, Cebuano, Chinese, French, German, Hindi, Italian, Romanian and Russian.
↑ "GATE" . Retrieved 17 December 2016.
↑ "GATE: A framework and graphical development environment for robust NLP tools and applications", by Cunningham H., Maynard D., Bontcheva K. and Tablan V. (In proc. of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002)
↑ "GATE.ac.uk - sale/tao/split.html" . Retrieved 17 December 2016.
↑ Konchady, Manu. Building Search Applications: Lucene, LingPipe, and Gate. Mustru Publishing. 2008.
↑ Wilcock, Graham (1 January 2009). Introduction to Linguistic Annotation and Text Analytics. Morgan & Claypool Publishers. ISBN 9781598297386 . Retrieved 17 December 2016– via Google Books.
↑ "GATE.ac.uk - wiki/twitie.html" . Retrieved 17 December 2016.
↑ "GATE.ac.uk - sale/tao/splitch8.html" . Retrieved 17 December 2016.
↑ Thakker, Dhavalkumar (17 July 2009). "Realizing Semantic Web: JAPE grammar tutorial" . Retrieved 17 December 2016.

External links

Official website

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Languages mentioned on https://gate.ac.uk/gate/plugins/ include Arabic, Bulgarian, Cebuano, Chinese, French, German, Hindi, Italian, Romanian and Russian.

[2] "GATE" . Retrieved 17 December 2016.

[3] "GATE: A framework and graphical development environment for robust NLP tools and applications", by Cunningham H., Maynard D., Bontcheva K. and Tablan V. (In proc. of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002)

[4] "GATE.ac.uk - sale/tao/split.html" . Retrieved 17 December 2016.

[5] Konchady, Manu. Building Search Applications: Lucene, LingPipe, and Gate. Mustru Publishing. 2008.

[6] Wilcock, Graham (1 January 2009). Introduction to Linguistic Annotation and Text Analytics. Morgan & Claypool Publishers. ISBN 9781598297386 . Retrieved 17 December 2016– via Google Books.

[7] "GATE.ac.uk - wiki/twitie.html" . Retrieved 17 December 2016.

[8] "GATE.ac.uk - sale/tao/splitch8.html" . Retrieved 17 December 2016.

[9] Thakker, Dhavalkumar (17 July 2009). "Realizing Semantic Web: JAPE grammar tutorial" . Retrieved 17 December 2016.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]