Calais (Reuters product)

Calais is a service created by Thomson Reuters that automatically extracts semantic information from web pages in a format that can be used on the semantic web. [1] Calais was launched in January 2008 and is free to use. [2] [3] The technology is now available via the website of Refinitiv, a provider of financial market data and infrastructure that was founded in 2018 and is a subsidiary of London Stock Exchange Group. [4]

The Calais Web service reads unstructured text and returns results formatted in the Resource Description Framework (RDF) that identify entities, facts and events within the text. [5] The service appears to be based on technology acquired when Reuters purchased ClearForest in 2007. [6]
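To illustrate what consuming an RDF-formatted result might look like, the following is a minimal sketch that lists the entities found in a small RDF/XML document. The sample document, the `http://example.org/` vocabulary, and the `extract_entities` helper are all hypothetical and were made up for this illustration; they do not reproduce the actual Calais response format or API.

```python
import xml.etree.ElementTree as ET

# Standard RDF namespace, plus a made-up vocabulary used only in this sketch.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/schema#"  # hypothetical, not the Calais schema

# Hypothetical sample of an RDF/XML result describing two entities.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/schema#">
  <rdf:Description rdf:about="http://example.org/entity/1">
    <rdf:type rdf:resource="http://example.org/schema#Company"/>
    <ex:name>Thomson Reuters</ex:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/entity/2">
    <rdf:type rdf:resource="http://example.org/schema#Company"/>
    <ex:name>ClearForest</ex:name>
  </rdf:Description>
</rdf:RDF>
"""


def extract_entities(rdf_xml):
    """Return (entity_type, name) pairs from an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    entities = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        type_el = desc.find(f"{{{RDF_NS}}}type")
        name_el = desc.find(f"{{{EX_NS}}}name")
        if type_el is None or name_el is None:
            continue  # skip descriptions that are not named entities
        type_uri = type_el.get(f"{{{RDF_NS}}}resource", "")
        # Keep only the fragment after '#', e.g. "Company".
        entities.append((type_uri.rsplit("#", 1)[-1], name_el.text))
    return entities


if __name__ == "__main__":
    for entity_type, name in extract_entities(SAMPLE_RDF):
        print(f"{entity_type}: {name}")
```

A real client would obtain the RDF from the service rather than from a string constant, and would more likely use a dedicated RDF library than hand-written XML traversal.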

The technology has also been used to automatically tag blog articles [7] and organize museum collections. [8]

Calais uses natural language processing technologies delivered via a web service interface.

Related Research Articles

HTML: HyperText Markup Language

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It defines the meaning and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

Semantic Web: Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can be seen as information extraction.

An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.

Automatic identification and data capture (AIDC) refers to the methods of automatically identifying objects, collecting data about them, and entering them directly into computer systems, without human involvement. Technologies typically considered as part of AIDC include QR codes, bar codes, radio frequency identification (RFID), biometrics, magnetic stripes, optical character recognition (OCR), smart cards, and voice recognition. AIDC is also commonly referred to as "Automatic Identification", "Auto-ID" and "Automatic Data Capture".

Tag (metadata): Keyword assigned to information

In information systems, a tag is a keyword or term assigned to a piece of information. This kind of metadata helps describe an item and allows it to be found again by browsing or searching. Tags are generally chosen informally and personally by the item's creator or by its viewer, depending on the system, although they may also be chosen from a controlled vocabulary.

Geotagging: Act of associating geographic coordinates to digital media

Geotagging, or GeoTagging, is the process of adding geographical identification metadata to various media such as a geotagged photograph or video, websites, SMS messages, QR Codes or RSS feeds and is a form of geospatial metadata. This data usually consists of latitude and longitude coordinates, though they can also include altitude, bearing, distance, accuracy data, and place names, and perhaps a time stamp.

A semantic wiki is a wiki that has an underlying model of the knowledge described in its pages. Regular, or syntactic, wikis have structured text and untyped hyperlinks. Semantic wikis, on the other hand, provide the ability to capture or identify information about the data within pages, and the relationships between pages, in ways that can be queried or exported like a database through semantic queries.

Semantic MediaWiki: Software for creating, managing and sharing structured data in MediaWiki

Semantic MediaWiki (SMW) is an extension to MediaWiki that allows for annotating semantic data within wiki pages, thus turning a wiki that incorporates the extension into a semantic wiki. Data that has been encoded can be used in semantic searches, used for aggregation of pages, displayed in formats like maps, calendars and graphs, and exported to the outside world via formats like RDF and CSV.

RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.

Simple Knowledge Organization System (SKOS) is a W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is part of the Semantic Web family of standards built upon RDF and RDFS, and its main objective is to enable easy publication and use of such vocabularies as linked data.

Jaikoz: Java tagging program

Jaikoz is a Java program used for editing and mass tagging music file tags.

ClearForest was an Israeli software company that developed and marketed text analytics and text mining solutions.

DBpedia: Online database project

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

NEPOMUK is an open-source software specification that is concerned with the development of a social semantic desktop that enriches and interconnects data from different desktop applications using semantic metadata stored as RDF. Between 2006 and 2008 it was funded by a European Union research project of the same name that grouped together industrial and academic actors to develop various Semantic Desktop technologies.

Smart tags are an early selection-based search feature, found in later versions of Microsoft Word and beta versions of the Internet Explorer 6 web browser, by which the application recognizes certain words or types of data and converts them to hyperlinks. The feature is also included in other Microsoft Office programs as well as Visual Web Developer. Selection-based search allows a user to invoke an online service from any other page using only the mouse. Microsoft had initially intended the technology to be built into its Windows XP operating system but changed its plans due to public criticism.

Metadata: Data about data

Metadata is "data that provides information about other data", but not the content of the data itself, such as the text of a message or the image itself. There are many distinct types of metadata.

Twine (social network)

Twine was an online, social web service for information storage, authoring and discovery, located at twine.com, that existed from 2007 to 2010. It was created and run by Radar Networks. The service was announced on October 19, 2007 and made open to the public on October 21, 2008. On March 11, 2010, Radar Networks was acquired by Evri Inc. along with Twine.com. On May 14, 2010, twine.com was shut down, becoming a redirect to evri.com.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

The Open Semantic Framework (OSF) is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components developed specifically to provide a complete Web application framework. OSF is made available under the Apache 2 license.

References

  1. "Calais Overview". Archived from the original on 2008-10-24. Retrieved 2008-10-24.
  2. "Reuters Wants The World To Be Tagged". Archived from the original on 2008-04-11. Retrieved 2008-04-12.
  3. "Start making sense".
  4. "Intelligent Tagging". Refinitiv.
  5. "Calais API guide". Archived from the original on 2008-10-23. Retrieved 2008-10-24.
  6. "ClearForest". Archived from the original on 2013-01-24. Retrieved 2012-11-27.
  7. "WP Calais Auto Tagger".
  8. "OpenCalais meets our museum collection / auto-tagging and semantic parsing of collection data".