Calais is a service created by Thomson Reuters that automatically extracts semantic information from web pages in a format that can be used on the semantic web. [1] Calais was launched in January 2008, and is free to use. [2] [3] The technology is now available via the website of Refinitiv, a provider of financial market data and infrastructure founded in 2018, that is a subsidiary of London Stock Exchange Group. [4]
The Calais Web service reads unstructured text and returns Resource Description Framework formatted results identifying entities, facts and events within the text. [5] The service appears to be based on technology acquired when Reuters purchased ClearForest in 2007. [6]
The technology has also been used to automatically tag blog articles, [7] and organize museum collections. [8]
Calais uses natural language processing technologies delivered via a web service interface.
HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic, but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.
Automatic identification and data capture (AIDC) refers to the methods of automatically identifying objects, collecting data about them, and entering them directly into computer systems, without human involvement. Technologies typically considered as part of AIDC include QR codes, bar codes, radio frequency identification (RFID), biometrics, magnetic stripes, optical character recognition (OCR), smart cards, and voice recognition. AIDC is also commonly referred to as "Automatic Identification", "Auto-ID" and "Automatic Data Capture".
In information systems, a tag is a keyword or term assigned to a piece of information. This kind of metadata helps describe an item and allows it to be found again by browsing or searching. Tags are generally chosen informally and personally by the item's creator or by its viewer, depending on the system, although they may also be chosen from a controlled vocabulary.
Geotagging, or GeoTagging, is the process of adding geographical identification metadata to various media such as a geotagged photograph or video, websites, SMS messages, QR Codes or RSS feeds and is a form of geospatial metadata. This data usually consists of latitude and longitude coordinates, though they can also include altitude, bearing, distance, accuracy data, and place names, and perhaps a time stamp.
A semantic wiki is a wiki that has an underlying model of the knowledge described in its pages. Regular, or syntactic, wikis have structured text and untyped hyperlinks. Semantic wikis, on the other hand, provide the ability to capture or identify information about the data within pages, and the relationships between pages, in ways that can be queried or exported like a database through semantic queries.
Semantic MediaWiki (SMW) is an extension to MediaWiki that allows for annotating semantic data within wiki pages, thus turning a wiki that incorporates the extension into a semantic wiki. Data that has been encoded can be used in semantic searches, used for aggregation of pages, displayed in formats like maps, calendars and graphs, and exported to the outside world via formats like RDF and CSV.
RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.
Simple Knowledge Organization System (SKOS) is a W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is part of the Semantic Web family of standards built upon RDF and RDFS, and its main objective is to enable easy publication and use of such vocabularies as linked data.
Jaikoz is a Java program used for editing and mass tagging music file tags.
ClearForest was an Israeli software company that developed and marketed text analytics and text mining solutions.
DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
NEPOMUK is an open-source software specification that is concerned with the development of a social semantic desktop that enriches and interconnects data from different desktop applications using semantic metadata stored as RDF. Between 2006 and 2008 it was funded by a European Union research project of the same name that grouped together industrial and academic actors to develop various Semantic Desktop technologies.
Smart tags are an early selection-based search feature, found in later versions of Microsoft Word and beta versions of the Internet Explorer 6 web browser, by which the application recognizes certain words or types of data and converts it to a hyperlink. It is also included in other Microsoft Office programs as well as Visual Web Developer. Selection-based search allows a user to invoke an online service from any other page using only the mouse. Microsoft had initially intended the technology to be built into its Windows XP operating system but changed its plans due to public criticism.
Business Intelligence 2.0 is a development of the existing business intelligence model that began in the mid-2000s, where data can be obtained from many sources. The process allows for querying real-time corporate data by employees but approaches the data with a web browser-based solution. This is in contrast to previous proprietary querying tools that characterized previous BI software.
Twine was an online social web service for information storage, authoring and discovery that existed from 2007 to 2010. It was created and run by Radar Networks. It was announced on October 19, 2007 and opened to the public on October 21, 2008. On March 11, 2010, Radar Networks was acquired by Evri Inc. along with Twine.com. On May 14, 2010, twine.com was shut down, becoming a redirect to evri.com.
Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.
The Open Semantic Framework (OSF) is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components developed specifically to provide a complete Web application framework. OSF is made available under the Apache 2 license.
Case Study of Open Calais
{{cite book}}
: |work=
ignored (help)