Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodologically similar to information extraction (NLP) and ETL (data warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.
The RDB2RDF W3C group [1] is currently standardizing a language for the extraction of Resource Description Framework (RDF) data from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia and Freebase).
After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding transforming relational databases into RDF, identity resolution, knowledge discovery and ontology learning. The general process uses traditional methods from information extraction and extract, transform, and load (ETL), which transform the data from the sources into structured formats.
The following criteria can be used to categorize approaches in this topic (some of them only account for extraction from relational databases): [2]
Criterion | Description |
---|---|
Source | Which data sources are covered: text, relational databases, XML, CSV |
Exposition | How is the extracted knowledge made explicit (ontology file, semantic database)? How can you query it? |
Synchronization | Is the knowledge extraction process executed once to produce a dump, or is the result synchronized with the source (static or dynamic)? Are changes to the result written back (bi-directional)? |
Reuse of vocabularies | Is the tool able to reuse existing vocabularies in the extraction? For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping vocabularies. |
Automatization | The degree to which the extraction is assisted or automated: manual, GUI, semi-automatic, automatic. |
Requires a domain ontology | Is a pre-existing ontology needed to map to? Either a mapping is created or a schema is learned from the source (ontology learning). |
President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance.
Name | marriedTo | homepage | status_id |
---|---|---|---|
Peter | Mary | http://example.org/Peters_page | 1 |
Claus | Eva | http://example.org/Claus_page | 2 |
    :Peter :marriedTo :Mary .
    :marriedTo a owl:SymmetricProperty .
    :Peter foaf:homepage <http://example.org/Peters_page> .
    :Peter a foaf:Person .
    :Peter a :Student .
    :Claus a :Teacher .
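Declaring :marriedTo a symmetric property is what allows a reasoner to infer a statement that is not stored explicitly. The following is a minimal Python sketch of that single inference rule, not a full OWL reasoner; the function name `apply_symmetry` and the representation of triples as plain string tuples are assumptions made for illustration.

```python
def apply_symmetry(triples):
    """Add the inverse statement for every property declared symmetric."""
    # Collect all properties typed as owl:SymmetricProperty.
    symmetric = {s for s, p, o in triples
                 if p == "a" and o == "owl:SymmetricProperty"}
    # For each statement using such a property, swap subject and object.
    inferred = {(o, p, s) for s, p, o in triples if p in symmetric}
    return set(triples) | inferred


triples = {
    (":Peter", ":marriedTo", ":Mary"),
    (":marriedTo", "a", "owl:SymmetricProperty"),
    (":Peter", "a", "foaf:Person"),
}
# The inverse statement is not in the input but is present after inference.
print((":Mary", ":marriedTo", ":Peter") in apply_symmetry(triples))
```

A plain relational table would need a second row or an application-level rule to express the same fact.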
When building an RDB representation of a problem domain, the starting point is frequently an entity-relationship diagram (ERD). Typically, each entity is represented as a database table, each attribute of the entity becomes a column in that table, and relationships between entities are indicated by foreign keys. Each table typically defines a particular class of entity, each column one of its attributes. Each row in the table describes an entity instance, uniquely identified by a primary key. The table rows collectively describe an entity set. In an equivalent RDF representation of the same entity set, each column in the table is an attribute (i.e., predicate), each column value is an attribute value (i.e., object), each row key represents an entity ID (i.e., subject), and each row (entity instance) is represented by a collection of triples with a common subject (the entity ID).
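So, to render an equivalent view based on RDF semantics, the basic mapping algorithm creates an RDFS class for each table, converts all primary keys and foreign keys into IRIs, assigns a predicate IRI to each column, adds an rdf:type statement for each row that links it to the RDFS class of its table, and, for each column that is neither part of a primary nor of a foreign key, constructs a triple with the primary key IRI as the subject, the column IRI as the predicate and the column's value as the object.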
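The basic mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the W3C specification: the base IRI `http://example.org/`, the IRI layout, the table name `Person`, the referenced table `Status` and the function name `direct_mapping` are all hypothetical, and literals are written as simple quoted strings without datatypes or escaping.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical base IRI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def direct_mapping(table, primary_key, rows, foreign_keys=None):
    """Map relational rows to (subject, predicate, object) triples.

    foreign_keys maps a column name to the name of the table it references.
    """
    foreign_keys = foreign_keys or {}
    class_iri = f"<{BASE}{quote(table)}>"
    triples = []
    for row in rows:
        # The primary key value becomes the subject IRI of the row.
        subject = f"<{BASE}{quote(table)}/{quote(str(row[primary_key]))}>"
        # Each row is typed with the class derived from its table.
        triples.append((subject, f"<{RDF_TYPE}>", class_iri))
        for column, value in row.items():
            if column == primary_key or value is None:
                continue
            # Each column gets its own predicate IRI.
            predicate = f"<{BASE}{quote(table)}#{quote(column)}>"
            if column in foreign_keys:
                # A foreign key points at the IRI of the referenced row.
                target = quote(foreign_keys[column])
                obj = f"<{BASE}{target}/{quote(str(value))}>"
            else:
                # Any other column value becomes a (simplified) literal.
                obj = f'"{value}"'
            triples.append((subject, predicate, obj))
    return triples


rows = [
    {"Name": "Peter", "marriedTo": "Mary", "status_id": 1},
    {"Name": "Claus", "marriedTo": "Eva", "status_id": 2},
]
triples = direct_mapping("Person", "Name", rows,
                         foreign_keys={"status_id": "Status"})
for triple in triples:
    print(" ".join(triple), ".")
```

Each input row yields one type statement plus one triple per non-key column, so the two example rows produce six triples.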
An early mention of this basic or direct mapping can be found in Tim Berners-Lee's comparison of the ER model to the RDF model. [4]
The 1:1 mapping mentioned above exposes the legacy data as RDF in a straightforward way; additional refinements can be employed to improve the usefulness of the RDF output with respect to the given use cases. Normally, information is lost during the transformation of an entity-relationship diagram (ERD) to relational tables (details can be found in object-relational impedance mismatch) and has to be reverse engineered. From a conceptual view, approaches for extraction can come from two directions. The first direction tries to extract or learn an OWL schema from the given database schema. Early approaches used a fixed number of manually created mapping rules to refine the 1:1 mapping. [5] [6] [7] More elaborate methods employ heuristics or learning algorithms to induce schematic information (these methods overlap with ontology learning). While some approaches try to extract the information from the structure inherent in the SQL schema [8] (analysing e.g. foreign keys), others analyse the content and the values in the tables to create conceptual hierarchies [9] (e.g. columns with few values are candidates for becoming categories). The second direction tries to map the schema and its contents to a pre-existing domain ontology (see also: ontology alignment). Often, however, a suitable domain ontology does not exist and has to be created first.
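The content-based heuristic (columns with few values are candidates for becoming categories) can be illustrated with a small Python sketch. The function name `category_candidates`, both thresholds and the two extra example rows are assumptions chosen for illustration; the cited approaches use more elaborate criteria.

```python
def category_candidates(rows, max_distinct=5, max_ratio=0.5):
    """Return columns whose values repeat often enough to suggest categories."""
    if not rows:
        return {}
    candidates = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        if not values:
            continue
        distinct = set(values)
        # Few distinct values relative to the number of rows hints at a category.
        if len(distinct) <= max_distinct and len(distinct) / len(values) <= max_ratio:
            candidates[column] = sorted(distinct)
    return candidates


rows = [
    {"Name": "Peter", "status_id": 1},
    {"Name": "Claus", "status_id": 2},
    {"Name": "Mary", "status_id": 1},   # illustrative extra row
    {"Name": "Eva", "status_id": 2},    # illustrative extra row
]
print(category_candidates(rows))
```

Here `status_id` is reported as a candidate, while `Name` is not because every value is distinct. Each reported value could then become a class such as :Student or :Teacher.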
As XML is structured as a tree, any data can be easily represented in RDF, which is structured as a graph. XML2RDF is one example of an approach that uses RDF blank nodes and transforms XML elements and attributes to RDF properties. The topic, however, is more complex than in the case of relational databases. In a relational table the primary key is an ideal candidate for becoming the subject of the extracted triples. An XML element, however, can be transformed, depending on the context, into a subject, a predicate or an object of a triple. XSLT can be used as a standard transformation language to manually convert XML to RDF.
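One possible way to turn an XML tree into triples with blank nodes is sketched below in Python. It is not the algorithm of the XML2RDF tool; the function name `xml_to_triples` and the rule that leaf elements become literals while nested elements become new blank nodes are assumptions. Predicates are left as bare tag names rather than IRIs to keep the example short.

```python
import itertools
import xml.etree.ElementTree as ET


def xml_to_triples(xml_text):
    """Turn an XML tree into triples, using one blank node per element."""
    counter = itertools.count()
    triples = []

    def visit(element):
        node = f"_:b{next(counter)}"
        # Attributes become properties with literal values.
        for name, value in element.attrib.items():
            triples.append((node, name, f'"{value}"'))
        for child in element:
            if len(child) == 0 and not child.attrib:
                # A leaf element becomes a property with its text as literal.
                text = (child.text or "").strip()
                triples.append((node, child.tag, f'"{text}"'))
            else:
                # A nested element becomes a property pointing at a new blank node.
                triples.append((node, child.tag, visit(child)))
        return node

    visit(ET.fromstring(xml_text))
    return triples


doc = '<person id="1"><name>Peter</name><spouse><name>Mary</name></spouse></person>'
for s, p, o in xml_to_triples(doc):
    print(s, p, o, ".")
```

The same `name` element appears as a predicate in both places, while `spouse` acts as a predicate whose object is itself the subject of further triples. This shows how the role of an element depends on its context.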
Name | Data Source | Data Exposition | Data Synchronisation | Mapping Language | Vocabulary Reuse | Mapping Automat. | Req. Domain Ontology | Uses GUI |
---|---|---|---|---|---|---|---|---|
A Direct Mapping of Relational Data to RDF | Relational Data | SPARQL/ETL | dynamic | — | false | automatic | false | false |
CSV2RDF4LOD | CSV | ETL | static | RDF | true | manual | false | false |
CoNLL-RDF | TSV, CoNLL | SPARQL/ RDF stream | static | none | true | automatic (domain-specific, for use cases in language technology, preserves relations between rows) | false | false |
Convert2RDF | Delimited text file | ETL | static | RDF/DAML | true | manual | false | true |
D2R Server | RDB | SPARQL | bi-directional | D2R Map | true | manual | false | false |
DartGrid | RDB | own query language | dynamic | Visual Tool | true | manual | false | true |
DataMaster | RDB | ETL | static | proprietary | true | manual | true | true |
Google Refine's RDF Extension | CSV, XML | ETL | static | none | | semi-automatic | false | true |
Krextor | XML | ETL | static | xslt | true | manual | true | false |
MAPONTO | RDB | ETL | static | proprietary | true | manual | true | false |
METAmorphoses | RDB | ETL | static | proprietary xml based mapping language | true | manual | false | true |
MappingMaster | CSV | ETL | static | MappingMaster | true | GUI | false | true |
ODEMapster | RDB | ETL | static | proprietary | true | manual | true | true |
OntoWiki CSV Importer Plug-in - DataCube & Tabular | CSV | ETL | static | The RDF Data Cube Vocabulary | true | semi-automatic | false | true |
Poolparty Extraktor (PPX) | XML, Text | LinkedData | dynamic | RDF (SKOS) | true | semi-automatic | true | false |
RDBToOnto | RDB | ETL | static | none | false | automatic, the user furthermore has the chance to fine-tune results | false | true |
RDF 123 | CSV | ETL | static | false | false | manual | false | true |
RDOTE | RDB | ETL | static | SQL | true | manual | true | true |
Relational.OWL | RDB | ETL | static | none | false | automatic | false | false |
T2LD | CSV | ETL | static | false | false | automatic | false | false |
The RDF Data Cube Vocabulary | Multidimensional statistical data in spreadsheets | | | Data Cube Vocabulary | true | manual | false | |
TopBraid Composer | CSV | ETL | static | SKOS | false | semi-automatic | false | true |
Triplify | RDB | LinkedData | dynamic | SQL | true | manual | false | false |
Ultrawrap | RDB | SPARQL/ETL | dynamic | R2RML | true | semi-automatic | false | true |
Virtuoso RDF Views | RDB | SPARQL | dynamic | Meta Schema Language | true | semi-automatic | false | true |
Virtuoso Sponger | structured and semi-structured data sources | SPARQL | dynamic | Virtuoso PL & XSLT | true | semi-automatic | false | false |
VisAVis | RDB | RDQL | dynamic | SQL | true | manual | true | true |
XLWrap: Spreadsheet to RDF | CSV | ETL | static | TriG Syntax | true | manual | false | false |
XML to RDF | XML | ETL | static | false | false | automatic | false | false |
The largest portion of information contained in business documents (about 80% [10]) is encoded in natural language and is therefore unstructured. Because unstructured data is a challenge for knowledge extraction, more sophisticated methods are required, which generally tend to supply worse results than extraction from structured data. The potential for a massive acquisition of extracted knowledge, however, should compensate for the increased complexity and decreased quality of extraction. In the following, natural language sources are understood as sources of information where the data is given in an unstructured fashion as plain text. If the given text is additionally embedded in a markup document (e.g. an HTML document), the mentioned systems normally remove the markup elements automatically.
As a preprocessing step to knowledge extraction, it can be necessary to perform linguistic annotation by one or multiple NLP tools. Individual modules in an NLP workflow normally build on tool-specific formats for input and output, but in the context of knowledge extraction, structured formats for representing linguistic annotations have been applied.
Typical NLP tasks relevant to knowledge extraction include:
In NLP, such data is typically represented in TSV formats (CSV formats with TAB as separators), often referred to as CoNLL formats. For knowledge extraction workflows, RDF views on such data have been created in accordance with the following community standards:
Other, platform-specific formats include
Traditional information extraction (IE) [20] is a technology of natural language processing that extracts information from typically natural language texts and structures it in a suitable manner. The kinds of information to be identified must be specified in a model before beginning the process, which is why the whole process of traditional information extraction is domain-dependent. IE is split into the following five subtasks: named entity recognition (NER), coreference resolution (CO), template element construction (TE), template relation construction (TR) and template scenario production (ST).
The task of named entity recognition is to recognize and categorize all named entities contained in a text (assignment of a named entity to a predefined category). This works by applying grammar-based methods or statistical models.
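As a rough illustration of assigning named entities to predefined categories, the following Python sketch uses a dictionary (gazetteer) lookup. This is a much cruder stand-in for the grammar-based or statistical methods used by real systems; the gazetteer entries, their categories and the function name `recognize_entities` are hypothetical.

```python
import re

# Hypothetical gazetteer mapping surface forms to predefined categories.
GAZETTEER = {
    "IBM Europe": "Organization",
    "IBM": "Organization",
    "Congress": "Organization",
    "Obama": "Person",
}


def recognize_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, category) for gazetteer matches."""
    # Try longer names first so that "IBM Europe" wins over "IBM".
    names = sorted(gazetteer, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(m.start(), m.end(), m.group(), gazetteer[m.group()])
            for m in pattern.finditer(text)]


for entity in recognize_entities("IBM Europe is part of IBM."):
    print(entity)
```

A lookup of this kind cannot recognize names missing from the dictionary, and it cannot disambiguate by context. Those are the gaps that grammar rules and statistical models are meant to close.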
Coreference resolution identifies equivalent entities, which were recognized by NER, within a text. There are two relevant kinds of equivalence relationship. The first relates to the relationship between two differently represented entities (e.g. IBM Europe and IBM) and the second to the relationship between an entity and its anaphoric references (e.g. it and IBM). Both kinds can be recognized by coreference resolution.
During template element construction the IE system identifies descriptive properties of the entities recognized by NER and CO. These properties correspond to ordinary qualities like red or big.
Template relation construction identifies relations that exist between the template elements. These relations can be of several kinds, such as works-for or located-in, with the restriction that both domain and range correspond to entities.
In template scenario production, events described in the text are identified and structured with respect to the entities recognized by NER and CO and the relations identified by TR.
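The restriction that both domain and range of a relation must be entities can be made concrete with a pattern-based Python sketch. The two surface patterns, the example sentences and the function name `extract_relations` are assumptions for illustration; real systems rely on far richer linguistic analysis.

```python
import re

# Hypothetical surface patterns, one per relation type.
PATTERNS = {
    "works-for": re.compile(r"(?P<subj>[A-Z]\w+) works for (?P<obj>[A-Z]\w+)"),
    "located-in": re.compile(r"(?P<subj>[A-Z]\w+) is located in (?P<obj>[A-Z]\w+)"),
}


def extract_relations(text, entities):
    """Return (subject, relation, object) where both ends are known entities."""
    relations = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            subj, obj = match.group("subj"), match.group("obj")
            # Domain and range must both be previously recognized entities.
            if subj in entities and obj in entities:
                relations.append((subj, relation, obj))
    return relations


text = "Peter works for IBM. IBM is located in Armonk. Peter works for Fun."
print(extract_relations(text, entities={"Peter", "IBM", "Armonk"}))
```

The last sentence matches the works-for pattern but is discarded, because "Fun" was not recognized as an entity in the earlier steps.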
Ontology-based information extraction (OBIE) [10] is a subfield of information extraction in which at least one ontology is used to guide the process of information extraction from natural language text. The OBIE system uses methods of traditional information extraction to identify concepts, instances and relations of the used ontologies in the text, which are structured into an ontology after the process. Thus, the input ontologies constitute the model of information to be extracted. [21]
Ontology learning (OL) is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms from natural language text. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
During semantic annotation (SA), [22] natural language text is augmented with metadata (often represented in RDFa), which should make the semantics of the contained terms machine-understandable. In this process, which is generally semi-automatic, knowledge is extracted in the sense that a link is established between lexical terms and, for example, concepts from ontologies. Thus, knowledge is gained about which meaning of a term was intended in the processed context, and the meaning of the text is therefore grounded in machine-readable data with the ability to draw inferences. Semantic annotation is typically split into the following two subtasks.
At the terminology extraction level, lexical terms are extracted from the text. For this purpose, a tokenizer first determines the word boundaries and resolves abbreviations. Afterwards, terms from the text that correspond to a concept are extracted with the help of a domain-specific lexicon, so that they can be linked during entity linking.
In entity linking [23] a link is established between the lexical terms extracted from the source text and the concepts from an ontology or knowledge base such as DBpedia. For this, candidate concepts corresponding to the several meanings of a term are detected with the help of a lexicon. Finally, the context of the terms is analyzed to determine the most appropriate disambiguation and to assign the term to the correct concept.
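The two steps of entity linking, candidate detection via a lexicon and disambiguation via context, can be sketched in Python as follows. The lexicon, the `ex:` concept identifiers, the context words and the function name `link_entities` are hypothetical, and counting overlapping context words is only one simple disambiguation strategy among many.

```python
import re

# Hypothetical lexicon: each term maps to candidate concepts with context words.
LEXICON = {
    "jaguar": {
        "ex:Jaguar_(animal)": {"cat", "jungle", "prey", "species"},
        "ex:Jaguar_Cars": {"car", "engine", "drive", "dealer"},
    },
}


def link_entities(text, lexicon=LEXICON):
    """Link each lexicon term in the text to its best-matching concept."""
    tokens = re.findall(r"\w+", text.lower())
    context = set(tokens)
    links = {}
    for term in tokens:
        candidates = lexicon.get(term)
        if not candidates:
            continue
        # Score each candidate by how many of its context words occur in the text.
        links[term] = max(candidates, key=lambda c: len(candidates[c] & context))
    return links


print(link_entities("The jaguar stalked its prey in the jungle."))
print(link_entities("The dealer let me drive the Jaguar."))
```

The same term is linked to a different concept in each sentence, because the surrounding words overlap with a different candidate.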
Note that "semantic annotation" in the context of knowledge extraction is not to be confused with semantic parsing as understood in natural language processing (also referred to as "semantic annotation"): semantic parsing aims at a complete, machine-readable representation of natural language, whereas semantic annotation in the sense of knowledge extraction tackles only a very elementary aspect of that.
The following criteria can be used to categorize tools that extract knowledge from natural language text.
Criterion | Description |
---|---|
Source | Which input formats can be processed by the tool (e.g. plain text, HTML or PDF)? |
Access Paradigm | Can the tool query the data source, or does it require a whole dump for the extraction process? |
Data Synchronization | Is the result of the extraction process synchronized with the source? |
Uses Output Ontology | Does the tool link the result with an ontology? |
Mapping Automation | How automated is the extraction process (manual, semi-automatic or automatic)? |
Requires Ontology | Does the tool need an ontology for the extraction? |
Uses GUI | Does the tool offer a graphical user interface? |
Approach | Which approach (IE, OBIE, OL or SA) is used by the tool? |
Extracted Entities | Which types of entities (e.g. named entities, concepts or relationships) can be extracted by the tool? |
Applied Techniques | Which techniques are applied (e.g. NLP, statistical methods, clustering or machine learning)? |
Output Model | Which model is used to represent the result of the tool (e.g. RDF or OWL)? |
Supported Domains | Which domains are supported (e.g. economy or biology)? |
Supported Languages | Which languages can be processed (e.g. English or German)? |
The following table characterizes some tools for knowledge extraction from natural language sources.
Name | Source | Access Paradigm | Data Synchronization | Uses Output Ontology | Mapping Automation | Requires Ontology | Uses GUI | Approach | Extracted Entities | Applied Techniques | Output Model | Supported Domains | Supported Languages |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[24] | plain text, HTML, XML, SGML | dump | no | yes | automatic | yes | yes | IE | named entities, relationships, events | linguistic rules | proprietary | domain-independent | English, Spanish, Arabic, Chinese, Indonesian |
AlchemyAPI [25] | plain text, HTML | automatic | yes | SA | multilingual | ||||||||
ANNIE [26] | plain text | dump | yes | yes | IE | finite state algorithms | multilingual | ||||||
ASIUM [27] | plain text | dump | semi-automatic | yes | OL | concepts, concept hierarchy | NLP, clustering | ||||||
Attensity Exhaustive Extraction [28] | automatic | IE | named entities, relationships, events | NLP | |||||||||
Dandelion API | plain text, HTML, URL | REST | no | no | automatic | no | yes | SA | named entities, concepts | statistical methods | JSON | domain-independent | multilingual |
DBpedia Spotlight [29] | plain text, HTML | dump, SPARQL | yes | yes | automatic | no | yes | SA | annotation to each word, annotation to non-stopwords | NLP, statistical methods, machine learning | RDFa | domain-independent | English |
EntityClassifier.eu | plain text, HTML | dump | yes | yes | automatic | no | yes | IE, OL, SA | annotation to each word, annotation to non-stopwords | rule-based grammar | XML | domain-independent | English, German, Dutch |
FRED [30] | plain text | dump, REST API | yes | yes | automatic | no | yes | IE, OL, SA, ontology design patterns, frame semantics | (multi-)word NIF or EarMark annotation, predicates, instances, compositional semantics, concept taxonomies, frames, semantic roles, periphrastic relations, events, modality, tense, entity linking, event linking, sentiment | NLP, machine learning, heuristic rules | RDF/OWL | domain-independent | English, other languages via translation |
iDocument [31] | HTML, PDF, DOC | SPARQL | yes | yes | OBIE | instances, property values | NLP | personal, business | |||||
NetOwl Extractor [32] | plain text, HTML, XML, SGML, PDF, MS Office | dump | no | yes | automatic | yes | yes | IE | named entities, relationships, events | NLP | XML, JSON, RDF-OWL, others | multiple domains | English, Arabic, Chinese (Simplified and Traditional), French, Korean, Persian (Farsi and Dari), Russian, Spanish |
OntoGen [33] | semi-automatic | yes | OL | concepts, concept hierarchy, non-taxonomic relations, instances | NLP, machine learning, clustering | ||||||||
OntoLearn [34] | plain text, HTML | dump | no | yes | automatic | yes | no | OL | concepts, concept hierarchy, instances | NLP, statistical methods | proprietary | domain-independent | English |
OntoLearn Reloaded | plain text, HTML | dump | no | yes | automatic | yes | no | OL | concepts, concept hierarchy, instances | NLP, statistical methods | proprietary | domain-independent | English |
OntoSyphon [35] | HTML, PDF, DOC | dump, search engine queries | no | yes | automatic | yes | no | OBIE | concepts, relations, instances | NLP, statistical methods | RDF | domain-independent | English |
ontoX [36] | plain text | dump | no | yes | semi-automatic | yes | no | OBIE | instances, datatype property values | heuristic-based methods | proprietary | domain-independent | language-independent |
OpenCalais | plain text, HTML, XML | dump | no | yes | automatic | yes | no | SA | annotation to entities, annotation to events, annotation to facts | NLP, machine learning | RDF | domain-independent | English, French, Spanish |
PoolParty Extractor [37] | plain text, HTML, DOC, ODT | dump | no | yes | automatic | yes | yes | OBIE | named entities, concepts, relations, concepts that categorize the text, enrichments | NLP, machine learning, statistical methods | RDF, OWL | domain-independent | English, German, Spanish, French |
Rosoka | plain text, HTML, XML, SGML, PDF, MS Office | dump | yes | yes | automatic | no | yes | IE | named entity extraction, entity resolution, relationship extraction, attributes, concepts, multi-vector sentiment analysis, geotagging, language identification | NLP, machine learning | XML, JSON, POJO, RDF | multiple domains | multilingual (200+ languages) |
SCOOBIE | plain text, HTML | dump | no | yes | automatic | no | no | OBIE | instances, property values, RDFS types | NLP, machine learning | RDF, RDFa | domain-independent | English, German |
SemTag [38] [39] | HTML | dump | no | yes | automatic | yes | no | SA | machine learning | database record | domain-independent | language-independent | |
smart FIX | plain text, HTML, PDF, DOC, e-Mail | dump | yes | no | automatic | no | yes | OBIE | named entities | NLP, machine learning | proprietary | domain-independent | English, German, French, Dutch, Polish |
Text2Onto [40] | plain text, HTML, PDF | dump | yes | no | semi-automatic | yes | yes | OL | concepts, concept hierarchy, non-taxonomic relations, instances, axioms | NLP, statistical methods, machine learning, rule-based methods | OWL | domain-independent | English, German, Spanish |
Text-To-Onto [41] | plain text, HTML, PDF, PostScript | dump | semi-automatic | yes | yes | OL | concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations | NLP, machine learning, clustering, statistical methods | German | ||||
ThatNeedle | Plain Text | dump | automatic | no | concepts, relations, hierarchy | NLP, proprietary | JSON | multiple domains | English | ||||
The Wiki Machine [42] | plain text, HTML, PDF, DOC | dump | no | yes | automatic | yes | yes | SA | annotation to proper nouns, annotation to common nouns | machine learning | RDFa | domain-independent | English, German, Spanish, French, Portuguese, Italian, Russian |
ThingFinder [43] | IE | named entities, relationships, events | multilingual |
Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data. [44] It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain, and is closely related to it both in terms of methodology and terminology. [45]
The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further analysis and discovery. Often the outcomes of knowledge discovery are not actionable; actionable knowledge discovery, also known as domain driven data mining, [46] aims to discover and deliver actionable knowledge and insights.
Another promising application of knowledge discovery is in the area of software modernization, weakness discovery and compliance, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity-relationship model is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery in existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous value for risk management and business value, key for the evaluation and evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as process flows (e.g. data flows, control flows and call maps), architecture, database schemas, and business rules, terms and processes.