Schema-agnostic databases

Last updated

Schema-agnostic databases or vocabulary-independent databases aim at supporting users to be abstracted from the representation of the data, supporting the automatic semantic matching between queries and databases. Schema-agnosticism is the property of a database of mapping a query issued with the user terminology and structure, automatically mapping it to the dataset vocabulary.

Contents

The increase in the size and in the semantic heterogeneity of database schemas bring new requirements for users querying and searching structured data. At this scale it can become unfeasible for data consumers to be familiar with the representation of the data in order to query it. At the center of this discussion is the semantic gap between users and databases, which becomes more central as the scale and complexity of the data grows.

Description

The evolution of data environments towards the consumption of data from multiple data sources and the growth in the schema size, complexity, dynamicity and decentralisation (SCoDD) of schemas [1] [2] [3] increases the complexity of contemporary data management. The SCoDD trend emerges as a central data management concern in Big Data scenarios, where users and applications have a demand for more complete data, produced by independent data sources, under different semantic assumptions and contexts of use, which is the typical scenario for Semantic Web Data applications.

The evolution of databases in the direction of heterogeneous data environments strongly impacts the usability, semiotics and semantic assumptions behind existing data accessibility methods such as structured queries, keyword-based search and visual query systems. With schema-less databases containing potentially millions of dynamically changing attributes, it becomes unfeasible for some users to become aware of the 'schema' or vocabulary in order to query the database. At this scale, the effort in understanding the schema in order to build a structured query can become prohibitive.

Schema-agnostic queries

Schema-agnostic queries can be defined as query approaches over structured databases which allow users satisfying complex information needs without the understanding of the representation (schema) of the database. Similarly, Tran et al. [4] defines it as "search approaches, which do not require users to know the schema underlying the data". Approaches such as keyword-based search over databases allow users to query databases without employing structured queries. However, as discussed by Tran et al.: "From these points, users however have to do further navigation and exploration to address complex information needs. Unlike keyword search used on the Web, which focuses on simple needs, the keyword search elaborated here is used to obtain more complex results. Instead of a single set of resources, the goal is to compute complex sets of resources and their relations."

The development of approaches to support natural language interfaces (NLI) over databases have aimed towards the goal of schema-agnostic queries. Complementarily, some approaches based on keyword search have targeted keyword-based queries which express more complex information needs. Other approaches have explored the construction of structured queries over databases where schema constraints can be relaxed. All these approaches (natural language, keyword-based search and structured queries) have targeted different degrees of sophistication in addressing the problem of supporting a flexible semantic matching between queries and data, which vary from the completely absence of the semantic concern to more principled semantic models. While the demand for schema-agnosticism has been an implicit requirement across semantic search and natural language query systems over structured data, it is not sufficiently individuated as a concept and as a necessary requirement for contemporary database management systems. Recent works have started to define and model the semantic aspects involved on schema-agnostic queries. [1] [5] [6]

Schema-agnostic structured queries

Consist of schema-agnostic queries following the syntax of a structured standard (for example SQL, SPARQL). The syntax and semantics of operators are maintained, while different terminologies are used.

Example 1

SELECT ?y {   BillClinton hasDaughter ?x .   ?x marriedTo ?y . } 

which maps to the following SPARQL query in the dataset vocabulary:

PREFIX:<http://dbpedia.org/resource/>PREFIXdbpedia2:<http://dbpedia.org/property/>PREFIXdbpedia:<http://dbpedia.org/ontology/>PREFIXskos:<http://www.w3.org/2004/02/skos/core#>PREFIXdbo:<http://dbpedia.org/ontology/>SELECT?y{:Bill_Clintondbpedia:child?x.?xdbpedia2:spouse?y.}

Example 2

SELECT?x{?xisAbook.?xbyWilliam_Goldman.?xhas_pages?p.FILTER(?p>300)}

which maps to the following SPARQL query in the dataset vocabulary:

PREFIXrdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>PREFIX:<http://dbpedia.org/resource/>PREFIXdbpedia2:<http://dbpedia.org/property/>PREFIXdbpedia:<http://dbpedia.org/ontology/>SELECT?x{?xrdf:typedbpedia:Book.?xdbpedia2:author:William_Goldman.?xdbpedia:numberOfPages?p.FILTER(?p>300)}

Schema-agnostic keyword queries

Consist of schema-agnostic queries using keyword queries. In this case the syntax and semantics of operators are different from the structured query syntax.

Example

"Bill Clinton daughter married to" 
"Books by William Goldman with more than 300 pages" 

Semantic complexity

As of 2016 the concept of schema-agnostic queries has been developed primarily in academia. Most of schema-agnostic query systems have been investigated in the context of Natural Language Interfaces over databases or over the Semantic Web. [7] These works explore the application of semantic parsing techniques over large, heterogeneous and schema-less databases. More recently, the individuation of the concept of schema-agnostic query systems and databases have appeared more explicitly within the literature. [1] [5] [6] Freitas et al. [8] provide a probabilistic model on the semantic complexity of mapping schema-agnostic queries.

Related Research Articles

The Semantic Web is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax notations and data serialization formats. It is also used in knowledge management applications.

The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies. Ontologies are a formal way to describe taxonomies and classification networks, essentially defining the structure of knowledge for various domains: the nouns representing classes of objects and the verbs representing relations between the objects.

RDF Schema is a set of classes with certain properties using the RDF extensible knowledge representation data model, providing basic elements for the description of ontologies. It uses various forms of RDF vocabularies, intended to structure RDF resources. RDF and RDFS can be saved in a triplestore, then one can entail some knowledge from them using a query language, like SPARQL.

SPARQL is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 was acknowledged by W3C as an official recommendation, and SPARQL 1.1 in March, 2013.

A semantic wiki is a wiki that has an underlying model of the knowledge described in its pages. Regular, or syntactic, wikis have structured text and untyped hyperlinks. Semantic wikis, on the other hand, provide the ability to capture or identify information about the data within pages, and the relationships between pages, in ways that can be queried or exported like a database through semantic queries.

RDFa is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The RDF data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.

Oracle Spatial and Graph, formerly Oracle Spatial, is a free option component of the Oracle Database. The spatial features in Oracle Spatial and Graph aid users in managing geographic and location-data in a native type within an Oracle database, potentially supporting a wide range of applications — from automated mapping, facilities management, and geographic information systems (AM/FM/GIS), to wireless location services and location-enabled e-business. The graph features in Oracle Spatial and Graph include Oracle Network Data Model (NDM) graphs used in traditional network applications in major transportation, telcos, utilities and energy organizations and RDF semantic graphs used in social networks and social interactions and in linking disparate data sets to address requirements from the research, health sciences, finance, media and intelligence communities.

Terse RDF Triple Language (Turtle) is a syntax and file format for expressing data in the Resource Description Framework (RDF) data model. Turtle syntax is similar to that of SPARQL, an RDF query language. It is a common data format for storing RDF data, along with N-Triples, JSON-LD and RDF/XML.

An RDF query language is a computer language, specifically a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.

Linked data Structured data and method for its publication

In computing, linked data is structured data which is interlinked with other data so it becomes more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages only for human readers, it extends them to share information in a way that can be read automatically by computers. Part of the vision of linked data is for the Internet to become a global database.

DBpedia Online database project

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

A triplestore or RDF store is a purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject-predicate-object, like "Bob is 35" or "Bob knows Fred".

In computing, a graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are perpetually stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data.

Named graph

Named graphs are a key concept of Semantic Web architecture in which a set of Resource Description Framework statements are identified using a URI, allowing descriptions to be made of that set of statements such as context, provenance information or other such metadata.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criteria is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

GeoSPARQL is a standard for representation and querying of geospatial linked data for the Semantic Web from the Open Geospatial Consortium (OGC). The definition of a small ontology based on well-understood OGC standards is intended to provide a standardized exchange basis for geospatial RDF data which can support both qualitative and quantitative spatial reasoning and querying with the SPARQL database query language.

Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results or to answer more fuzzy and wide open questions through pattern matching and digital reasoning.

A distributional–relational database, or word-vector database, is a database management system (DBMS) that uses distributional word-vector representations to enrich the semantics of structured data.

Blazegraph

Blazegraph is a triplestore and graph database, which is used in the Wikidata SPARQL endpoint.

References

  1. 1 2 3 A. Freitas, "Schema-agnostic queries over large-schema databases: a distributional semantics approach" PhD Thesis, 2015
  2. Pat Helland, ["If you have too much data, then 'good enough' is good enough"], Commun. ACM 54(6): 40–47, 2011.
  3. M. L. Brodie and J. T. Liu, ["The power and limits of relational technology in the age of information ecosystems"], Keynote, On The Move Federated Conferences, Heraklion, Greece, October 25–29, 2010.
  4. T. Tran, T. Mathaess, P. Haase, ["Usability of Keyword-driven Schema-agnostic Search – A Comparative Study of Keyword Search, Faceted Search, Query Completion and Result Completion"], In Proceedings of 7th Extended Semantic Web Conference (ESWC'10). Heraklion, Greece, June, 2010.
  5. 1 2 A. Freitas, J. C. Pereira Da Silva, E. Curry, "On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study", Workshop of the Natural Language Interfaces for the Web of Data (NLIWoD), 13th International Semantic Web Conference (ISWC), Rival del Garda, 2014.
  6. 1 2 S. Bischof, M. Kroetzsch, A. Polleres, S. Rudolph, ["Schema-Agnostic Query Rewriting in SPARQL 1.1"], In Proceedings of the 13th International Semantic Web Conference. Springer 2014.
  7. Unger et al., ["Introduction to Question Answering over Linked Data"], In Proceedings of the 2014 Reasoning Web Summer School, 2014
  8. A. Freitas, J. E. Sales, S. Handschuh, E. Curry, "How hard is the Query? Measuring the Semantic Complexity of Schema-Agnostic Queries", In Proceedings of the 11th International Conference on Computational Semantics (IWCS), London, 2015.