Property graph

The data model of "property graphs", "labeled property graphs", or "attributed graphs" has emerged since the early 2000s as a common denominator of various models of graph-oriented databases. [1] It can be defined informally as follows:

Properties take the form of key-value pairs, as used for example in JSON. Keys are character strings. Values are either numbers or character strings. These properties fall within the usual definition of attributes as understood in entity-attribute-value or object-oriented modeling, which is why the phrase "attributed graph" is relevant. Unlike in RDF graphs, properties are not arcs of the graph proper. This is another reason why it would be preferable to call them attributed graphs, or graphs with properties, rather than "property graphs", which is misleading.

Relationships are represented by arcs of the graph. These are often called edges, even though, strictly speaking, edges belong to undirected graphs. Arcs must have an identifier, a source node and a target node, and may have one or more attributes/properties in the previous sense.
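
The informal definition above can be sketched as a small data structure. The following is a minimal illustration in Python, not a reference implementation: the names `Node`, `Arc`, `AttributedGraph`, `add_node`, `add_arc` and `outgoing`, as well as the sample data, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Property values are either numbers or character strings.
Value = Union[int, float, str]


@dataclass
class Node:
    node_id: str
    properties: Dict[str, Value] = field(default_factory=dict)


@dataclass
class Arc:
    arc_id: str      # every arc carries its own identifier
    source: str      # identifier of the source node
    target: str      # identifier of the target node
    properties: Dict[str, Value] = field(default_factory=dict)


@dataclass
class AttributedGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    arcs: Dict[str, Arc] = field(default_factory=dict)

    def add_node(self, node_id: str, **properties: Value) -> Node:
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        return node

    def add_arc(self, arc_id: str, source: str, target: str,
                **properties: Value) -> Arc:
        # Arcs may only connect nodes that already exist in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        arc = Arc(arc_id, source, target, dict(properties))
        self.arcs[arc_id] = arc
        return arc

    def outgoing(self, node_id: str) -> List[Arc]:
        return [a for a in self.arcs.values() if a.source == node_id]


g = AttributedGraph()
g.add_node("n1", name="Alice", age=34)
g.add_node("n2", name="Bob")
g.add_arc("a1", "n1", "n2", since=2015)
g.add_arc("a2", "n1", "n2", since=2020)
```

Because each arc has its own identifier, this sketch allows several arcs between the same pair of nodes, each with its own properties. The properties are stored on the nodes and arcs themselves and are not additional arcs of the graph.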

Formal definition

Building upon widely adopted definitions, [2][3] a property graph/attributed graph can be defined by a 7-tuple (N, A, P, V, α, , π), where

A complementary construct, used in several implementations of property graphs with commercial graph databases, is that of labels, which can be associated both with nodes and arcs of the graph. Labels have a practical rather than theoretical justification, as they were originally intended for users of Entity-Relationship models and relational databases, to facilitate the import of their legacy data sets into graph databases: labels make it possible to associate the same identifier (that of the relational table, or of the ER entity) with all graph nodes that correspond to the different rows of this relational table, or to instances of the same generic entity/class. With the proposed definition, these labels could in fact be viewed as attributes defined only by a key, without an associated value (this is why the association of keys with nodes and arcs is defined separately as a binary relation, and π as a partial function). The basic definition thus becomes much clearer and simpler, and satisfies a principle of parsimony. Alternatively, and more consistently, labels can be defined through type graphs, as special types associated with nodes and arcs.
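
The idea that a label is simply a key without a value can be illustrated with a short sketch. It assumes that the association of keys with graph elements is stored as a set of pairs (a binary relation) and that values are stored in a dictionary defined only for some of those pairs (a partial function). The class `KeyedElements` and its method names are hypothetical, introduced only for illustration.

```python
from typing import Dict, Set, Tuple, Union

Value = Union[int, float, str]
Element = str  # identifier of a node or of an arc


class KeyedElements:
    def __init__(self) -> None:
        # Binary relation: which keys are attached to which element.
        self.has_key: Set[Tuple[Element, str]] = set()
        # Partial function: defined only for pairs that carry a value.
        self.value_of: Dict[Tuple[Element, str], Value] = {}

    def set_property(self, element: Element, key: str, value: Value) -> None:
        self.has_key.add((element, key))
        self.value_of[(element, key)] = value

    def add_label(self, element: Element, label: str) -> None:
        # A label is a key with no associated value.
        self.has_key.add((element, label))

    def labels(self, element: Element) -> Set[str]:
        return {k for (e, k) in self.has_key
                if e == element and (e, k) not in self.value_of}

    def properties(self, element: Element) -> Dict[str, Value]:
        return {k: v for (e, k), v in self.value_of.items() if e == element}


store = KeyedElements()
store.add_label("n1", "Person")
store.set_property("n1", "name", "Alice")
store.add_label("a1", "KNOWS")
store.set_property("a1", "since", 2015)
```

In this representation no separate labeling mechanism is needed: `labels` and `properties` are two views of the same relation, distinguished only by whether a value is defined.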

Relations with other models

Graph theory and classical graph algorithmics

Attributed graphs, as defined above, are especially useful and relevant in that they provide an "umbrella" hypernymic concept (i.e. a common generalization) for several key graph-theoretic models, which have long been widely used in classical graph algorithms.

Knowledge graphs and RDF graphs

Knowledge graphs, usually represented as RDF graphs, are in fact hybrid labeled graphs, whose node labels correspond to instance identifiers (IRIs) or literals, and whose edge labels identify types (not instances) of predicates. They have now acquired a visibility which tends to obscure the longer-established use of graphs as a direct model for systems of all kinds. [4] Attributed graphs are, by their versatility and expressivity, the best suited to this type of modeling, where graphs which can rightly be called cyber-physical do not merely capture weakly structured information about a physical system, as would be the case with a knowledge graph, but attempt to directly capture the structure of a physical system, as matched by the connectivity structure of the graph. In contrast, an RDF graph would mix structural relationships with attached properties, and category/class information with instances/individuals, drowning out the structure. The expressivity of attributed graphs, on the level of higher-order logic, is also far above that of RDF graphs, which is limited to first-order logic. Properties of relationships, which are at the heart of the attributed graph model, require a very cumbersome reification process to be expressed in RDF.
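
The cost of reification can be made concrete with a rough sketch. It assumes the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) and represents triples as plain Python tuples rather than using an RDF library. The function `reify`, the `ex:` prefix and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Value = Union[int, float, str]
Triple = Tuple[str, str, Value]


def reify(arc_id: str, source: str, predicate: str, target: str,
          properties: Dict[str, Value]) -> List[Triple]:
    """Express one arc with properties as RDF-style triples."""
    stmt = f"_:{arc_id}"  # blank node standing for the statement itself
    triples: List[Triple] = [
        (source, predicate, target),          # the relationship proper
        (stmt, "rdf:type", "rdf:Statement"),  # reification quad
        (stmt, "rdf:subject", source),
        (stmt, "rdf:predicate", predicate),
        (stmt, "rdf:object", target),
    ]
    # Each property of the relationship becomes one more triple
    # attached to the statement node.
    for key, value in properties.items():
        triples.append((stmt, f"ex:{key}", value))
    return triples


# In an attributed graph this is a single arc carrying two properties.
arc = ("a1", "ex:pumpA", "ex:feeds", "ex:tankB",
       {"flowRate": 12.5, "since": 2015})

triples = reify(*arc)
```

Under these assumptions, one arc with two properties expands into seven triples, and the structural relationship is no longer distinguishable at a glance from the bookkeeping triples around it.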

Standardization

NGSI-LD

The NGSI-LD data model specified by ETSI was the first attempt to standardize property graphs under a de jure standards body. Compared to the basic model defined here, the NGSI-LD meta-model adds a formal definition of basic categories (entity, relation, property) on the basis of Semantic Web standards (OWL, RDFS, RDF), which makes it possible to convert all data represented in NGSI-LD into RDF datasets, through JSON-LD serialization. NGSI-LD entities, relations and properties are thus defined by reference to types which can themselves be defined by reference to ontologies, thesauri, taxonomies or microdata vocabularies, for the purpose of ensuring the semantic interoperability of the corresponding information.

GQL

The ISO/IEC JTC 1/SC 32/WG 3 group, which established the SQL standard, has specified a query language suitable for graph-oriented databases, called GQL (Graph Query Language), published as ISO/IEC 39075 in April 2024. This standard includes the specification of a property graph data model along the lines of the basic model described here, adding notions such as labels and types.

Type graphs and schemas

Graph-oriented databases are, compared to relational databases, touted for not requiring the prior definition of a schema before the database can be populated. This is desirable and suitable for environments and applications where one operates under an open-world assumption, such as the description of complex systems and systems of systems, characterized by bottom-up organization and evolution rather than control by a single stakeholder. However, even in such environments, it may be necessary to constrain the representation of specific subsets of the information entered into the database, in a way that may resemble a traditional database schema, while keeping the overall graph open to the addition of unforeseen data or configurations. For example, the description of a smart city falls under the open-world assumption and will be described by the upper level of a graph database, without a schema. However, specific technical sub-systems of this city remain top-down closed-world systems managed by a single operator, who may impose a stronger structuring of information, as customarily represented by a schema.

The notions of "type graphs" and schemas [2] make it possible to meet this need, with types playing a role similar to that of labels in classical graph databases, but with the added possibility of specifying relations between these types and constraining them by keys and properties. The type graph is itself a property graph, linked by a relation of graph homomorphism with the graphs of instances that use the types it defines, playing a role similar to that of a schema in a data definition language.
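
The homomorphism condition can be illustrated with a deliberately simplified check. It assumes that each instance node is mapped to exactly one type and that each arc carries a type name, and it ignores the constraints on keys and properties that a full type graph could also express. The function `conforms` and the smart-city sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

InstanceArc = Tuple[str, str, str]  # (source node, arc type, target node)
TypeArc = Tuple[str, str, str]      # (source type, arc type, target type)


def conforms(node_types: Dict[str, str],
             instance_arcs: List[InstanceArc],
             type_arcs: Set[TypeArc]) -> bool:
    """Check that mapping each node to its type preserves every arc.

    An instance arc is accepted only if the type graph contains an arc
    of the same type between the types of its two endpoints.
    """
    for source, arc_type, target in instance_arcs:
        if source not in node_types or target not in node_types:
            return False  # untyped endpoint: the mapping is undefined
        image = (node_types[source], arc_type, node_types[target])
        if image not in type_arcs:
            return False
    return True


# Type graph: sensors are installed in buildings.
type_arcs = {("Sensor", "installedIn", "Building")}

# Instance graph using those types.
node_types = {"s1": "Sensor", "s2": "Sensor", "b1": "Building"}
good_arcs = [("s1", "installedIn", "b1"), ("s2", "installedIn", "b1")]
bad_arcs = [("b1", "installedIn", "s1")]  # direction not allowed by the types

ok = conforms(node_types, good_arcs, type_arcs)
rejected = conforms(node_types, bad_arcs, type_arcs)
```

A check of this kind could be applied only to the sub-graphs that are meant to follow a schema, leaving the rest of the graph unconstrained.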

The ontologies, thesauri or taxonomies used to reference NGSI-LD types are also defined by graphs, but these are RDF graphs rather than property graphs, and they typically have broader scopes than database schemas. The complementary use, possible with NGSI-LD types, of type graphs and references to external ontologies makes it possible to enforce strong data structuring and consistency, while affording semantic grounding and interoperability.

References

  1. Angles, Renzo (2012-04-01). "A comparison of current graph database models". International Conference on Data Engineering. IEEE.
  2. Bonifati, Angela; Furniss, Peter; Green, Alastair; Harmer, Russ; Oshurko, Eugenia; Voigt, Hannes (2019), Laender, Alberto H. F.; Pernici, Barbara; Lim, Ee-Peng; de Oliveira, José Palazzo M. (eds.), "Schema Validation and Evolution for Graph Databases", Conceptual Modeling, vol. 11788, Cham: Springer International Publishing, pp. 448–456, arXiv:1902.06427, doi:10.1007/978-3-030-33223-5_37, ISBN 978-3-030-33222-8, retrieved 2021-09-15
  3. Gutierrez, Claudio; Hidders, Jan; Wood, Peter T. (2018), "Graph Data Models", in Sakr, Sherif; Zomaya, Albert (eds.), Encyclopedia of Big Data Technologies, Cham: Springer International Publishing, pp. 1–6, doi:10.1007/978-3-319-63962-8_81-1, ISBN 978-3-319-63962-8, retrieved 2021-09-15
  4. Privat, Gilles; Abbas, Abdullah. "'Cyber-Physical Graphs' vs. RDF graphs". W3C Workshop on Web Standardization for Graph Data, Berlin, March 2019.