Property graph

A property graph, labeled property graph, or attributed graph is a data model of various graph-oriented databases, [1] where pairs of entities are associated by directed relationships, and entities and relationships can have properties.

In graph theory terms, a property graph is a directed multigraph, whose vertices represent entities and arcs represent relationships. Each arc has an identifier, a source node and a target node, and may have properties.

Properties are key-value pairs where keys are character strings and values are numbers or character strings. They are analogous to attributes in entity-attribute-value and object-oriented modeling. By contrast, in RDF graphs, "properties" is the term for the arcs. This is why a clearer name is attributed graphs, or graphs with properties.

This data model emerged in the early 2000s.

Formal definition

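Building upon widely adopted definitions, [2] [3] a property graph/attributed graph can be defined by a 7-tuple (N, A, K, V, α, κ, π), where

- N is the set of nodes (vertices) of the graph;
- A is the set of arcs (directed edges) of the graph;
- K is a set of keys defining the attributes/properties;
- V is a set of values that can be associated with these keys;
- α: A → N × N is a total function assigning to each arc its ordered pair of source and target nodes;
- κ ⊆ (N ∪ A) × K is a binary relation associating keys with nodes and arcs;
- π: κ → V is a partial function associating values with the keys of nodes or arcs.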

A complementary construct, used in several implementations of property graphs in commercial graph databases, is that of labels, which can be associated with both nodes and arcs of the graph. Labels have a practical rather than theoretical justification: they were originally intended for users of entity-relationship (ER) models and relational databases, to facilitate the import of their legacy data sets into graph databases. Labels make it possible to associate the same identifier (that of the relational table, or of the ER entity) with all graph nodes that correspond to the different rows of this relational table, or to instances of the same generic entity or class. With the proposed definition, these labels can be viewed as attributes defined only by a key, without an associated value (this is why the key-assignment relation κ is defined separately as a binary relation, and π as a partial function). The basic definition thus becomes simpler and satisfies a principle of parsimony. Alternatively, and more consistently, labels can be defined through type graphs, as special types associated with nodes and arcs.
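The definition above can be sketched as a small in-memory data structure. This is only an illustration under stated assumptions, not a reference implementation: the class `PropertyGraph` and its method names are hypothetical, `kappa` is the name assumed here for the binary relation that attaches keys to nodes and arcs, and node and arc identifiers are assumed to be distinct from each other. The sets K and V are left implicit as whatever keys and values have been used.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Illustrative (hypothetical) encoding of the tuple described above."""
    nodes: set = field(default_factory=set)    # N
    arcs: set = field(default_factory=set)     # A
    alpha: dict = field(default_factory=dict)  # arc id -> (source, target), total on A
    kappa: set = field(default_factory=set)    # pairs (node or arc, key)
    pi: dict = field(default_factory=dict)     # (node or arc, key) -> value, partial

    def add_node(self, node):
        self.nodes.add(node)

    def add_arc(self, arc, source, target):
        # Arcs carry their own identifier, so parallel arcs are allowed (multigraph).
        if source not in self.nodes or target not in self.nodes:
            raise ValueError("both endpoints must be existing nodes")
        self.arcs.add(arc)
        self.alpha[arc] = (source, target)

    def set_key(self, element, key, value=None):
        # A key without a value behaves like a label; with a value, like a property.
        if element not in self.nodes and element not in self.arcs:
            raise ValueError("unknown node or arc")
        self.kappa.add((element, key))
        if value is not None:
            self.pi[(element, key)] = value

    def labels(self, element):
        # Keys attached to the element for which pi is undefined.
        return {k for (e, k) in self.kappa if e == element and (e, k) not in self.pi}


g = PropertyGraph()
g.add_node("alice")
g.add_node("bob")
g.add_arc("a1", "alice", "bob")
g.add_arc("a2", "alice", "bob")        # second arc between the same pair of nodes
g.set_key("alice", "Person")           # label: key only
g.set_key("alice", "age", 35)          # property: key and value
g.set_key("a1", "since", "2019")       # arcs can carry properties too
```

Treating labels as value-less keys keeps a single mechanism for both labels and properties, which is the parsimony argument made in the paragraph above.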

Relations with other models

Graph theory and classical graph algorithms

Attributed graphs are especially useful and relevant in that they are an "umbrella" hypernymic concept (i.e. a generalization) for several key graph-theoretic models that have long been widely used in classical graph algorithms.

Knowledge graphs and RDF graphs

Knowledge graphs, usually represented in RDF, are hybrid labeled graphs, whose node labels correspond to instance identifiers (IRIs) or literals, and whose edge labels identify types (not instances) of predicates. They have now acquired a visibility that tends to obscure the longer-established use of graphs as a direct model for systems of all kinds. [4] They are less versatile and expressive than attributed graphs.

Knowledge graphs capture weakly structured information about a physical system. They mix structural relationships with attached properties, and category information with instances, drowning out the structure. By contrast, graphs whose connections capture the structure of a physical system can be called cyber-physical.

Also, RDF graphs can only express first-order logic, while attributed graphs can express higher-order logic. Representing relationship properties in RDF requires a cumbersome reification process.
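The cost of reification can be illustrated with plain Python tuples standing in for RDF triples. This is a minimal sketch, not tied to any RDF library: the `http://example.org/` namespace, the function `reify_arc`, and the sample arc are assumptions made for the example, while `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` are the standard RDF reification vocabulary.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for the example


def reify_arc(arc):
    """Turn one property-graph arc into RDF-style triples using reification."""
    s = EX + arc["source"]
    p = EX + arc["label"]
    o = EX + arc["target"]
    stmt = EX + arc["id"]  # resource standing for the statement itself
    triples = [
        (s, p, o),                              # the relationship
        (stmt, RDF + "type", RDF + "Statement"),
        (stmt, RDF + "subject", s),
        (stmt, RDF + "predicate", p),
        (stmt, RDF + "object", o),
    ]
    # Each arc property becomes one more triple about the statement resource.
    for key, value in arc["properties"].items():
        triples.append((stmt, EX + key, value))
    return triples


# In a property graph this is a single arc carrying one property.
arc = {
    "id": "a1",
    "source": "alice",
    "target": "bob",
    "label": "knows",
    "properties": {"since": "2019"},
}
triples = reify_arc(arc)
```

In this sketch, one arc with one property expands into six triples, whereas the property graph stores the property directly on the arc.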

Standardization

NGSI-LD

The NGSI-LD data model specified by ETSI was the first attempt to standardize property graphs under a de jure standards body. Compared to the basic model defined here, the NGSI-LD meta-model adds a formal definition of basic categories (entity, relation, property) on the basis of semantic web standards (OWL, RDFS, RDF), which makes it possible to convert all data represented in NGSI-LD into RDF datasets, through JSON-LD serialization. NGSI-LD entities, relations and properties are thus defined by reference to types, which can themselves be defined by reference to ontologies, thesauri, taxonomies or microdata vocabularies, for the purpose of ensuring the semantic interoperability of the corresponding information.
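As a rough illustration of how a node with properties and outgoing relationships might be laid out in an NGSI-LD-style JSON document, the sketch below builds a simplified entity as a Python dictionary. It is an assumption-laden simplification: the helper `to_entity`, the identifiers and the attribute names are hypothetical, and the JSON-LD `@context` and other details required by the ETSI specification are deliberately omitted.

```python
import json


def to_entity(entity_id, entity_type, properties, relationships):
    """Build a simplified NGSI-LD-style entity (no @context, illustration only)."""
    entity = {"id": entity_id, "type": entity_type}
    # Each property becomes an attribute object carrying a value.
    for key, value in properties.items():
        entity[key] = {"type": "Property", "value": value}
    # Each relationship becomes an attribute object pointing at another entity.
    for key, target_id in relationships.items():
        entity[key] = {"type": "Relationship", "object": target_id}
    return entity


entity = to_entity(
    "urn:ngsi-ld:Sensor:s1",                      # hypothetical identifier
    "Sensor",
    {"temperature": 21.5},
    {"installedIn": "urn:ngsi-ld:Room:r1"},       # hypothetical target entity
)
print(json.dumps(entity, indent=2))
```

The entity type and the attribute names are where references to shared ontologies or vocabularies would come in; the JSON-LD context that resolves them is left out of this sketch.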

GQL

The ISO/IEC JTC 1/SC 32/WG 3 working group, which established the SQL standard, has specified a query language for graph-oriented databases called GQL (Graph Query Language), published as ISO/IEC 39075 in April 2024. The standard includes the specification of a property graph data model, broadly along the lines of the basic model described here, with added notions such as labels and graph types.

Type graphs and schemas

Compared to relational databases, graph-oriented databases are touted for not requiring the prior definition of a schema before data can be added. This is desirable and suitable for environments and applications where one operates under an open-world assumption, such as the description of complex systems and systems of systems, characterized by bottom-up organization and evolution rather than control by a single stakeholder. However, even in such environments, it may be necessary to constrain the representation of specific subsets of the information entered into the database, in a way that may resemble a traditional database schema, while keeping the overall graph open to the addition of unforeseen data or configurations. For example, the description of a smart city falls under the open-world assumption and will be described by the upper level of a graph database, without a schema. However, specific technical sub-systems of this city remain top-down closed-world systems managed by a single operator, who may impose a stronger structuring of information, as customarily represented by a schema.

The notions of "type graphs" and schemas [2] make it possible to meet this need, with types playing a role similar to that of labels in classical graph databases, but with the added possibility of specifying relations between these types and constraining them by keys and properties. The type graph is itself a property graph, linked by a relation of graph homomorphism with the graphs of instances that use the types it defines, playing a role similar to that of a schema in a data definition language.
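The homomorphism condition can be made concrete with a small check. The sketch below is one possible reading of it, using plain dictionaries: an instance graph conforms to a type graph if every instance arc is mapped to a type arc whose endpoints are the types of the instance arc's own endpoints. The function `conforms` and the sensor/room example are hypothetical, and constraints on keys and properties are not covered.

```python
def conforms(instance_arcs, type_arcs, node_type, arc_type):
    """Check that the typing maps form a graph homomorphism.

    instance_arcs: instance arc id -> (source node, target node)
    type_arcs:     type arc id     -> (source type, target type)
    node_type:     instance node   -> type-graph node
    arc_type:      instance arc id -> type-graph arc id
    """
    for arc, (source, target) in instance_arcs.items():
        t_arc = arc_type.get(arc)
        if t_arc not in type_arcs:
            return False  # untyped arc, or type not declared in the type graph
        if source not in node_type or target not in node_type:
            return False  # untyped endpoint
        # Endpoints must be preserved: types of (source, target) match the type arc.
        if type_arcs[t_arc] != (node_type[source], node_type[target]):
            return False
    return True


# Type graph: a Sensor may be "installedIn" a Room.
type_arcs = {"installedIn": ("Sensor", "Room")}
node_type = {"s1": "Sensor", "r1": "Room"}

good = conforms({"a1": ("s1", "r1")}, type_arcs, node_type, {"a1": "installedIn"})
bad = conforms({"a2": ("r1", "s1")}, type_arcs, node_type, {"a2": "installedIn"})
```

Here `good` holds because the arc runs from a Sensor to a Room, while `bad` fails because the same arc type is used in the reverse direction. A database could run such a check only on the sub-graphs that need schema-like constraints and leave the rest of the graph open.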

The ontologies, thesauri or taxonomies used to reference NGSI-LD types are also defined by graphs, but these are RDF graphs rather than property graphs, and they typically have broader scopes than database schemas. With NGSI-LD types, type graphs can be used together with references to external ontologies, which makes it possible to enforce strong data structuring and consistency while affording semantic grounding and interoperability.

References

  1. Angles, Renzo (2012-04-01). "A comparison of current graph database models". International Conference on Data Engineering. IEEE.
  2. Bonifati, Angela; Furniss, Peter; Green, Alastair; Harmer, Russ; Oshurko, Eugenia; Voigt, Hannes (2019), Laender, Alberto H. F.; Pernici, Barbara; Lim, Ee-Peng; de Oliveira, José Palazzo M. (eds.), "Schema Validation and Evolution for Graph Databases", Conceptual Modeling, vol. 11788, Cham: Springer International Publishing, pp. 448–456, arXiv:1902.06427, doi:10.1007/978-3-030-33223-5_37, ISBN 978-3-030-33222-8, retrieved 2021-09-15
  3. Gutierrez, Claudio; Hidders, Jan; Wood, Peter T. (2018), "Graph Data Models", in Sakr, Sherif; Zomaya, Albert (eds.), Encyclopedia of Big Data Technologies, Cham: Springer International Publishing, pp. 1–6, doi:10.1007/978-3-319-63962-8_81-1, ISBN 978-3-319-63962-8, retrieved 2021-09-15
  4. Privat, Gilles; Abbas, Abdullah. "'Cyber-Physical Graphs' vs. RDF graphs". W3C Workshop on Web Standardization for Graph Data, Berlin, March 2019.