Gremlin (query language)

Last updated
Gremlin
Gremlin (programming language).png
Designed by Marko A. Rodriguez
Developer Apache TinkerPop of the Apache Software Foundation
First appeared2009;15 years ago (2009)
Stable release
3.7.0 / 31 July 2023;5 months ago (2023-07-31) [1]
OS Cross-platform (multi-platform)
License Apache License 2.0
Website tinkerpop.apache.org
Dialects
GremlinJava8, GremlinGroovy, GremlinPython, GremlinScala, GremlinClojure, GremlinPHP, GremlinJavaScript, GremlinTypeset
Influenced by
Regular expression, XPath, Ripple, SPARQL, SQL, Java/JVM

Gremlin is a graph traversal language and virtual machine developed by Apache TinkerPop of the Apache Software Foundation. Gremlin works for both OLTP-based graph databases as well as OLAP-based graph processors. Gremlin's automata and functional language foundation enable Gremlin to naturally support: imperative and declarative querying; host language agnosticism; user-defined domain specific languages; an extensible compiler/optimizer, single- and multi-machine execution models; hybrid depth- and breadth-first evaluation with Turing completeness. [2]

Contents

As an explanatory analogy, Apache TinkerPop and Gremlin are to graph databases what the JDBC and SQL are to relational databases. Likewise, the Gremlin traversal machine is to graph computing as what the Java virtual machine is to general purpose computing. [3]

History

Vendor integration

Gremlin is an Apache2-licensed graph traversal language that can be used by graph system vendors. There are typically two types of graph system vendors: OLTP graph databases and OLAP graph processors. The table below outlines those graph vendors that support Gremlin.

VendorGraph System
Neo4j graph database
OrientDB graph database
DataStax Enterprise (5.0+)graph database
Hadoop (Giraph)graph processor
Hadoop (Spark)graph processor
InfiniteGraph graph database
JanusGraph graph database
Cosmos DB graph database
Amazon Neptune graph database
ArcadeDB graph database

Traversal examples

The following examples of Gremlin queries and responses in a Gremlin-Groovy environment are relative to a graph representation of the MovieLens dataset. [4] The dataset includes users who rate movies. Users each have one occupation, and each movie has one or more categories associated with it. The MovieLens graph schema is detailed below.

user--rated[stars:0-5]-->movieuser--occupation-->occupationmovie--category-->category

Simple traversals

For each vertex in the graph, emit its label, then group and count each distinct label.

gremlin>g.V().label().groupCount()==>[occupation:21,movie:3883,category:18,user:6040]

What year was the oldest movie made?

gremlin>g.V().hasLabel('movie').values('year').min()==>1919

What is Die Hard's average rating?

gremlin>g.V().has('movie','name','Die Hard').inE('rated').values('stars').mean()==>4.121848739495798

Projection traversals

For each category, emit a map of its name and the number of movies it represents.

gremlin>g.V().hasLabel('category').as('a','b').select('a','b').by('name').by(inE('category').count())==>[a:Animation,b:105]==>[a:Children's,b:251]==>[a:Comedy,b:1200]==>[a:Adventure,b:283]==>[a:Fantasy,b:68]==>[a:Romance,b:471]==>[a:Drama,b:1603]==>[a:Action,b:503]==>[a:Crime,b:211]==>[a:Thriller,b:492]==>[a:Horror,b:343]==>[a:Sci-Fi,b:276]==>[a:Documentary,b:127]==>[a:War,b:143]==>[a:Musical,b:114]==>[a:Mystery,b:106]==>[a:Film-Noir,b:44]==>[a:Western,b:68]

For each movie with at least 11 ratings, emit a map of its name and average rating. Sort the maps in decreasing order by their average rating. Emit the first 10 maps (i.e. top 10).

gremlin>g.V().hasLabel('movie').as('a','b').where(inE('rated').count().is(gt(10))).select('a','b').by('name').by(inE('rated').values('stars').mean()).order().by(select('b'),decr).limit(10)==>[a:Sanjuro,b:4.608695652173913]==>[a:SevenSamurai(TheMagnificentSeven),b:4.560509554140127]==>[a:ShawshankRedemption,The,b:4.554557700942973]==>[a:Godfather,The,b:4.524966261808367]==>[a:CloseShave,A,b:4.52054794520548]==>[a:UsualSuspects,The,b:4.517106001121705]==>[a:Schindler'sList,b:4.510416666666667]==>[a:WrongTrousers,The,b:4.507936507936508]==>[a:SunsetBlvd.(a.k.a.SunsetBoulevard),b:4.491489361702127]==>[a:RaidersoftheLostArk,b:4.47772]

Declarative pattern matching traversals

Gremlin supports declarative graph pattern matching similar to SPARQL. For instance, the following query below uses Gremlin's match()-step.

What 80's action movies do 30-something programmers like? Group count the movies by their name and sort the group count map in decreasing order by value. Clip the map to the top 10 and emit the map entries.

gremlin>g.V().match(__.as('a').hasLabel('movie'),__.as('a').out('category').has('name','Action'),__.as('a').has('year',between(1980,1990)),__.as('a').inE('rated').as('b'),__.as('b').has('stars',5),__.as('b').outV().as('c'),__.as('c').out('occupation').has('name','programmer'),__.as('c').has('age',between(30,40))).select('a').groupCount().by('name').order(local).by(valueDecr).limit(local,10)==>RaidersoftheLostArk=26==>StarWarsEpisodeV-TheEmpireStrikesBack=26==>Terminator,The=23==>StarWarsEpisodeVI-ReturnoftheJedi=22==>PrincessBride,The=19==>Aliens=18==>Boat,The(DasBoot)=11==>IndianaJonesandtheLastCrusade=11==>StarTrekTheWrathofKhan=10==>Abyss,The=9

OLAP traversal

Which movies are most central in the implicit 5-stars graph?

gremlin>g=graph.traversal(computer(SparkGraphComputer))==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat],sparkgraphcomputer]gremlin>g.V().repeat(outE('rated').has('stars',5).inV().groupCount('m').by('name').inE('rated').has('stars',5).outV()).times(4).cap('m')==>StarWarsEpisodeIV-ANewHope35405394353105332==>AmericanBeauty31943228282020585==>RaidersoftheLostArk31224779793238499==>StarWarsEpisodeV-TheEmpireStrikesBack30434677119726223==>Godfather,The30258518523013057==>ShawshankRedemption,The28297717387901031==>Schindler'sList27539336654199309==>SilenceoftheLambs,The26736276376806173==>Fargo26531050311325270==>Matrix,The26395118239203191

Gremlin graph traversal machine

Gremlin is a virtual machine composed of an instruction set as well as an execution engine. An analogy is drawn between Gremlin and Java.

Java EcosystemGremlin Ecosystem
Apache Groovy programming language Gremlin-Groovy
Scala programming language Gremlin-Scala
Clojure programming language Gremlin-Clojure
......
Java programming language Gremlin-Java8
Java instruction setGremlin step library
Java virtual machine Gremlin traversal machine

Gremlin steps (instruction set)

The following traversal is a Gremlin traversal in the Gremlin-Java8 dialect.

g.V().as("a").out("knows").as("b").select("a","b").by("name").by("age")

The Gremlin language (i.e. the fluent-style of expressing a graph traversal) can be represented in any host language that supports function composition and function nesting. Due to this simple requirement, there exists various Gremlin dialects including Gremlin-Groovy, Gremlin-Scala, Gremlin-Clojure, etc. The above Gremlin-Java8 traversal is ultimately compiled down to a step sequence called a traversal. A string representation of the traversal above provided below.

[GraphStep([],vertex)@[a],VertexStep(OUT,[knows],vertex)@[b],SelectStep([a,b],[value(name),value(age)])]

The steps are the primitives of the Gremlin graph traversal machine. They are the parameterized instructions that the machine ultimately executes. The Gremlin instruction set is approximately 30 steps. These steps are sufficient to provide general purpose computing and what is typically required to express the common motifs of any graph traversal query.

Given that Gremlin is a language, an instruction set, and a virtual machine, it is possible to design another traversal language that compiles to the Gremlin traversal machine (analogous to how Scala compiles to the JVM). For instance, the popular SPARQL graph pattern match language can be compiled to execute on the Gremlin machine. The following SPARQL query

SELECT?a?b?cWHERE{?aaPerson.?aex:knows?b.?aex:created?c.?bex:created?c.?bex:age?d.FILTER(?d<30)}

would compile to

[GraphStep([],vertex),MatchStep(AND,[[MatchStartStep(a),LabelStep,IsStep(eq(Person)),MatchEndStep],[MatchStartStep(a),VertexStep(OUT,[knows],vertex),MatchEndStep(b)],[MatchStartStep(a),VertexStep(OUT,[created],vertex),MatchEndStep(c)],[MatchStartStep(b),VertexStep(OUT,[created],vertex),MatchEndStep(c)],[MatchStartStep(b),PropertiesStep([age],value),MatchEndStep(d)],[MatchStartStep(d),IsStep(gt(30)),MatchEndStep]]),SelectStep([a,b,c])].

In Gremlin-Java8, the SPARQL query above would be represented as below and compile to the identical Gremlin step sequence (i.e. traversal).

g.V().match(as("a").label().is("person"),as("a").out("knows").as("b"),as("a").out("created").as("c"),as("b").out("created").as("c"),as("b").values("age").as("d"),as("d").is(gt(30))).select("a","b","c")

Gremlin Machine (virtual machine)

The Gremlin graph traversal machine can execute on a single machine or across a multi-machine compute cluster. Execution agnosticism allows Gremlin to run over both graph databases (OLTP) and graph processors (OLAP).

See also

Related Research Articles

Online analytical processing, or OLAP, is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">Directed acyclic graph</span> Directed graph with no directed cycles

In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions will never form a closed loop. A directed graph is a DAG if and only if it can be topologically ordered, by arranging the vertices as a linear ordering that is consistent with all edge directions. DAGs have numerous scientific and computational applications, ranging from biology to information science to computation (scheduling).

A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on strict theory to retrieve information. A well known example is the Structured Query Language (SQL).

In computer science, graph transformation, or graph rewriting, concerns the technique of creating a new graph out of an original graph algorithmically. It has numerous applications, ranging from software engineering to layout algorithms and picture generation.

SPARQL is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 was acknowledged by W3C as an official recommendation, and SPARQL 1.1 in March, 2013.

In graph theory, reachability refers to the ability to get from one vertex to another within a graph. A vertex can reach a vertex if there exists a sequence of adjacent vertices which starts with and ends with .

Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information, documents of all sorts, contacts, search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.

<span class="mw-page-title-main">Deterministic acyclic finite state automaton</span>

In computer science, a deterministic acyclic finite state automaton (DAFSA), also called a directed acyclic word graph is a data structure that represents a set of strings, and allows for a query operation that tests whether a given string belongs to the set in time proportional to its length. Algorithms exist to construct and maintain such automata, while keeping them minimal.

A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are perpetually stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data.

The Perl virtual machine is a stack-based process virtual machine implemented as an opcodes interpreter which runs previously compiled programs written in the Perl language. The opcodes interpreter is a part of the Perl interpreter, which also contains a compiler in one executable file, commonly /usr/bin/perl on various Unix-like systems or perl.exe on Microsoft Windows systems.

GeoSPARQL is a standard for representation and querying of geospatial linked data for the Semantic Web from the Open Geospatial Consortium (OGC). The definition of a small ontology based on well-understood OGC standards is intended to provide a standardized exchange basis for geospatial RDF data which can support both qualitative and quantitative spatial reasoning and querying with the SPARQL database query language.

Cypher is a declarative graph query language that allows for expressive and efficient data querying in a property graph.

Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL. Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce enabling the building of low latency applications on top of NoSQL stores.

Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results or to answer more fuzzy and wide open questions through pattern matching and digital reasoning.

Shapes Constraint Language (SHACL) is a World Wide Web Consortium (W3C) standard language for describing Resource Description Framework (RDF) graphs. SHACL has been designed to enhance the semantic and technical interoperability layers of ontologies expressed as RDF graphs.

Amazon Neptune is a managed graph database product published by Amazon.com. It is used as a web service and is part of Amazon Web Services (AWS). It was announced on November 29, 2017. Amazon Neptune supports popular graph models property graph and W3C's RDF, and their respective query languages Apache TinkerPop's Gremlin, openCypher, and SPARQL, including other Amazon Web Services products.

NitrosBase is a Russian high-performance multi-model database system. The database system supports relational, graph and document database models.

<span class="mw-page-title-main">Blazegraph</span> Open source triplestore and graph database

Blazegraph is an open source triplestore and graph database, developed by Systap, which is used in the Wikidata SPARQL endpoint and by other large customers. It is licensed under the GNU GPL.

<span class="mw-page-title-main">JanusGraph</span> Graph database

JanusGraph is an open source, distributed graph database under The Linux Foundation. JanusGraph is available under the Apache License 2.0. The project is supported by IBM, Google, Hortonworks and Grakn Labs.

GQL is a proposed standard graph query language. In September 2019 a proposal for a project to create a new standard graph query language was approved by a vote of national standards bodies which are members of ISO/IEC Joint Technical Committee 1(ISO/IEC JTC 1). JTC 1 is responsible for international Information Technology standards. GQL is intended to be a declarative database query language, like SQL.

References

  1. "Apache TinkerPop - Downloads" . Retrieved 27 October 2023.
  2. Rodriguez, Marko A. (2015). "The Gremlin graph traversal machine and language (invited talk)". The Gremlin Graph Traversal Machine and Language. pp. 1–10. arXiv: 1508.03843 . doi:10.1145/2815072.2815073. ISBN   9781450339025. S2CID   10533031.
  3. "The Benefits of the Gremlin Graph Traversal Machine". 2015-09-14. Retrieved September 17, 2015.
  4. "The Gremlin Graph Traversal Language". 2015-08-19. Retrieved August 22, 2015.
  1. Apache TinkerPop Homepage
  2. sql2gremlin.com (TinkerPop2)
  3. Rodriguez, M.A., "The Gremlin Graph Traversal Machine and Language," Proceedings of the ACM Database Programming Languages Conference, October, 2015.