Founder(s) | Ramanathan V. Guha
Key people | Prem Ramaswami (Head of Data Commons)
Parent | Google
URL | datacommons.org
Launched | May 2018
Data Commons is an open-source platform [1] created by Google [2] that provides an open knowledge graph, combining economic, scientific and other public datasets into a unified view. [3] Ramanathan V. Guha, a creator of web standards including RDF, [4] RSS, and Schema.org, [5] founded the project, [6] which is now led by Prem Ramaswami. [7]
The Data Commons website was launched in May 2018 with an initial dataset consisting of fact-checking data published in Schema.org "ClaimReview" format by several fact checkers from the International Fact-Checking Network. [8] [9] Google has worked with partners such as the United Nations (UN) to populate the repository, [2] which also includes data from the United States Census, the World Bank, the US Bureau of Labor Statistics, [10] Wikipedia, the National Oceanic and Atmospheric Administration and the Federal Bureau of Investigation. [11]
The service expanded during 2019 to include an RDF-style knowledge graph populated from a number of largely statistical open datasets, and was announced to a wider audience that year. [12] In 2020 the service improved its coverage of non-US datasets and increased its coverage of bioinformatics and coronavirus-related data. [13] In 2023, the service relaunched with a natural-language front end powered by a large language model. [2] It also launched as the back end to the UN data portal for Sustainable Development Goals data. [14]
Data Commons places more emphasis on statistical data than is common for linked data and knowledge graph initiatives. It includes geographical, demographic, weather and real estate data alongside other categories, [3] describing states, Congressional districts, and cities in the United States as well as biological specimens, power plants, and elements of the human genome via the Encyclopedia of DNA Elements (ENCODE) project. [11] It represents data as semantic triples, each of which can have its own provenance. [3] It centers on the entity-oriented integration of statistical observations from a variety of public datasets. Although it supports a subset of the W3C SPARQL query language, [15] its APIs [16] also include tools oriented towards data science, statistics and data visualization, such as a Pandas dataframe interface.
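The following is a minimal sketch of both query styles, assuming the `datacommons` and `datacommons-pandas` Python packages and network access to the public datacommons.org service; the query shape follows the project's documented examples, and an API key may be required depending on service policy.

```python
# Sketch: Data Commons Python APIs (assumes `pip install datacommons
# datacommons-pandas`; an API key may be needed via dc.set_api_key).
import datacommons as dc
import datacommons_pandas as dcpd

# SPARQL-subset query: look up the name of the node "geoId/06"
# (the DCID for California), following the documented query pattern.
query = '''
SELECT ?name WHERE {
  ?place typeOf Place .
  ?place dcid ("geoId/06") .
  ?place name ?name .
}
'''
for row in dc.query(query):
    print(row['?name'])

# Pandas interface: a population time series for the United States
# using the Count_Person statistical variable.
series = dcpd.build_time_series('country/USA', 'Count_Person')
print(series.tail())
```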
Data Commons is integrative, meaning that, rather than providing a hosting platform for diverse datasets, it attempts to consolidate much of the information the datasets provide into a single data graph.
Data Commons is built on a graph data model. The graph can be accessed through a browser interface and several APIs, [3] [11] and is expanded by loading data, typically CSV files paired with MCF-based templates. [17] The graph can also be accessed by natural-language queries in Google Search. [18] The data vocabulary used to define the datacommons.org graph is based upon Schema.org. [3] In particular, the terms StatisticalPopulation [19] and Observation [20] were proposed to Schema.org to support Data Commons-like use cases. [21]
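A small sketch of what the triple-oriented graph looks like from the Python client, again assuming the `datacommons` package and the public endpoint; the DCID `geoId/06` (California) is taken from the project's documentation.

```python
# Sketch: inspecting nodes and triples in the Data Commons graph
# (assumes the `datacommons` package and the public datacommons.org API).
import datacommons as dc

# Fetch Schema.org-style property values for a node.
print(dc.get_property_values(['geoId/06'], 'name'))  # e.g. ['California']

# Fetch raw (subject, predicate, object) triples touching the node.
triples = dc.get_triples(['geoId/06'])
for subj, pred, obj in triples['geoId/06'][:10]:
    print(subj, pred, obj)
```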
Software from the project is available on GitHub under the Apache 2.0 license. [22]
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
The Resource Description Framework (RDF) is a method to describe and exchange graph data. It was originally designed as a data model for metadata by the World Wide Web Consortium (W3C). It provides a variety of syntax notations and data serialization formats, of which the most widely used is Turtle.
Ramanathan V. Guha is the creator of widely used web standards such as RSS, RDF and Schema.org. He is also responsible for products such as Google Custom Search. He was a co-founder of Epinions and Alpiri. He worked at Google for nearly two decades, most recently as a Google Fellow, before announcing his departure from the company in August 2024.
A query language, also known as a data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on a strict formal foundation, such as relational algebra, to retrieve information. A well-known example is the Structured Query Language (SQL).
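As an illustration, the sketch below runs a declarative SQL query against an in-memory SQLite database using Python's standard library; the table and data are invented for the example.

```python
# Sketch: a declarative SQL query with Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE city (name TEXT, population INTEGER)')
conn.executemany('INSERT INTO city VALUES (?, ?)',
                 [('Tokyo', 37_400_000), ('Delhi', 29_400_000)])

# The query states *what* to retrieve, not *how* to fetch it.
for name, pop in conn.execute(
        'SELECT name, population FROM city WHERE population > 30000000'):
    print(name, pop)
```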
RDF Schema (Resource Description Framework Schema, variously abbreviated as RDFS, RDF(S), RDF-S, or RDF/S) is a set of classes and properties built on the RDF extensible knowledge-representation data model, providing basic elements for the description of ontologies. These vocabularies are intended to structure RDF resources. RDF and RDFS data can be saved in a triplestore, from which knowledge can be extracted using a query language such as SPARQL.
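A minimal sketch of RDFS vocabulary in use, assuming the rdflib Python package; the example.org namespace and class names are invented for the example.

```python
# Sketch: declaring an RDFS class hierarchy with rdflib.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace('http://example.org/')
g = Graph()
g.add((EX.Mammal, RDF.type, RDFS.Class))
g.add((EX.Dog, RDF.type, RDFS.Class))
g.add((EX.Dog, RDFS.subClassOf, EX.Mammal))  # Dog is a subclass of Mammal
g.add((EX.Dog, RDFS.label, Literal('Dog')))

# The schema statements are themselves ordinary triples in the graph.
for s, p, o in g:
    print(s, p, o)
```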
SPARQL is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was standardized by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium and is recognized as one of the key technologies of the Semantic Web. SPARQL 1.0 became an official W3C Recommendation on 15 January 2008, and SPARQL 1.1 followed in March 2013.
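The sketch below executes a SPARQL SELECT with rdflib over a small in-memory graph; in practice such queries are usually issued against a triplestore, and the data here is invented.

```python
# Sketch: a SPARQL SELECT over an in-memory RDF graph with rdflib.
from rdflib import Graph

g = Graph()
g.parse(data='''
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
''', format='turtle')

results = g.query('''
PREFIX ex: <http://example.org/>
SELECT ?a ?b WHERE { ?a ex:knows ?b . }
''')
for a, b in results:
    print(a, 'knows', b)
```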
RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information. The library contains parsers and serializers for almost all known RDF serializations, such as RDF/XML, Turtle, N-Triples, and JSON-LD. It also provides both in-memory and persistent graph back-ends for storing RDF data, along with numerous convenience functions for declaring graph namespaces, issuing SPARQL queries, and so on. It is under continuous development; the most recent stable release, rdflib 6.1.1, was released on 20 December 2021. The library was originally created by Daniel Krech, with the first release in November 2002.
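A short sketch of rdflib's parser/serializer support, assuming rdflib 6 or later (where serialize returns a string and JSON-LD support is built in); the Turtle document is invented.

```python
# Sketch: round-tripping RDF between serializations with rdflib.
from rdflib import Graph

turtle_doc = '''
@prefix ex: <http://example.org/> .
ex:datacommons ex:createdBy ex:google .
'''

g = Graph()
g.parse(data=turtle_doc, format='turtle')
print(g.serialize(format='nt'))       # N-Triples
print(g.serialize(format='json-ld'))  # JSON-LD
```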
An RDF query language is a computer language, specifically a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.
Ontotext is a software company that produces software relating to data management. Its main products are GraphDB, an RDF database; and Ontotext Platform, a general data management platform based on knowledge graphs. It was founded in 2000 in Bulgaria, and now has offices internationally. Together with the BBC, Ontotext developed one of the early large-scale industrial semantic applications, Dynamic Semantic Publishing, starting in 2010.
DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
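As a sketch of such a semantic query, the snippet below sends a SPARQL request to DBpedia's public endpoint using only the Python standard library; endpoint availability and response details may vary.

```python
# Sketch: querying the public DBpedia SPARQL endpoint over HTTP.
import json
import urllib.parse
import urllib.request

query = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Semantic_Web> rdfs:label ?label .
} LIMIT 5
'''
url = 'https://dbpedia.org/sparql?' + urllib.parse.urlencode(
    {'query': query, 'format': 'application/sparql-results+json'})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
for binding in data['results']['bindings']:
    print(binding['label']['value'])
```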
Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members. It was an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions. Freebase aimed to create a global resource that allowed people to access common information more effectively. It was developed by the American software company Metaweb and run publicly beginning in March 2007. Metaweb was acquired by Google in a private sale announced on 16 July 2010. Google's Knowledge Graph is powered in part by Freebase.
A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are persistently stored in the database itself. Relationships can be intuitively visualized using graph databases, making them useful for heavily interconnected data.
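The toy sketch below illustrates why relationship lookups are cheap: edges are stored as direct references on each node, so a traversal is a single dictionary access rather than a join (the data model and data are invented for the example).

```python
# Sketch: nodes with properties, edges stored as direct references.
nodes = {
    'alice': {'type': 'Person', 'age': 34},
    'acme':  {'type': 'Company'},
}
edges = {
    'alice': [('WORKS_AT', 'acme')],
    'acme':  [],
}

# One lookup retrieves all of alice's relationships; no join is needed.
for rel, target in edges['alice']:
    print(f"alice -[{rel}]-> {target} {nodes[target]}")
```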
Wikidata is a collaboratively edited multilingual knowledge graph hosted by the Wikimedia Foundation. It is a common source of open data that Wikimedia projects such as Wikipedia, as well as anyone else, can use under the CC0 public domain license. Wikidata is a wiki powered by the MediaWiki software, including Wikibase, its extension for semi-structured data. As of mid-2024, Wikidata had 1.57 billion item statements.
GeoSPARQL is a model for representing and querying geospatial linked data for the Semantic Web. It is standardized by the Open Geospatial Consortium as OGC GeoSPARQL. The definition of a small ontology based on well-understood OGC standards is intended to provide a standardized exchange basis for geospatial RDF data which can support both qualitative and quantitative spatial reasoning and querying with the SPARQL database query language.
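The following is a sketch of a GeoSPARQL query, shown as a string only since running it requires a GeoSPARQL-enabled triplestore; the geo: and geof: namespaces come from the OGC standard, while ex:City and the dataset shape are hypothetical.

```python
# Sketch: a GeoSPARQL query combining graph patterns with a spatial filter.
geosparql_query = '''
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex:   <http://example.org/>

SELECT ?city WHERE {
  ?city a ex:City ;
        geo:hasGeometry/geo:asWKT ?wkt .
  # Quantitative spatial reasoning: keep cities inside a search polygon.
  FILTER(geof:sfWithin(?wkt,
    "POLYGON((-125 32, -125 42, -114 42, -114 32, -125 32))"^^geo:wktLiteral))
}
'''
print(geosparql_query)
```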
Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results or to answer fuzzier, more open-ended questions through pattern matching and digital reasoning.
GraphQL is a data query and manipulation language for APIs that allows a client to specify what data it needs. A GraphQL server can fetch data from separate sources for a single client query and present the results in a unified graph, so it is not tied to any specific database or storage engine.
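A sketch of a GraphQL request from Python follows; the endpoint URL and the place/name/population schema are hypothetical, the point being that the client names exactly the fields it wants.

```python
# Sketch: posting a GraphQL query (hypothetical endpoint and schema).
import json
import urllib.request

query = '''
{
  place(id: "country/USA") {
    name
    population
  }
}
'''
req = urllib.request.Request(
    'https://example.org/graphql',  # hypothetical endpoint
    data=json.dumps({'query': query}).encode(),
    headers={'Content-Type': 'application/json'})
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # only the requested fields come back
```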
Shapes Constraint Language (SHACL) is a World Wide Web Consortium (W3C) standard language for describing Resource Description Framework (RDF) graphs. SHACL has been designed to enhance the semantic and technical interoperability layers of ontologies expressed as RDF graphs.
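A minimal validation sketch, assuming the pyshacl Python package; the shape requires every ex:Person to have at least one ex:name, and the data deliberately violates it.

```python
# Sketch: validating an RDF graph against a SHACL shape with pyshacl.
from pyshacl import validate
from rdflib import Graph

data = Graph().parse(data='''
@prefix ex: <http://example.org/> .
ex:alice a ex:Person .    # has no ex:name, so validation should fail
''', format='turtle')

shapes = Graph().parse(data='''
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:name ; sh:minCount 1 ] .
''', format='turtle')

conforms, _, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False: ex:alice is missing ex:name
print(report_text)
```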
Blazegraph is an open-source triplestore and graph database, developed by Systap, which is used in the Wikidata SPARQL endpoint and by other large customers. It is licensed under the GNU GPL.
TerminusDB is an open-source knowledge graph and document store used to build versioned data products. It is a native revision-control database, architecturally similar to Git, and is listed on DB-Engines.
QLever is an open-source triplestore and graph database developed by a team at the University of Freiburg led by Hannah Bast. QLever performs high-performance queries over Semantic Web knowledge bases, including full-text search within text corpora. A specialized user interface for QLever predictively autocompletes SPARQL queries.