Data Commons

Founder(s): Ramanathan V. Guha
Parent: Google
URL: datacommons.org
Launched: May 2018

Data Commons is an open-source platform [1] created by Google [2] that provides an open knowledge graph, combining economic, scientific and other public datasets into a unified view. [3] Ramanathan V. Guha, a creator of web standards including RDF, [4] RSS, and Schema.org, [5] founded the project. [6]

History

The Data Commons website was launched in May 2018 with an initial dataset consisting of fact-checking data published in Schema.org "ClaimReview" format by several fact checkers from the International Fact-Checking Network. [7] [8] Google has worked with partners such as the United Nations (UN) to populate the repository, [2] which also includes data from the United States Census, the World Bank, the US Bureau of Labor Statistics, [9] Wikipedia, the National Oceanic and Atmospheric Administration and the Federal Bureau of Investigation. [10]

The service expanded during 2019 to include an RDF-style knowledge graph populated from a number of largely statistical open datasets, and was announced to a wider audience that year. [11] In 2020 the service improved its coverage of non-US datasets, while also expanding into bioinformatics and COVID-19 data. [12] In 2023, the service relaunched with a natural-language front end powered by a large language model, [2] and became the back end to the UN data portal for Sustainable Development Goals data. [13]

Features

Data Commons places more emphasis on statistical data than is common for linked data and knowledge graph initiatives. It includes geographical, demographic, weather and real estate data alongside other categories, [3] describing states, Congressional districts, and cities in the United States as well as biological specimens, power plants, and elements of the human genome via the Encyclopedia of DNA Elements (ENCODE) project. [10] It represents data as semantic triples, each of which can have its own provenance. [3] It centers on the entity-oriented integration of statistical observations from a variety of public datasets. Although it supports a subset of the W3C SPARQL query language, [14] its APIs [15] also include tools, such as a Pandas dataframe interface, oriented towards data science, statistics and data visualization.
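The triple-plus-provenance model described above can be sketched in a few lines of Python. This is purely an illustrative sketch, not Data Commons' actual storage code; the entity ID "geoId/06" follows the Data Commons convention for California, but the property names and the unemployment figure are hypothetical example values.

```python
from collections import namedtuple

# A semantic triple (subject, predicate, object) that carries its own
# provenance, mirroring the per-triple sourcing described above.
Triple = namedtuple("Triple", ["subject", "predicate", "obj", "provenance"])

# Illustrative values only; property names and figures are not real data.
graph = [
    Triple("geoId/06", "typeOf", "State", "census.gov"),
    Triple("geoId/06", "name", "California", "census.gov"),
    Triple("geoId/06", "count_Person", 39538223, "census.gov"),
    Triple("geoId/06", "unemploymentRate", 7.3, "bls.gov"),
]

def values_for(subject, predicate):
    """Return (object, provenance) pairs, so every answer stays traceable."""
    return [(t.obj, t.provenance) for t in graph
            if t.subject == subject and t.predicate == predicate]

print(values_for("geoId/06", "name"))  # [('California', 'census.gov')]
```

Because provenance travels with each triple rather than with the dataset as a whole, two facts about the same entity can come from different sources without ambiguity, as the census.gov and bls.gov entries above show.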

Data Commons is integrative, meaning that, rather than providing a hosting platform for diverse datasets, it attempts to consolidate much of the information the datasets provide into a single data graph.

Technology

Data Commons is built on a graph data model. The graph can be accessed through a browser interface and several APIs, [3] [10] and is expanded by loading data, typically CSV files paired with MCF-based templates. [16] The graph can also be accessed via natural-language queries in Google Search. [17] The data vocabulary used to define the datacommons.org graph is based upon Schema.org. [3] In particular, the Schema.org terms StatisticalPopulation [18] and Observation [19] were proposed to Schema.org to support Data Commons-like use cases. [20]
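MCF is a simple line-oriented "property: value" node format, so a minimal reader for it fits in a short function. The sample node below is a hypothetical illustration loosely modeled on Data Commons conventions (the `dcid:`/`dcs:` prefixes and the `Count_Person` variable); consult the Data Commons documentation for the real MCF and template syntax.

```python
# Minimal parser for MCF-style "Node" blocks: each block starts with a
# "Node:" line and is followed by "property: value" lines.
SAMPLE_MCF = """\
Node: dcid:Count_Person
typeOf: dcs:StatisticalVariable
populationType: dcs:Person
measuredProperty: dcs:count
"""

def parse_mcf(text):
    """Split an MCF block into {property: value} dicts, one per Node."""
    nodes, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):  # skip blanks and comments
            continue
        prop, _, value = line.partition(":")   # split at the FIRST colon only,
        prop, value = prop.strip(), value.strip()  # so "dcs:count" stays intact
        if prop == "Node":                     # a new node begins here
            current = {"Node": value}
            nodes.append(current)
        elif current is not None:
            current[prop] = value
    return nodes

print(parse_mcf(SAMPLE_MCF)[0]["typeOf"])  # dcs:StatisticalVariable
```

Splitting only at the first colon is the key detail, since MCF values themselves carry namespaced prefixes such as `dcs:`.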

Software from the project is available on GitHub under the Apache 2.0 license. [21]

Related Research Articles

Semantic Web: Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) standard originally designed as a data model for metadata. It has come to be used as a general method for description and exchange of graph data. RDF provides a variety of syntax notations and data serialization formats, with Turtle currently being the most widely used notation.

Ramanathan V. Guha

Ramanathan V. Guha is the creator of widely used web standards such as RSS, RDF and Schema.org. He is also responsible for products such as Google Custom Search. He was a co-founder of Epinions and Alpiri. He worked at Google for nearly two decades, most recently as a Google Fellow, before announcing his departure from the company in August 2024.

A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. In database systems, query languages rely on strict theory to retrieve information. A well-known example is the Structured Query Language (SQL).

RDF Schema (Resource Description Framework Schema, variously abbreviated as RDFS, RDF(S), RDF-S, or RDF/S) is a set of classes with certain properties using the RDF extensible knowledge representation data model, providing basic elements for the description of ontologies. It uses various forms of RDF vocabularies, intended to structure RDF resources. RDF and RDFS can be saved in a triplestore, from which knowledge can then be extracted using a query language such as SPARQL.

SPARQL is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. SPARQL 1.0 became an official W3C Recommendation on 15 January 2008, followed by SPARQL 1.1 in March 2013.

RDFLib: Python library to serialize, parse and process RDF data

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information. The library contains parsers and serializers for almost all of the known RDF serializations, such as RDF/XML, Turtle, N-Triples, and JSON-LD. It also provides both in-memory and persistent graph back-ends for storing RDF information, along with numerous convenience functions for declaring graph namespaces, issuing SPARQL queries, and so on. It is in continuous development, with the most recent stable release, rdflib 6.1.1, released on 20 December 2021. It was originally created by Daniel Krech, with the first release in November 2002.

Oracle Spatial and Graph, formerly Oracle Spatial, is a free option component of the Oracle Database. The spatial features in Oracle Spatial and Graph help users manage geographic and location data in a native type within an Oracle database, supporting a wide range of applications, from automated mapping, facilities management, and geographic information systems (AM/FM/GIS) to wireless location services and location-enabled e-business. The graph features include Oracle Network Data Model (NDM) graphs, used in traditional network applications by major transportation, telecommunications, utility and energy organizations, and RDF semantic graphs, used in social networks and social interactions and in linking disparate data sets to address requirements from the research, health sciences, finance, media and intelligence communities.

An RDF query language is a computer language, specifically a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.

Ontotext is a software company that produces software relating to data management. Its main products are GraphDB, an RDF database; and Ontotext Platform, a general data management platform based on knowledge graphs. It was founded in 2000 in Bulgaria, and now has offices internationally. Together with the BBC, Ontotext developed one of the early large-scale industrial semantic applications, Dynamic Semantic Publishing, starting in 2010.

DBpedia: Online database project

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members. It was an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions. Freebase aimed to create a global resource that allowed people to access common information more effectively. It was developed by the American software company Metaweb and run publicly beginning in March 2007. Metaweb was acquired by Google in a private sale announced on 16 July 2010. Google's Knowledge Graph is powered in part by Freebase.

A graph database (GDB) is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data. A key concept of the system is the graph, which relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are persistently stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily interconnected data.

Wikidata: Free knowledge database project

Wikidata is a collaboratively edited multilingual knowledge graph hosted by the Wikimedia Foundation. It is a common source of open data that anyone, including Wikimedia projects such as Wikipedia, can use under the CC0 public domain license. Wikidata is a wiki powered by the MediaWiki software, including its extension for semi-structured data, Wikibase. As of mid-2024, Wikidata had 1.57 billion item statements.

GeoSPARQL is a model for representing and querying geospatial linked data for the Semantic Web. It is standardized by the Open Geospatial Consortium as OGC GeoSPARQL. The definition of a small ontology based on well-understood OGC standards is intended to provide a standardized exchange basis for geospatial RDF data which can support both qualitative and quantitative spatial reasoning and querying with the SPARQL database query language.

Semantic queries allow for queries and analytics of an associative and contextual nature. They enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results, or to answer fuzzier, more open-ended questions through pattern matching and digital reasoning.

GraphQL: Data query language developed by Facebook

GraphQL is a data query and manipulation language for APIs that allows a client to specify what data it needs. A GraphQL server can fetch data from separate sources for a single client query and present the results in a unified graph, so it is not tied to any specific database or storage engine.

Blazegraph: Open source triplestore and graph database

Blazegraph is an open source triplestore and graph database, developed by Systap, which is used in the Wikidata SPARQL endpoint and by other large customers. It is licensed under the GNU GPL.

TerminusDB is an open source knowledge graph and document store. It is used to build versioned data products. It is a native revision control database that is architecturally similar to Git. It is listed on DB-Engines.

QLever is an open-source triplestore and graph database developed by a team at the University of Freiburg led by Hannah Bast. QLever performs high-performance queries of semantic web knowledge bases, including full-text search within text corpora. A specialized user interface for QLever predictively autocompletes SPARQL queries.

References

  1. "Custom Data Commons". Docs - Data Commons. Retrieved 16 July 2024.
  2. "Data Commons is using AI to make the world's public data more accessible and helpful". Google. 13 September 2023. Retrieved 16 July 2024.
  3. Fensel, Dieter; Şimşek, Umutcan; Angele, Kevin; Huaman, Elwin; Kärle, Elias; Panasiuk, Oleksandra; Toma, Ioan; Umbrich, Jürgen; Wahler, Alexander (2020), "Introduction: What Is a Knowledge Graph?", Knowledge Graphs, Cham: Springer International Publishing, pp. 1–10, doi:10.1007/978-3-030-37439-6_1, ISBN 978-3-030-37438-9, S2CID 213620389, retrieved 2020-10-16.
  4. Guns, Raf (2013). "Tracing the origins of the semantic web". Journal of the American Society for Information Science and Technology. 64 (10): 2173–2181. doi:10.1002/asi.22907. hdl:10067/1111170151162165141.
  5. Funke, Daniel (7 December 2017). "This website helps you find related fact checks - and it was built by a 17-year-old". Poynter. Retrieved 16 July 2024.
  6. Guha, Ramanathan V. (15 October 2020). "Data Commons, now accessible on Google Search". docs.datacommons.org. Retrieved 2020-10-16.
  7. "Fact Checks". datacommons.org. 29 March 2019. Retrieved 14 October 2020.
  8. Jiang, Shan; Baumgartner, Simon; Ittycheriah, Abe; Yu, Cong (2020-04-20). "Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles". Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: ACM. pp. 1592–1603. doi:10.1145/3366423.3380231. ISBN 978-1-4503-7023-3. S2CID 215882520.
  9. Raghavan, Prabhakar (2020-10-15). "How AI is powering a more helpful Google". Google. Retrieved 2020-10-16.
  10. Sheth, Amit; Padhee, Swati; Gyrard, Amelie; Sheth, Amit (2019-07-01). "Knowledge Graphs and Knowledge Networks: The Story in Brief". IEEE Internet Computing. 23 (4): 67–75. arXiv:2003.03623. doi:10.1109/MIC.2019.2928449. ISSN 1089-7801. S2CID 204820800.
  11. Luong, Daphne; Chou, Charina (5 March 2019). "Doing our part to share open data responsibly". The Keyword. Retrieved 14 October 2020.
  12. Ramasubramanian, Sowmya (21 September 2020). "Google's open source data to study impact of COVID-19". The Hindu. Retrieved 14 October 2020.
  13. Manyika, James (19 September 2023). "Using data and AI to track progress toward the UN Global Goals". Google. Retrieved 22 July 2024.
  14. "Query the Data Commons Knowledge Graph using SPARQL". datacommons.org. Retrieved 14 October 2020.
  15. "Overview". datacommons.org. Retrieved 14 October 2020.
  16. "Contributing to Data Commons – Adding datasets". datacommons.org. Data Commons.
  17. Guha, Ramanathan V. (15 October 2020). "Data Commons, now accessible on Google Search". docs.datacommons.org. Retrieved 2020-10-16.
  18. "StatisticalPopulation type at Schema.org". schema.org. Retrieved 14 October 2020.
  19. "Observation type at Schema.org". schema.org. Retrieved 14 October 2020.
  20. "Proposal for representing Aggregate Statistical Data". GitHub – Schema.org repository. 25 June 2019. Retrieved 14 October 2020.
  21. "datacommons.org GitHub". GitHub.