Datacommons.org

Datacommons.org is an open knowledge graph hosted by Google that provides a unified view across multiple public datasets, combining economic, scientific and other open datasets into an integrated data graph. [1] The site was launched in May 2018 with an initial dataset of fact-checking data published in the Schema.org "ClaimReview" format by several fact checkers from the International Fact-Checking Network. [2] [3] Google has worked with partners including the United States Census, the World Bank, and the US Bureau of Labor Statistics to populate the repository, [4] which also hosts data from Wikipedia, the National Oceanic and Atmospheric Administration and the Federal Bureau of Investigation. [5] During 2019 the service expanded to include an RDF-style knowledge graph populated from a number of largely statistical open datasets, and it was announced to a wider audience in the same year. [6] In 2020 the service improved its coverage of non-US datasets and added more bioinformatics and coronavirus data. [7]

Features

Datacommons.org places more emphasis on statistical data than is common for Linked Data and knowledge graph initiatives. It includes geographical, demographic, weather and real estate data alongside other categories, [1] describing states, Congressional districts, and cities in the United States as well as biological specimens, power plants, and elements of the human genome via the Encyclopedia of DNA Elements (ENCODE) project. [5] It represents data as semantic triples, each of which can have its own provenance. [1] It centers on the entity-oriented integration of statistical observations from a variety of public datasets. Although it supports a subset of the W3C SPARQL query language, [8] its APIs [9] also include tools (such as a Pandas dataframe interface) oriented towards data science, statistics and data visualization.
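The idea of triples that each carry their own provenance can be illustrated with a minimal Python sketch. This is not how Datacommons.org stores or queries its graph; the `Triple` class, the `match` helper, the entity identifiers and the source labels are all assumptions made up for the illustration. The `match` function plays the role of a single SPARQL-style triple pattern, with `None` acting as a wildcard.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Triple:
    """One subject-predicate-object statement plus the source it came from."""
    subject: str
    predicate: str
    obj: str
    provenance: str  # hypothetical label for the contributing dataset


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; a None argument matches anything."""
    for t in triples:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


# Placeholder graph: identifiers and source labels are illustrative only.
GRAPH = [
    Triple("geoId/06", "typeOf", "State", "source-a"),
    Triple("geoId/06", "name", "California", "source-a"),
    Triple("geoId/06", "containedInPlace", "country/USA", "source-b"),
    Triple("geoId/48", "typeOf", "State", "source-a"),
]

# "Which entities are states?" expressed as one triple pattern.
states = [t.subject for t in match(GRAPH, predicate="typeOf", obj="State")]

# Provenance stays attached to each statement, so the sources behind
# everything known about one entity can be listed directly.
sources = {t.provenance for t in match(GRAPH, subject="geoId/06")}
```

Keeping provenance on the individual statement, rather than on the dataset as a whole, is what lets facts about one entity come from several sources while each remains traceable.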

Datacommons.org is integrative, meaning that, rather than providing a hosting platform for diverse datasets, it attempts to consolidate much of the information the datasets provide into a single data graph.
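What "integrative" means in practice can be sketched as follows. The example assumes two hypothetical source tables that identify the same places by different local keys, and two hypothetical lookup tables that resolve those keys to a shared entity identifier. All names and numbers are placeholders, not real statistics, and the approach is a simplification rather than a description of the project's actual pipeline.

```python
from collections import defaultdict

# Hypothetical source tables; the values are placeholders, not real data.
POPULATION_BY_FIPS = {"06": 100, "48": 80}                 # keyed by a numeric code
RAINFALL_BY_NAME = {"California": 5.5, "Texas": 7.0}       # keyed by a place name

# Hypothetical resolution tables: each dataset's local key -> shared entity ID.
FIPS_TO_ENTITY = {"06": "place/california", "48": "place/texas"}
NAME_TO_ENTITY = {"California": "place/california", "Texas": "place/texas"}


def integrate() -> dict:
    """Merge both tables into one mapping: entity -> {variable: (value, source)}."""
    graph = defaultdict(dict)
    for fips, value in POPULATION_BY_FIPS.items():
        graph[FIPS_TO_ENTITY[fips]]["population"] = (value, "population-dataset")
    for name, value in RAINFALL_BY_NAME.items():
        graph[NAME_TO_ENTITY[name]]["rainfall"] = (value, "weather-dataset")
    return dict(graph)


MERGED = integrate()
```

After the merge, a single lookup such as `MERGED["place/california"]` returns observations from both sources, whereas a hosting platform would leave the two tables side by side for the user to join.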

Technology

Datacommons.org is built on a graph data model. The graph can be accessed through a browser interface and several APIs, [1] [5] and is expanded by loading data (typically CSV and MCF-based templates). [10] The graph can also be accessed by natural language queries in Google Search. [11] The data vocabulary used to define the datacommons.org graph is based upon Schema.org. [1] In particular, the Schema.org terms StatisticalPopulation [12] and Observation [13] were proposed to Schema.org to support datacommons-like use cases. [14]
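The split between a population and the observations made about it can be sketched with two small Python classes. This is only a loose illustration of the StatisticalPopulation and Observation idea: the field names below are assumptions chosen for readability and are not guaranteed to match the Schema.org definitions, and the place identifier and values are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class StatisticalPopulation:
    """A set of things of one type in one place, narrowed by constraints."""
    place: str
    population_type: str
    constraints: Tuple[Tuple[str, str], ...] = ()  # (property, value) pairs


@dataclass(frozen=True)
class Observation:
    """One measurement of one property of a population at one date."""
    observed_node: StatisticalPopulation
    measured_property: str
    measured_value: float
    observation_date: str


# Placeholder population: "female persons in place/example".
women = StatisticalPopulation(
    place="place/example",
    population_type="Person",
    constraints=(("gender", "Female"),),
)

# Placeholder observations; the numbers are not real statistics.
observations = [
    Observation(women, "count", 1200, "2018"),
    Observation(women, "count", 1250, "2019"),
]


def time_series(
    obs: Iterable[Observation],
    population: StatisticalPopulation,
    measured_property: str,
) -> Dict[str, float]:
    """Collect date -> value for one property of one population."""
    return {
        o.observation_date: o.measured_value
        for o in obs
        if o.observed_node == population and o.measured_property == measured_property
    }


counts = time_series(observations, women, "count")
```

Separating the two means a population is described once, and any number of dated measurements can point at it without repeating its definition.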

Software from the project is available on GitHub under the Apache 2 license. [15]

References

  1. Fensel, Dieter; Şimşek, Umutcan; Angele, Kevin; Huaman, Elwin; Kärle, Elias; Panasiuk, Oleksandra; Toma, Ioan; Umbrich, Jürgen; Wahler, Alexander (2020), "Introduction: What Is a Knowledge Graph?", Knowledge Graphs, Cham: Springer International Publishing, pp. 1–10, doi:10.1007/978-3-030-37439-6_1, ISBN 978-3-030-37438-9, S2CID 213620389, retrieved 2020-10-16
  2. "Fact Checks". datacommons.org. 29 March 2019. Retrieved 14 October 2020.
  3. Jiang, Shan; Baumgartner, Simon; Ittycheriah, Abe; Yu, Cong (2020-04-20). "Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles". Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: ACM. pp. 1592–1603. doi:10.1145/3366423.3380231. ISBN 978-1-4503-7023-3. S2CID 215882520.
  4. Raghavan, Prabhakar (2020-10-15). "How AI is powering a more helpful Google". Google. Retrieved 2020-10-16.
  5. Sheth, Amit; Padhee, Swati; Gyrard, Amelie (2019-07-01). "Knowledge Graphs and Knowledge Networks: The Story in Brief". IEEE Internet Computing. 23 (4): 67–75. arXiv:2003.03623. doi:10.1109/MIC.2019.2928449. ISSN 1089-7801. S2CID 204820800.
  6. Luong, Daphne; Chou, Charina (5 March 2019). "Doing our part to share open data responsibly". The Keyword. Retrieved 14 October 2020.
  7. Ramasubramanian, Sowmya (21 September 2020). "Google's open source data to study impact of COVID-19". The Hindu. Retrieved 14 October 2020.
  8. "Query the Data Commons Knowledge Graph using SPARQL". datacommons.org. Retrieved 14 October 2020.
  9. "Overview". datacommons.org. Retrieved 14 October 2020.
  10. "Contributing to Data Commons - Adding datasets". datacommons.org. Data Commons.
  11. Guha, Ramanathan V. (15 October 2020). "Data Commons, now accessible on Google Search". docs.datacommons.org. Retrieved 2020-10-16.
  12. "StatisticalPopulation type at Schema.org". schema.org. Retrieved 14 October 2020.
  13. "Observation type at Schema.org". schema.org. Retrieved 14 October 2020.
  14. "Proposal for representing Aggregate Statistical Data". GitHub - Schema.org repository. 25 June 2019. Retrieved 14 October 2020.
  15. "datacommons.org GitHub". GitHub.