DBpedia

Initial release: 10 January 2007
Stable release: DBpedia 2016-10 / 4 July 2017
License: GNU General Public License
Website: dbpedia.org

DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso.[1][2] DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.[3]

The project was heralded as "one of the more famous pieces" of the decentralized Linked Data effort by Tim Berners-Lee, one of the pioneers of the Internet.[4] As of June 2021, DBpedia contained over 850 million triples.

Background

The project was started by people at the Free University of Berlin and Leipzig University[5] in collaboration with OpenLink Software, and is now maintained by people at the University of Mannheim and Leipzig University.[6][7] The first publicly available dataset was published in 2007.[5] The data is made available under free licenses (CC BY-SA), allowing others to reuse the dataset; it does not, however, use an open data license to waive the sui generis database rights.

Wikipedia articles consist mostly of free text, but also include structured information embedded in the articles, such as "infobox" tables (the pull-out panels that appear in the top right of the default view of many Wikipedia articles, or at the start of the mobile versions), categorization information, images, geo-coordinates and links to external Web pages. This structured information is extracted and put in a uniform dataset which can be queried.
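
As a minimal illustration (a sketch, assuming the dbr: resource and dbo: ontology namespaces used by current DBpedia releases), an infobox parameter such as |author = Mia Ikumi on the article about the manga Tokyo Mew Mew would be extracted as roughly the following RDF statement, shown here in Turtle notation:

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

# One structured fact recovered from the article's infobox
dbr:Tokyo_Mew_Mew dbo:author dbr:Mia_Ikumi .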

Dataset

The 2016-04 release of the DBpedia data set describes 6.0 million entities, of which 5.2 million are classified in a consistent ontology, including 1.5 million persons, 810,000 places, 135,000 music albums, 106,000 films, 20,000 video games, 275,000 organizations, 301,000 species and 5,000 diseases.[8] DBpedia uses the Resource Description Framework (RDF) to represent extracted information and consists of 9.5 billion RDF triples, of which 1.3 billion were extracted from the English edition of Wikipedia and 5.0 billion from other language editions.[8]
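
Figures such as these can be checked with aggregate SPARQL queries over the ontology classes. A minimal sketch (dbo:Person is the class for persons in the DBpedia ontology; note that whole-dataset aggregates may time out on the public endpoint):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Count every resource typed as a person in the DBpedia ontology
SELECT (COUNT(?p) AS ?persons) WHERE {
  ?p a dbo:Person .
}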

From this data set, information spread across multiple pages can be combined. For example, book authorship can be assembled from pages about the work or about the author; the Tokyo Mew Mew query under Examples below shows this in practice.

One of the challenges in extracting information from Wikipedia is that the same concept can be expressed using different parameters in infobox and other templates, such as |birthplace= and |placeofbirth=. Because of this, queries about where people were born would have to search for both of these properties in order to get more complete results (as sketched in the query below). As a result, the DBpedia Mapping Language has been developed to help map these properties to an ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.[9]
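
A minimal sketch of such a query, assuming the raw infobox parameters are exposed under the http://dbpedia.org/property/ namespace (dbp:); Berlin is chosen as the birthplace purely for illustration:

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Before mapping: both synonymous raw properties must be queried explicitly
SELECT DISTINCT ?person WHERE {
  { ?person dbp:birthplace dbr:Berlin }
  UNION
  { ?person dbp:placeofbirth dbr:Berlin }
}

Once both parameters are mapped to a single ontology property (dbo:birthPlace in the current DBpedia ontology), the UNION becomes unnecessary and the pattern reduces to the single triple ?person dbo:birthPlace dbr:Berlin.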

Version 2014 was released in September 2014.[10] One main change from previous versions was the way abstract texts were extracted: running a local mirror of Wikipedia and retrieving rendered abstracts from it made the extracted texts considerably cleaner. A new data set extracted from Wikimedia Commons was also introduced.

As of June 2021, DBpedia contains over 850 million triples.[11]

Examples

DBpedia extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across multiple Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL.

For example, suppose one is interested in the Japanese shōjo manga series Tokyo Mew Mew and wants to find the genres of other works written by its illustrator, Mia Ikumi. DBpedia combines information from Wikipedia's entries on Tokyo Mew Mew, on Mia Ikumi and on this author's works such as Super Doll Licca-chan and Koi Cupid. Since DBpedia normalises information into a single database, the following query can be asked without needing to know exactly which entry carries each fragment of information, and will list the related genres:

PREFIX dbprop: <http://dbpedia.org/ontology/>
PREFIX db: <http://dbpedia.org/resource/>
SELECT ?who ?WORK ?genre
WHERE {
  db:Tokyo_Mew_Mew dbprop:author ?who .
  ?WORK dbprop:author ?who .
  OPTIONAL { ?WORK dbprop:genre ?genre } .
}
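
Queries such as this one can be run interactively at DBpedia's public SPARQL endpoint, https://dbpedia.org/sparql, which is served by OpenLink Virtuoso.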

Use cases

DBpedia has a broad scope of entities covering different areas of human knowledge. This makes it a natural hub for connecting datasets, where external datasets could link to its concepts.[12] The DBpedia dataset is interlinked on the RDF level with various other Open Data datasets on the Web. This enables applications to enrich DBpedia data with data from these datasets. As of September 2013, there were more than 45 million interlinks between DBpedia and external datasets, including: Freebase, OpenCyc, UMBEL, GeoNames, MusicBrainz, CIA World Fact Book, DBLP, Project Gutenberg, DBtune Jamendo, Eurostat, UniProt, Bio2RDF, and US Census data.[13][14] The Thomson Reuters initiative OpenCalais, the Linked Open Data project of The New York Times, the Zemanta API[15] and DBpedia Spotlight also include links to DBpedia.[16][17][18] The BBC uses DBpedia to help organize its content.[19][20] Faviki uses DBpedia for semantic tagging.[21] Samsung also includes DBpedia in its "Knowledge Sharing Platform".
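
These cross-dataset links are typically published as owl:sameAs triples, so they can be inspected with an ordinary SPARQL query. A minimal sketch, using Berlin as an arbitrary example resource:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbr: <http://dbpedia.org/resource/>

# List the external URIs that DBpedia declares equivalent to Berlin
SELECT ?external WHERE {
  dbr:Berlin owl:sameAs ?external .
}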

Such a rich source of structured cross-domain knowledge is fertile ground for artificial intelligence systems. DBpedia was used as one of the knowledge sources in IBM Watson's Jeopardy!-winning system.[22]

Amazon provides a DBpedia Public Data Set that can be integrated into Amazon Web Services applications.[23]

Data about creators from DBpedia can be used to enrich observations of artwork sales.[24]

The crowdsourcing software company, Ushahidi, built a prototype of its software that leveraged DBpedia to perform semantic annotations on citizen-generated reports. The prototype incorporated the "YODIE" (Yet another Open Data Information Extraction system) service[25] developed by the University of Sheffield, which uses DBpedia to perform the annotations. The goal for Ushahidi was to improve the speed and facility with which incoming reports could be validated and managed.[26]

DBpedia Spotlight

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text. This allows linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight performs named entity extraction, including entity detection and name resolution (in other words, disambiguation). It can also be used for named entity recognition, and other information extraction tasks. DBpedia Spotlight aims to be customizable for many use cases. Instead of focusing on a few entity types, the project strives to support the annotation of all 3.5 million entities and concepts from more than 320 classes in DBpedia. The project started in June 2010 at the Web Based Systems Group at the Free University of Berlin.
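
Once a mention in a text has been linked to a DBpedia resource, any SPARQL query can enrich it. A minimal sketch, assuming a mention of Berlin has been disambiguated to dbr:Berlin (dbo:abstract is the ontology property holding article abstracts):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Fetch the types and the English abstract of the annotated entity
SELECT ?type ?abstract WHERE {
  dbr:Berlin a ?type .
  dbr:Berlin dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}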

DBpedia Spotlight is publicly available as a web service for testing and a Java/Scala API licensed via the Apache License. The DBpedia Spotlight distribution includes a jQuery plugin that allows developers to annotate pages anywhere on the Web by adding one line to their page.[27] Clients are also available in Java or PHP.[28] The tool handles various languages through its demo page[29] and web services. Internationalization is supported for any language that has a Wikipedia edition.[30]

Archivo ontology database

Since 2020, the DBpedia project has provided Archivo, a regularly updated database of web-accessible ontologies written in the OWL ontology language.[31] Archivo also provides a four-star rating scheme for the ontologies it scrapes, based on accessibility, quality, and related fitness-for-use criteria. For instance, SHACL compliance for graph-based data is evaluated when appropriate. Ontologies should also contain metadata about their characteristics and specify a public license describing their terms of use.[32][33] As of June 2021, the Archivo database contains 1,368 entries.

History

DBpedia was initiated in 2007 by Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak and Zachary Ives.[5]

Related Research Articles

Semantic Web: Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) standard originally designed as a data model for metadata. It has come to be used as a general method for description and exchange of graph data. RDF provides a variety of syntax notations and data serialization formats, with Turtle currently being the most widely used notation.

SPARQL is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 was acknowledged by W3C as an official recommendation, and SPARQL 1.1 in March 2013.

A semantic wiki is a wiki that has an underlying model of the knowledge described in its pages. Regular, or syntactic, wikis have structured text and untyped hyperlinks. Semantic wikis, on the other hand, provide the ability to capture or identify information about the data within pages, and the relationships between pages, in ways that can be queried or exported like a database through semantic queries.

Simple Knowledge Organization System (SKOS) is a W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is part of the Semantic Web family of standards built upon RDF and RDFS, and its main objective is to enable easy publication and use of such vocabularies as linked data.

Ontotext is a software company with offices in Europe and the USA. It is the semantic technology branch of Sirma Group. Its main domain of activity is the development of software based on the Semantic Web languages and standards, in particular RDF, OWL and SPARQL. Ontotext is best known for the Ontotext GraphDB semantic graph database engine. Another major business line is the development of enterprise knowledge management and analytics systems that involve big knowledge graphs. Those systems are developed on top of the Ontotext Platform that builds on top of GraphDB capabilities for text mining using big knowledge graphs.

Apache Jena: Open source semantic web framework for Java

Apache Jena is an open source Semantic Web framework for Java. It provides an API to extract data from and write to RDF graphs. The graphs are represented as an abstract "model". A model can be sourced with data from files, databases, URLs or a combination of these. A model can also be queried through SPARQL 1.1.

The concept of the Social Semantic Web subsumes developments in which social interactions on the Web lead to the creation of explicit and semantically rich knowledge representations. The Social Semantic Web can be seen as a Web of collective knowledge systems, which are able to provide useful information based on human contributions and which get better as more people participate. The Social Semantic Web combines technologies, strategies and methodologies from the Semantic Web, social software and the Web 2.0.

Linked data: Structured data and method for its publication

In computing, linked data is structured data which is interlinked with other data so it becomes more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages only for human readers, it extends them to share information in a way that can be read automatically by computers. Part of the vision of linked data is for the Internet to become a global database.

GeoNames: Geographical database available and accessible through various web services

GeoNames is a user-editable geographical database available and accessible through various web services, under a Creative Commons attribution license. The project was founded in late 2005.

Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members. It was an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions. Freebase aimed to create a global resource that allowed people to access common information more effectively. It was developed by the American software company Metaweb and run publicly beginning in March 2007. Metaweb was acquired by Google in a private sale announced on 16 July 2010. Google's Knowledge Graph is powered in part by Freebase.

The FAO geopolitical ontology is an ontology developed by the Food and Agriculture Organization of the United Nations (FAO) to describe, manage and exchange data related to geopolitical entities such as countries, territories, regions and other similar areas.

YAGO (database): Open-source information repository

YAGO is an open source knowledge base developed at the Max Planck Institute for Informatics in Saarbrücken. It is automatically extracted from Wikipedia and other sources.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Infobox: Template used to collect and present a subset of information about a subject

An infobox is a digital or physical table used to collect and present a subset of information about its subject, such as a document. It is a structured document containing a set of attribute–value pairs, and in Wikipedia represents a summary of information about the subject of an article. In this way, they are comparable to data tables in some aspects. When presented within the larger document it summarizes, an infobox is often presented in a sidebar format.

The Open Semantic Framework (OSF) is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components developed specifically to provide a complete Web application framework. OSF is made available under the Apache 2 license.

UMBEL

UMBEL is a logically organized knowledge graph of 34,000 concepts and entity types that can be used in information science for relating information from disparate sources to one another. It was first released in July 2008; version 1.00 was released in February 2011, and its most recent release is version 1.50. UMBEL was retired at the end of 2019.

In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has been a point of focal activity for several W3C community groups, research projects, and infrastructure efforts since then.

Schema-agnostic databases, or vocabulary-independent databases, aim to abstract users from the representation of the data by supporting automatic semantic matching between queries and databases. Schema-agnosticism is the property of a database of accepting a query issued in the user's own terminology and structure and automatically mapping it to the dataset's vocabulary.

Knowledge graph: Type of knowledge base

In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the semantics or relationships underlying these entities.

References

  1. Bizer, Christian; Lehmann, Jens; Kobilarov, Georgi; Auer, Sören; Becker, Christian; Cyganiak, Richard; Hellmann, Sebastian (September 2009). "DBpedia - A crystallization point for the Web of Data" (PDF). Web Semantics: Science, Services and Agents on the World Wide Web. 7 (3): 154–165. CiteSeerX 10.1.1.150.4898. doi:10.1016/j.websem.2009.07.002. ISSN 1570-8268. Archived from the original (PDF) on 10 August 2017. Retrieved 11 December 2015.
  2. "About DBpedia". DBpedia. Retrieved 14 January 2024.
  3. "Komplett verlinkt — Linked Data" (in German). 3sat. 19 June 2009. Archived from the original on 6 January 2013. Retrieved 10 November 2009.
  4. "Sir Tim Berners-Lee Talks with Talis about the Semantic Web". Talis. 7 February 2008. Archived from the original on 10 May 2013.
  5. Auer, Sören; Bizer, Christian; Kobilarov, Georgi; Lehmann, Jens; Cyganiak, Richard; Ives, Zachary (2007). "DBpedia: A Nucleus for a Web of Open Data". The Semantic Web. Lecture Notes in Computer Science. Vol. 4825. Springer. pp. 722–735.
  6. "Credits". DBpedia. Archived from the original on 21 September 2014. Retrieved 9 September 2014.
  7. "Home".
  8. "YEAH! We did it again ;) – New 2016-04 DBpedia release". DBpedia. 19 October 2016. Retrieved 9 January 2019.
  9. "DBpedia Mappings". mappings.dbpedia.org. Retrieved 3 April 2010.
  10. "Changelog". DBpedia. September 2014. Retrieved 9 September 2014.
  11. Holze, Julia (23 July 2021). "Announcement: DBpedia Snapshot 2021-06 Release". DBpedia Association. Retrieved 28 July 2021.
  12. E. Curry, A. Freitas, and S. O'Riáin, "The Role of Community-Driven Data Curation for Enterprises", Archived 23 January 2012 at the Wayback Machine in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.
  13. "Statistics on links between Data sets", SWEO Community Project: Linking Open Data on the Semantic Web, W3C, retrieved 24 November 2009
  14. "Statistics on Data sets", SWEO Community Project: Linking Open Data on the Semantic Web, W3C, retrieved 24 November 2009
  15. "Zemanta API". dev.zemanta.com. Retrieved 26 July 2021.
  16. Sandhaus, Evan; Larson, Rob (29 October 2009). "First 5,000 Tags Released to the Linked Data Cloud". The New York Times Blogs. Retrieved 10 November 2009.
  17. "Life in the Linked Data Cloud". opencalais.com. Archived from the original on 24 November 2009. Retrieved 10 November 2009. Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format.
  18. "Zemanta talks Linked Data with SDK and commercial API". ZDNet. Archived from the original on 28 February 2010. Retrieved 10 November 2009. Zemanta fully supports the Linking Open Data initiative. It is the first API that returns disambiguated entities linked to dbPedia, Freebase, MusicBrainz, and Semantic Crunchbase.
  19. "European Semantic Web Conference 2009 - Georgi Kobilarov, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael Smethurst, Christian Bizer and Robert Lee. Media meets Semantic Web - How the BBC uses DBpedia and Linked Data to make Connections". eswc2009.org. Archived from the original on 8 June 2009. Retrieved 10 November 2009.
  20. "BBC Learning - Open Lab - Reference". BBC. Archived from the original on 25 August 2009. Retrieved 10 November 2009. Dbpedia is a database version of Wikipedia. It is used in a lot of projects for a wide range of different reasons. At the BBC we are using it for tagging content.
  21. "Semantic Tagging with Faviki". readwriteweb.com. Archived from the original on 29 January 2010.
  22. David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty "Building Watson: An Overview of the DeepQA Project." Archived 6 November 2020 at the Wayback Machine In AI Magazine Fall, 2010. Association for the Advancement of Artificial Intelligence (AAAI).
  23. "Amazon Web Services Developer Community : DBpedia". developer.amazonwebservices.com. Archived from the original on 13 February 2010. Retrieved 10 November 2009.
  24. Filipiak, Dominik; Filipowska, Agata (2 December 2015). "DBpedia in the Art Market". Business Information Systems Workshops. Lecture Notes in Business Information Processing. Vol. 228. pp. 321–331. doi:10.1007/978-3-319-26762-3_28. ISBN 978-3-319-26761-6.
  25. "GATE.ac.uk - applications/yodie.html". gate.ac.uk. Retrieved 11 May 2020.
  26. "ushahidi/platform-comrades". GitHub. 30 June 2019. Retrieved 9 March 2020.
  27. Mendes, Pablo. "DBpedia Spotlight jQuery Plugin". jQuery Plugins. Archived from the original on 3 April 2011. Retrieved 15 September 2011.
  28. DiCiuccio, Rob (25 September 2016). "PHP Client for DBpedia Spotlight". GitHub.
  29. "Demo of DBpedia Spotlight" . Retrieved 8 September 2013.
  30. "Internationalization of DBpedia Spotlight". GitHub . Retrieved 8 September 2013.
  31. "DBpedia Archivo" . Retrieved 8 July 2021.
  32. Frey, Johannes; Streitmatter, Denis; Götz, Fabian; Hellmann, Sebastian; Arndt, Natanael (27 October 2020). "DBpedia Archivo: a web-scale interface for ontology archiving under consumer-oriented aspects". In Sure-Vetter, York; Sack, Harald; Cudré-Mauroux, Philippe; Maleshkova, Maria; Pellegrini, Tassilo; Acosta, Maribel (eds.). Semantic systems: the power of AI and knowledge graphs. Cham, Switzerland: Springer. doi:10.1007/978-3-030-59833-4_2. ISBN 978-3-030-59832-7. S2CID 219939266.
  33. Frey, Johannes; Streitmatter, Denis; Götz, Fabian; Hellmann, Sebastian; Arndt, Natanael (10 September 2020). DBpedia Archivo: a web-scale interface for ontology archiving under consumer-oriented aspects. Leipzig, Germany: Institut für Angewandte Informatik (InfAI). Retrieved 8 July 2021. YouTube video 00:10:38.