Type of site | Knowledge graph
---|---
Available in | Multiple languages
Owner | Wikimedia Foundation
Editor | Wikimedia community
URL | www.wikidata.org
Commercial | No
Registration | Optional
Launched | 29 October 2012 [1]
Wikidata is a collaboratively edited multilingual knowledge graph hosted by the Wikimedia Foundation. [2] It is a common source of open data that Wikimedia projects such as Wikipedia, [3] [4] and anyone else, can use under the CC0 public domain license. Wikidata is a wiki powered by the MediaWiki software, including Wikibase, its extension for semi-structured data. As of mid-2024, Wikidata contained 1.57 billion item statements (semantic triples). [5]
Wikidata is a document-oriented database focused on items, which can represent any kind of topic, concept, or object. Each item is allocated a unique, persistent identifier, a positive integer prefixed with the upper-case letter Q, known as a "QID". Q is the starting letter of the first name of Qamarniso Vrandečić (née Ismoilova), an Uzbek Wikimedian married to the Wikidata co-developer Denny Vrandečić. [6] Because the identifier itself is language-neutral, the basic information required to identify the topic an item covers can be translated without favouring any language.
Examples of items include 1988 Summer Olympics (Q8470), love (Q316), Johnny Cash (Q42775), Elvis Presley (Q303), and Gorilla (Q36611).
Item labels do not need to be unique. For example, there are two items named "Elvis Presley": Elvis Presley (Q303), which represents the American singer and actor, and Elvis Presley (Q610926), which represents his self-titled album. However, the combination of a label and its description must be unique. To avoid ambiguity, an item's unique identifier (QID) is hence linked to this combination.
Fundamentally, an item consists of:

- an identifier (the QID),
- a label (a name or title),
- a description,
- aliases (alternative terms for the label), and
- any number of statements, along with sitelinks to pages about the topic on other Wikimedia projects.
Statements are how any information known about an item is recorded in Wikidata. Formally, they consist of key–value pairs, which match a property (such as "author", or "publication date") with one or more entity values (such as "Sir Arthur Conan Doyle" or "1902"). For example, the informal English statement "milk is white" would be encoded by a statement pairing the property color (P462) with the value white (Q23444) under the item milk (Q8495).
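As a concrete illustration, the Python sketch below fetches the JSON export of milk (Q8495) from Wikidata's Special:EntityData endpoint and reads its color (P462) statement. The endpoint and IDs come from the example above; the field names reflect the Wikibase JSON model as currently served and should be verified against live output.

```python
import requests

# Fetch the full JSON export of the item "milk" (Q8495).
# Special:EntityData is Wikidata's linked-data export endpoint.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q8495.json"
entity = requests.get(url, timeout=30).json()["entities"]["Q8495"]

# Statements are grouped by property ID under the "claims" key.
# P462 is "color"; its value here is another item, given by a QID.
for statement in entity["claims"].get("P462", []):
    snak = statement["mainsnak"]
    if snak["snaktype"] == "value":  # skip "no value"/"unknown value" snaks
        print(snak["datavalue"]["value"]["id"])  # expected: "Q23444" (white)
```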
Statements may map a property to more than one value. For example, the "occupation" property for Marie Curie could be linked with the values "physicist" and "chemist", to reflect the fact that she engaged in both occupations. [7]
Values may take on many types including other Wikidata items, strings, numbers, or media files. Properties prescribe what types of values they may be paired with. For example, the property official website (P856) may only be paired with values of type "URL". [8]
Optionally, qualifiers can refine the meaning of a statement by providing additional information. For example, a "population" statement could be modified with a qualifier such as "point in time (P585): 2011" (itself a property–value pair). Statements may also be annotated with references, pointing to a source backing up the statement's content. [9] As with statements, all qualifiers and references are property–value pairs.
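The sketch below shows, in simplified form, how such a statement nests in the Wikibase JSON model: the main property–value pair sits in "mainsnak", with qualifiers and references attached alongside it. The field names follow the real JSON model, but the population figure and the reference URL are purely illustrative.

```python
# A simplified, illustrative slice of one statement in the Wikibase JSON
# model: "population = 3,500,000", qualified by "point in time: 2011" and
# backed by a single reference. Real exports carry more bookkeeping fields.
statement = {
    "mainsnak": {
        "property": "P1082",  # population
        "datavalue": {"value": {"amount": "+3500000"}, "type": "quantity"},
    },
    "qualifiers": {
        "P585": [  # point in time
            {"datavalue": {"value": {"time": "+2011-00-00T00:00:00Z"},
                           "type": "time"}}
        ],
    },
    "references": [
        {"snaks": {"P854": [  # reference URL (value is a made-up example)
            {"datavalue": {"value": "https://example.org/census-2011",
                           "type": "string"}}
        ]}}
    ],
}
```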
Each property has a numeric identifier prefixed with a capital P and its own page on Wikidata with an optional label, description, aliases, and statements. Because property pages can themselves carry statements, some properties exist solely to describe other properties, such as subproperty of (P1647).
Properties may also define more complex rules about their intended usage, termed constraints. For example, the capital (P36) property includes a "single value constraint", reflecting the reality that (typically) territories have only one capital city. Constraints are treated as testing alerts and hints, rather than inviolable rules. [10]
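Constraints are themselves recorded as statements on the property's page, using property constraint (P2302). As a hedged sketch, the snippet below uses the wbgetclaims module of the MediaWiki Action API to list the constraint types declared on capital (P36); the module and property IDs are real, but the response shape should be checked against the live API.

```python
import requests

# List the constraint statements declared on the property "capital" (P36).
# Constraints are ordinary statements using "property constraint" (P2302).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetclaims",
        "entity": "P36",
        "property": "P2302",
        "format": "json",
    },
    timeout=30,
)
for claim in resp.json()["claims"].get("P2302", []):
    # Each constraint type is itself an item, e.g. "single-value constraint".
    print(claim["mainsnak"]["datavalue"]["value"]["id"])
```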
Before a new property is created, it needs to undergo a discussion process. [11] [12]
The most used property is cites work (P2860), which had been used in more than 290,000,000 statements as of November 2023. [13]
In linguistics, a lexeme is a unit of lexical meaning representing a group of words that share the same core meaning and grammatical characteristics. [14] [15] Wikidata's lexemes are, similarly, entities whose structure is suited to storing lexicographical data. Since 2016, Wikidata has supported lexicographical entries in the form of lexemes. [16]
In Wikidata, lexicographical entries have a different identifier from regular item entries. These entries are prefixed with the letter L, such as in the example entries for book and cow. Lexicographical entries in Wikidata can contain statements, senses, and forms. [17] The use of lexicographical entries in Wikidata allows for the documentation of word usage, the connection between words and items on Wikidata, word translations, and enables machine-readable lexicographical data.
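Lexemes are served by the same Special:EntityData endpoint as items. The hedged sketch below fetches one lexeme's JSON export and inspects its lemmas, forms, and senses; "L42" is only a placeholder ID to substitute, and the field names should be verified against live output.

```python
import requests

# Fetch a lexeme's JSON export. Lexemes use "L" identifiers; L42 is a
# placeholder here -- substitute the lexeme you are interested in.
lexeme_id = "L42"
url = f"https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json"
lexeme = requests.get(url, timeout=30).json()["entities"][lexeme_id]

print(lexeme["lemmas"])       # spellings of the lemma, keyed by language code
print(len(lexeme["forms"]))   # inflected forms, each with its own form ID
print(len(lexeme["senses"]))  # senses, each with glosses and statements
```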
In 2020, the number of lexicographical entries on Wikidata exceeded 250,000. The language with the most entries was Russian, with 101,137 lexemes, followed by English with 38,122. Overall, more than 668 languages had lexicographical entries on Wikidata. [18]
In Wikidata, a schema is a data model that outlines the necessary attributes for a data item. [19] [20] For instance, a data item that uses the attribute "instance of" with the value "human" would typically include attributes such as "place of birth", "date of birth", "date of death", and "place of death". [21] Entity schemas in Wikidata use Shape Expressions (ShEx) to describe the data in Wikidata items as Resource Description Framework (RDF) data. [22] The use of entity schemas helps address data inconsistencies and unchecked vandalism. [19]
In January 2019, development started on a new MediaWiki extension to enable storing ShEx in a separate namespace. [23] [24] Entity schemas have their own identifiers, distinct from those used for items, properties, and lexemes: they are stored with an "E" identifier, such as E10 for the entity schema of human data instances and E270 for that of building data instances. The extension has since been installed on Wikidata [25] and enables contributors to use ShEx to validate and describe Resource Description Framework data in items and lexemes. Any item or lexeme on Wikidata can be validated against an entity schema, which makes schemas an important tool for quality assurance.
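A schema's raw ShEx text can be retrieved programmatically. The sketch below assumes the Special:EntitySchemaText page, which the EntitySchema extension uses to serve a schema as plain text; treat both the page name and its behaviour as assumptions to verify.

```python
import requests

# Retrieve the raw ShEx text of entity schema E10 ("human").
# Special:EntitySchemaText serving plain ShEx is an assumption to verify;
# the schema is also displayed on the EntitySchema:E10 wiki page.
url = "https://www.wikidata.org/wiki/Special:EntitySchemaText/E10"
print(requests.get(url, timeout=30).text)
```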
Wikidata's content collections include data for biographies, [26] medicine, [27] digital humanities, [28] and scholarly metadata through the WikiCite project. [29]
It also incorporates data collections from other open projects, including Freebase. [30]
The creation of the project was funded by donations from the Allen Institute for Artificial Intelligence, the Gordon and Betty Moore Foundation, and Google, Inc., totaling €1.3 million. [31] [32] The development of the project is mainly driven by Wikimedia Deutschland under the management of Lydia Pintscher, and was originally split into three phases: [33]

1. Centralising interlanguage links between the language editions of Wikipedia
2. Providing a central place for infobox data for all Wikipedias
3. Creating and updating list articles based on data in Wikidata
Wikidata was launched on 29 October 2012 and was the first new project of the Wikimedia Foundation since 2006. [3] [34] [35] At this time, only the centralization of language links (phase one) was available. This enabled items to be created and filled with basic information: a label (a name or title), aliases (alternative terms for the label), a description, and links to articles about the topic in all the various language editions of Wikipedia (interlanguage links).
Historically, a Wikipedia article would include a list of interlanguage links (links to articles on the same topic in other editions of Wikipedia, if they existed). Wikidata was originally a self-contained repository of interlanguage links. [36] Wikipedia language editions were still not able to access Wikidata, so they needed to continue to maintain their own lists of interlanguage links.
On 14 January 2013, the Hungarian Wikipedia became the first to enable the provision of interlanguage links via Wikidata. [37] This functionality was extended to the Hebrew and Italian Wikipedias on 30 January, to the English Wikipedia on 13 February and to all other Wikipedias on 6 March. [38] [39] [40] [41] After no consensus was reached over a proposal to restrict the removal of language links from the English Wikipedia, [42] they were automatically removed by bots. On 23 September 2013, interlanguage links went live on Wikimedia Commons. [43]
On 4 February 2013, statements were introduced to Wikidata entries. The possible values for properties were initially limited to two data types (items and images on Wikimedia Commons), with more data types (such as coordinates and dates) to follow later. The first new type, string, was deployed on 6 March. [44]
The ability for the various language editions of Wikipedia to access data from Wikidata was rolled out progressively between 27 March and 25 April 2013. [45] [46] On 16 September 2015, Wikidata began allowing so-called arbitrary access, or access from a given article of a Wikipedia to the statements on Wikidata items not directly connected to it. For example, it became possible to read data about Germany from the Berlin article, which was not feasible before. [47] On 27 April 2016, arbitrary access was activated on Wikimedia Commons. [48]
According to a 2020 study, a large proportion of the data on Wikidata consists of entries imported en masse from other databases by Internet bots, which helps to "break down the walls" of data silos. [49]
On 7 September 2015, the Wikimedia Foundation announced the release of the Wikidata Query Service, [50] which lets users run queries on the data contained in Wikidata. [51] The service uses SPARQL as the query language. As of November 2018, there were at least 26 different tools that allowed querying the data in different ways. [52] The service uses Blazegraph as its triplestore and graph database. [53] [54]
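As an illustration, the hedged Python sketch below sends a SPARQL query to the public endpoint at query.wikidata.org, asking for Marie Curie's occupations as in the earlier example; the item and property IDs (Q7186 for Marie Curie, P106 for occupation) are assumptions to verify.

```python
import requests

# Query the Wikidata Query Service for Marie Curie's occupations.
# Q7186 is Marie Curie and P106 is "occupation" (verify both IDs).
query = """
SELECT ?occupationLabel WHERE {
  wd:Q7186 wdt:P106 ?occupation .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-example/0.1"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["occupationLabel"]["value"])  # e.g. "physicist", "chemist"
```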
In 2021, Wikimedia Deutschland released the Query Builder, [55] "a form-based query builder to allow people who don't know how to use SPARQL" to write a query.
The bars on the logo contain the word "WIKI" encoded in Morse code. [56] It was created by Arun Ganesh and selected through community decision. [57]
In November 2014, Wikidata received the Open Data Publisher Award from the Open Data Institute "for sheer scale, and built-in openness". [58]
In December 2014, Google announced that it would shut down Freebase in favor of Wikidata. [59]
As of November 2018, Wikidata information was used in 58.4% of all English Wikipedia articles, mostly for external identifiers or coordinate locations. In aggregate, data from Wikidata was shown on 64% of pages across all Wikipedias, 93% of Wikivoyage articles, 34% of Wikiquote pages, 32% of Wikisource pages, and 27% of Wikimedia Commons pages. [60]
As of December 2020, Wikidata's data was visualized by at least 20 external tools, [61] and over 300 papers had been published about Wikidata. [62]
A systematic literature review of the uses of Wikidata in research was carried out in 2019. [68]
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
The Resource Description Framework (RDF) is a method to describe and exchange graph data. It was originally designed as a data model for metadata by the World Wide Web Consortium (W3C). It provides a variety of syntax notations and data serialization formats, of which the most widely used is Turtle.
Wiktionary is a multilingual, web-based project to create a free content dictionary of terms in all natural languages and in a number of artificial languages. These entries may contain definitions, images for illustration, pronunciations, etymologies, inflections, usage examples, quotations, related terms, and translations of terms into other languages, among other features. It is collaboratively edited via a wiki. Its name is a portmanteau of the words wiki and dictionary. It is available in 195 languages and in Simple English. Like its sister project Wikipedia, Wiktionary is run by the Wikimedia Foundation, and is written collaboratively by volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki, allows almost anyone with access to the website to create and edit entries.
MediaWiki is free and open-source wiki software originally developed by Magnus Manske and first deployed on Wikipedia on January 25, 2002. It was further improved by Lee Daniel Crocker, after which development has been coordinated by the Wikimedia Foundation. It powers several wiki hosting websites across the Internet, as well as most websites hosted by the Wikimedia Foundation, including Wikipedia, Wiktionary, Wikimedia Commons, Wikiquote, Meta-Wiki and Wikidata, which define a large part of the requirements for the software. Besides its use on Wikimedia sites, MediaWiki has been used as a knowledge management and content management system on websites such as Fandom and wikiHow, and in major internal installations such as Intellipedia and Diplopedia.
WinFS was the code name for a canceled data storage and management system project based on relational databases, developed by Microsoft and first demonstrated in 2003. It was intended as an advanced storage subsystem for the Microsoft Windows operating system, designed for persistence and management of structured, semi-structured and unstructured data.
SPARQL is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 was acknowledged by the W3C as an official recommendation, and SPARQL 1.1 followed in March 2013.
A semantic wiki is a wiki that has an underlying model of the knowledge described in its pages. Regular, or syntactic, wikis have structured text and untyped hyperlinks. Semantic wikis, on the other hand, provide the ability to capture or identify information about the data within pages, and the relationships between pages, in ways that can be queried or exported like a database through semantic queries.
Heinrich Magnus Manske is a German biochemist who is a leading researcher on malaria. He is a senior staff scientist at the Wellcome Sanger Institute in Cambridge, UK and a software developer of one of the first versions of the MediaWiki software, which powers Wikipedia and a number of other wiki-based websites.
The Wikimedia movement is the global community of contributors to the Wikimedia projects, including Wikipedia. This community directly builds and administers these projects with the commitment of achieving this using open standards and software.
Semantic MediaWiki (SMW) is an extension to MediaWiki that allows for annotating semantic data within wiki pages, thus turning a wiki that incorporates the extension into a semantic wiki. Data that has been encoded can be used in semantic searches, used for aggregation of pages, displayed in formats like maps, calendars and graphs, and exported to the outside world via formats like RDF and CSV.
DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web using OpenLink Virtuoso. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
The history of wikis began in 1994, when Ward Cunningham gave the name "WikiWikiWeb" to the knowledge base, which ran on his company's website at c2.com, and the wiki software that powered it. The wiki went public in March 1995, the date used in anniversary celebrations of the wiki's origins. c2.com is thus the first true wiki, or a website with pages and links that can be easily edited via the browser, with a reliable version history for each page. He chose "WikiWikiWeb" as the name based on his memories of the "Wiki Wiki Shuttle" at Honolulu International Airport, and because "wiki" is the Hawaiian word for "quick".
Wikimedia Commons, or simply Commons, is a wiki-based media repository of free-to-use images, sounds, videos and other media. It is a project of the Wikimedia Foundation.
Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members. It was an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions. Freebase aimed to create a global resource that allowed people to access common information more effectively. It was developed by the American software company Metaweb and run publicly beginning in March 2007. Metaweb was acquired by Google in a private sale announced on 16 July 2010. Google's Knowledge Graph is powered in part by Freebase.
YAGO is an open source knowledge base developed at the Max Planck Institute for Informatics in Saarbrücken. It is automatically extracted from Wikidata and Schema.org.
A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are perpetually stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data.
Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
Wikibase is a set of MediaWiki extensions for working with versioned semi-structured data in a central repository. It is based upon JSON rather than the unstructured wikitext normally used in MediaWiki. Its primary components are the Wikibase Repository, an extension for storing and managing data, and the Wikibase Client, which allows for the retrieval and embedding of structured data from a Wikibase repository. It was developed by Wikimedia Deutschland for use by Wikidata.
Abstract Wikipedia is an in-development project of the Wikimedia Foundation. It aims to use Wikifunctions to create a language-independent version of Wikipedia using its structured data. First conceived in 2020, Abstract Wikipedia has been under active development ever since, with the related project of Wikifunctions launched in 2023. Nevertheless, the project has proved controversial. As envisioned, Abstract Wikipedia would consist of "Constructors", "Content", and "Renderers".
Zdenko "Denny" Vrandečić is a Croatian computer scientist. He was a co-developer of Semantic MediaWiki and Wikidata, the lead developer of the Wikifunctions project, and an employee of the Wikimedia Foundation as a Head of Special Projects, Structured Content. He published modules for the German role-playing game The Dark Eye.