Google Dataset Search

Last updated August 15, 2023

Google Dataset Search is a search engine from Google that helps researchers locate online data that is freely available for use.^[1] The company launched the service on September 5, 2018, and stated that the product was targeted at scientists and data journalists. The service was out of beta as of January 23, 2020.^[2]

Features

Dataset Search can filter results based on the desired type of data (for example, focusing on images or text). It is also available on mobile devices.^[4]

Technology

Dataset Search relies heavily on dataset providers publishing metadata that follows the standards defined by the schema.org consortium.^[5] According to the Google AI blog,

When Google's search engine processes a Web page with schema.org/Dataset mark-up, it understands that there is dataset metadata there and processes that structured metadata to create "records" describing each annotated dataset on a page. The use of schema.org allows developers to embed this structured information into HTML, without affecting the appearance of the page while making the semantics of the information visible to all search engines.^[6]
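
As an illustration of the mechanism described in the quotation, the following minimal Python sketch embeds schema.org/Dataset metadata in a page as JSON-LD (one of the formats in which schema.org mark-up can be written) and then reads it back into "records". It is not Google's implementation. The page content, the dataset name and URL, and the helper names `JsonLdExtractor` and `extract_dataset_records` are all assumptions made for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the dataset name, description, and URL are invented.
# The JSON-LD block sits in <head>, so it does not change how the page looks.
PAGE = """<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example rainfall measurements",
  "description": "Daily rainfall totals for an imaginary weather station.",
  "url": "https://example.org/datasets/rainfall"
}
</script>
</head><body><h1>Rainfall data</h1></body></html>"""


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_dataset(type_value):
    """True if an @type value (string or list of strings) names Dataset."""
    types = type_value if isinstance(type_value, list) else [type_value]
    return "Dataset" in types


def extract_dataset_records(html):
    """Return one dict per schema.org Dataset annotation found in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    records = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed mark-up instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and _is_dataset(item.get("@type")):
                records.append(item)
    return records


if __name__ == "__main__":
    for record in extract_dataset_records(PAGE):
        print(record["name"], "-", record["url"])
```

A page without any Dataset mark-up simply yields an empty list, which mirrors the point of the quotation: a dataset only becomes discoverable this way when its publisher supplies the structured metadata.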

Versions

Dataset Search was initially released in beta on September 5, 2018.^[7] It moved out of beta on January 23, 2020.^[8]

References

[1] Castelvecchi, Davide (5 September 2018). "Google unveils search engine for open data". Nature. 561 (7722): 161–162. Bibcode:2018Natur.561..161C. doi:10.1038/d41586-018-06201-x. ISSN 0028-0836. PMID 30206390. S2CID 52190512.
[2] Noy, Natasha (23 January 2020). "Discovering millions of datasets on the web". The Keyword. Retrieved 18 June 2020.
[3] "Google launches new search engine to help scientists find the datasets they need". The Verge. Retrieved 7 September 2018.
[4] Noy, Natasha (23 January 2020). "Discovering millions of datasets on the web". The Keyword. Retrieved 18 June 2020.
[5] Google, Vincent. "FAQ - Structured data markup for datasets". Search Console Help. Google Inc. Retrieved 20 June 2020.
[6] Burgess, Matthew; Noy, Natasha. "Building Google Dataset Search and Fostering an Open Data Ecosystem". Google AI blog. Retrieved 20 June 2020.
[7] Noy, Natasha (5 September 2018). "Making it easier to discover datasets". The Keyword. Retrieved 27 June 2020.
[8] Noy, Natasha (23 January 2020). "Discovering millions of datasets on the web". The Keyword. Retrieved 27 June 2020.

External links

Official website

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
