The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations. [1] [2]
The protocol is usually just referred to as the OAI Protocol.
OAI-PMH uses XML over HTTP. Version 2.0 of the protocol was released in 2002; the document was last updated in 2015. It is published under a Creative Commons BY-SA license.
In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability problems among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. [3] A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention". [1] [2] [4]
Several workshops were held in 2000 at the ACM Digital Libraries conference, [5] at the 1st ACM/IEEE-CS joint conference on Digital libraries [6] [7] and elsewhere to share the ideas from the Santa Fe Convention. [8] The workshops showed that the problems faced by the e-print community were shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information [9] and the Digital Library Federation [10] provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 aimed at improving the interface developed at the Santa Fe Convention. [11] The specifications were then refined over e-mail.
OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington D.C., [12] and another in February in Berlin, Germany. [13] Subsequent modifications to the XML standard by the W3C required making minor modifications to OAI-PMH resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible. [14]
Since 2001 CERN, later in collaboration with the University of Geneva, has organized bi-annual OAI workshops, [15] which over time have developed to cover most aspects of open science. Since 2021 the workshop series has been named the Geneva Workshop on Innovations in Scholarly Communication, with the nickname OAI reflecting its origin. [16]
Some commercial search engines use OAI-PMH to acquire more resources. Google initially included support for OAI-PMH when launching sitemaps, but decided in May 2008 to support only the standard XML Sitemaps format. [17] In 2004, Yahoo! acquired content from OAIster (University of Michigan) that was obtained through metadata harvesting with OAI-PMH. Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia and related site updates for search engines and other bulk analysis/republishing endeavors. [18] When thousands of files are harvested every day, incremental harvesting with OAI-PMH can reduce network traffic and other resource usage. [19] NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day. [20]
The mod_oai project uses OAI-PMH to expose content that is accessible from Apache web servers to web crawlers.
OAI-PMH was later applied to the sharing of scientific data. [21]
OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.
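The selective harvesting described above can be sketched in Python using only the standard library. This is a minimal illustration, not a complete harvester: the base URL `https://example.org/oai`, the set name `physics`, the sample response, and the helper names `list_records_url` and `parse_headers` are all assumptions made for the example. The request arguments (`verb`, `metadataPrefix`, `from`, `until`, `set`) and the `oai_dc` prefix for Dublin Core are the ones defined by OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL (hypothetical helper).

    from_date/until_date restrict the datestamp range; set_spec
    restricts the request to a named set defined by the repository.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_headers(xml_text):
    """Return ([(identifier, datestamp), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    headers = [
        (h.findtext("oai:identifier", namespaces=OAI_NS),
         h.findtext("oai:datestamp", namespaces=OAI_NS))
        for h in root.iterfind(".//oai:header", OAI_NS)
    ]
    # A non-empty resumptionToken means the list is incomplete; the next
    # request carries only the verb and that token.
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return headers, (token or None)


# Made-up response fragment, used here instead of a live HTTP request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2024-01-02</datestamp>
    </header></record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai",
                       from_date="2024-01-01", set_spec="physics")
headers, token = parse_headers(SAMPLE_RESPONSE)
print(url)
print(headers, token)
```

A harvester that remembers the date of its last successful run and passes it as `from_date` on the next run only receives records created, changed, or deleted since then, which is what makes incremental harvesting cheap.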
A number of software systems support OAI-PMH, including Fedora, EThOS from the British Library, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Digibib from Digibis, MyCoRe, Koha, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, ArchivalWare from PTFS, DOOR [22] from the eLab [23] in Lugano, Switzerland, panFMP from the PANGAEA data library, [24] SimpleDL from Roaring Development, and jOAI from the National Center for Atmospheric Research. [25]
A number of large archives support the protocol including arXiv and the CERN Document Server.
The Open Archives Initiative (OAI) was an informal organization, centred on the colleagues Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson and Simeon Warner, that developed and applied technical interoperability standards for archives to share catalogue information (metadata). The group came together in the late 1990s and was active for around twenty years. OAI coordinated three specification activities in particular: OAI-PMH, OAI-ORE and ResourceSync. Throughout, the group worked towards building a "low-barrier interoperability framework" for archives containing digital content, to allow people to harvest metadata. Such sets of metadata have since been harvested to provide "value-added services", often by combining different data sets.
CiteSeerX is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science.
Z39.50 is an international standard client–server, application layer communications protocol for searching and retrieving information from a database over a TCP/IP computer network, developed and maintained by the Library of Congress. It is covered by ANSI/NISO standard Z39.50, and ISO standard 23950.
An institutional repository (IR) is an archive for collecting, preserving, and disseminating digital copies of the intellectual output of an institution, particularly a research institution. Academics also use their IRs to archive published works in order to increase their visibility and their collaboration with other academics. However, most of the outputs produced by universities are not effectively accessed and shared by researchers and other stakeholders. As a result, academics should be involved in the implementation and development of an IR project so that they can learn the benefits and purpose of building an IR.
Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
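As an illustration of the per-URL information a sitemap carries, the following Python sketch builds a small sitemap document with the standard library. The function name `build_sitemap`, the example URL, and the values are assumptions made for the example; the element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`) and the namespace are those of the Sitemaps 0.9 schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemaps 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize a list of dicts into a sitemap XML string (hypothetical helper).

    Each entry needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional and are written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {"loc": "https://example.org/", "lastmod": "2024-01-02",
     "changefreq": "weekly", "priority": 0.8},
])
print(sitemap_xml)
```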
Fedora is a digital asset management (DAM) content repository architecture upon which institutional repositories, digital archives, and digital library systems might be built. Fedora is the underlying architecture for a digital repository, and is not a complete management, indexing, discovery, and delivery application. It is a modular architecture built on the principle that interoperability and extensibility are best achieved by the integration of data, interfaces, and mechanisms as clearly defined modules.
mod_oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community. mod_oai also allows harvesters to obtain "archive-ready" resources from a web server.
ScientificCommons was a project of the University of St. Gallen Institute for Media and Communications Management. The major aim of the project was to develop the world’s largest archive of scientific knowledge with fulltexts freely accessible to the public. The project was closed down in 2014.
BASE is a multi-disciplinary search engine for scholarly internet resources, created by Bielefeld University Library in Bielefeld, Germany. It is based on free and open-source software such as Apache Solr and VuFind. It harvests OAI metadata from institutional repositories and other academic digital libraries that implement the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and then normalizes and indexes the data for searching. In addition to OAI metadata, the library indexes selected web sites and local data collections, all of which can be searched via a single search interface.
EPrints is a free and open-source software package for building open access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It shares many of the features commonly seen in document management systems, but is primarily used for institutional repositories and scientific journals. EPrints has been developed at the University of Southampton School of Electronics and Computer Science and released under the GPL-3.0-or-later license.
The Redalyc project is a bibliographic database and a digital library of Open Access journals, supported by the Universidad Autónoma del Estado de México with the help of numerous other higher education institutions and information systems.
PREservation Metadata: Implementation Strategies (PREMIS) is the de facto digital preservation metadata standard.
The Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards for the description and exchange of aggregations of web resources. The OAI-ORE specification implements the ORE Model which introduces the resource map (ReM) that makes it possible to associate an identity with aggregations of resources and make assertions about their structure and semantics.
A resource map (ReM) is a concept of the ORE Model for associating an identity with compound digital objects and making assertions about their structure and semantics. Compound objects combine distributed resources, including multiple media types.
Herbert Van de Sompel is a Belgian librarian, computer scientist, and musician, most known for his role in the development of the Open Archives Initiative (OAI) and standards such as OpenURL, Object Reuse and Exchange, and the OAI Protocol for Metadata Harvesting.
Invenio is an open source software framework for large-scale digital repositories that provides the tools for management of digital assets in an institutional repository and research data management systems. The software is typically used for open access repositories for scholarly and/or published digital content and as a digital library.
The OpenSIGLE repository provides open access to the bibliographic records of the former SIGLE database. The creation of the OpenSIGLE archive was decided by some major European STI centres, members of the former European network EAGLE for the collection and dissemination of grey literature. OpenSIGLE was developed by the French INIST-CNRS, with assistance from the German FIZ Karlsruhe and the Dutch Grey Literature Network Service (GreyNet). OpenSIGLE is hosted on an INIST-CNRS server at Nancy. Part of the open access movement, OpenSIGLE is referenced by the international Directory of Open Access Repositories.
An open repository or open-access repository is a digital platform that holds research output and provides free, immediate and permanent access to research results for anyone to use, download and distribute. To facilitate open access, such repositories must be interoperable according to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Search engines harvest the content of open access repositories, constructing a database of research that is available worldwide free of charge.
OPUS is an open-source software package under the GNU General Public License used for creating Open Access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting. It provides tools for creating collections of digital resources, as well as for their storage and dissemination. It is usually used at universities, libraries and research institutes as a platform for institutional repositories.
MyCoRe is an open source repository software framework for building disciplinary or institutional repositories, digital archives, digital libraries, and scientific journals. The software is developed at various German university libraries and computer centers. Although most MyCoRe web applications are located in Germany, there are English-language applications, such as "The International Treasury of Islamic Manuscripts" at the University of Cambridge (UK).