OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations. The protocol is usually just referred to as the OAI Protocol. OAI-PMH uses XML over HTTP. Version 2.0 of the protocol was released in 2002; the document was last updated in 2015. It is published under a Creative Commons BY-SA license.
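To illustrate the XML-over-HTTP exchange and the mandatory Dublin Core representation described above, the following is a minimal sketch of how a client might read one response. The sample response is hand-written for illustration: the repository address, identifier, title, and creator are made up, and `parse_records` is a hypothetical helper name. Only the XML namespaces are taken from the OAI-PMH 2.0 and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (assumption: all values below are invented).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2020-01-01T00:00:00Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2019-12-31</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per <record>: header fields plus Dublin Core elements."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        header = record.find("oai:header", NS)
        entry = {
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "dc": {},
        }
        dc_root = record.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:  # a record may arrive without a metadata part
            for element in dc_root:
                name = element.tag.split("}", 1)[1]  # drop the namespace part
                # Dublin Core elements are repeatable, so collect lists.
                entry["dc"].setdefault(name, []).append(element.text)
        records.append(entry)
    return records


for rec in parse_records(SAMPLE_RESPONSE):
    print(rec["identifier"], rec["dc"].get("title"))
```

Each Dublin Core element is stored as a list because the fifteen elements may repeat within a record; a real harvester would also need to handle error responses, which this sketch ignores.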


History

In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability problems among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention".

Several workshops were held in 2000, at the ACM Digital Libraries conference, at the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, and elsewhere, to share the ideas from the Santa Fe Convention. It was discovered at the workshops that the problems faced by the e-print community were also shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information and the Digital Library Federation provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 to improve the interface developed at the Santa Fe Convention. The specifications were refined over e-mail.

OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington, D.C., and at another in February in Berlin, Germany. Subsequent modifications to the XML standard by the W3C required minor modifications to OAI-PMH, resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible.

Since 2001, CERN, later in collaboration with the University of Geneva, has organized bi-annual OAI workshops, which over time have developed to cover most aspects of open science.


Uses

Some commercial search engines use OAI-PMH to acquire more resources. Google initially included support for OAI-PMH when launching sitemaps, but decided in May 2008 to support only the standard XML Sitemaps format. In 2004, Yahoo! acquired content from OAIster (University of Michigan) that was obtained through metadata harvesting with OAI-PMH. Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia and related site updates for search engines and other bulk analysis/republishing endeavors. Especially when thousands of files are harvested every day, OAI-PMH can help reduce network traffic and other resource usage through incremental harvesting. NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day. The mod_oai project uses OAI-PMH to expose content that is accessible from Apache Web servers to web crawlers. OAI-PMH was later applied to the sharing of scientific data.
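The incremental harvesting mentioned above can be sketched roughly as follows: the harvester remembers when it last ran and asks only for records whose datestamp is on or after that point, following resumption tokens when the repository splits the answer into several pages. This is a minimal illustration, not a complete harvester. The function names and the base URL are hypothetical; the `ListIdentifiers` verb and the `from`, `metadataPrefix`, and `resumptionToken` arguments come from OAI-PMH 2.0. Error responses and retries are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 namespace


def fetch_page(url):
    """Download one response document (plain HTTP GET)."""
    with urlopen(url) as response:
        return response.read()


def harvest_identifiers(base_url, since, fetch=fetch_page):
    """Yield (identifier, datestamp) for records changed on or after `since`.

    `since` is a UTC datestamp string such as "2024-01-01". `fetch` can be
    swapped out, for example to test without network access.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc",
              "from": since}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI + "header"):
            yield (header.findtext(OAI + "identifier"),
                   header.findtext(OAI + "datestamp"))
        token = root.findtext(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            break
        # A resumptionToken is sent on its own, alongside the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
```

A caller would store the time of each successful run and pass it as `since` on the next one, for example `harvest_identifiers("http://example.org/oai", "2024-01-01")`, so that unchanged records are never transferred again.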


Software

OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.

A number of software systems support the OAI-PMH, including Fedora, EThOS from the British Library, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Digibib from Digibis, MyCoRe, Koha, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, ArchivalWare from PTFS, DOOR from the eLab in Lugano, Switzerland, panFMP from PANGAEA, SimpleDL from Roaring Development, and jOAI from the National Center for Atmospheric Research.
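The harvester requests described in this section, selected by a datestamp range and optionally restricted to a named set, can be sketched as a small URL builder. The function name, the base URL, and the set name `physics` are hypothetical; the `ListRecords` verb and the `from`, `until`, `set`, and `metadataPrefix` arguments are those defined by OAI-PMH 2.0, with `oai_dc` selecting the mandatory Dublin Core format. Day-level datestamps are assumed here; repositories may also support finer granularity.

```python
from datetime import date
from urllib.parse import urlencode


def list_records_url(base_url, start=None, end=None, set_spec=None,
                     metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for selective harvesting.

    `start` and `end` are datetime.date objects bounding the datestamp
    range (both inclusive); `set_spec` names a set defined by the provider.
    """
    if start and end and start > end:
        raise ValueError("'from' datestamp must not be later than 'until'")
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if start:
        params["from"] = start.isoformat()   # YYYY-MM-DD
    if end:
        params["until"] = end.isoformat()
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository address and set name, for illustration only.
print(list_records_url("http://example.org/oai",
                       start=date(2024, 1, 1),
                       end=date(2024, 1, 31),
                       set_spec="physics"))
```

Leaving out `start`, `end`, and `set_spec` yields a request for every record the repository holds in the chosen format, which is how a first full harvest would typically begin.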


Archives

A number of large archives support the protocol, including arXiv and the CERN Document Server.


See also

* Data format management
* Digital curation
* Digital preservation
* File format
* Dublin Core, an ISO metadata standard
* National Digital Information Infrastructure and Preservation Program (NDIIPP)
* National Digital Library Program (NDLP)
* Metadata Encoding and Transmission Standard (METS) maintained by the Library of Congress
* Preservation Metadata: Implementation Strategies (PREMIS)
* LOCKSS
* Search as a service
* Web archiving
* Object Reuse and Exchange (OAI-ORE)


External links


* Suleyman Demirel University Open Archives Harvester
* http://www.digitalpreservation.gov/ Library of Congress, National Digital Information Infrastructure and Preservation Program
* Library of Congress, Web Capture