Open Archives Initiative Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations. The protocol is usually just referred to as the OAI Protocol. OAI-PMH uses XML over HTTP. Version 2.0 of the protocol was released in 2002; the document was last updated in 2015. It has a Creative Commons BY-SA license.
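As a rough illustration of what Dublin Core metadata carried as XML looks like from a harvester's side, the sketch below parses a hand-written response fragment with Python's standard library. The sample record, its identifier, and the `repository.example.org` host are invented for illustration; only the namespace URIs are those defined for OAI-PMH 2.0, the `oai_dc` format, and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OAI-PMH 2.0 responses and the oai_dc metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response fragment (not taken from a real repository).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2020-01-01T00:00:00Z</responseDate>
  <request verb="GetRecord">https://repository.example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2019-12-31</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract identifier, datestamp, titles and creators from each record."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        header = record.find("oai:header", NS)
        # Records flagged as deleted carry no metadata element, so dc may be None.
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [e.text for e in dc.findall("dc:title", NS)] if dc is not None else [],
            "creators": [e.text for e in dc.findall("dc:creator", NS)] if dc is not None else [],
        })
    return records


if __name__ == "__main__":
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec)
```

Every Dublin Core element is repeatable, which is why the sketch collects titles and creators into lists rather than single values.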


History

In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability difficulties among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention".

Several workshops were held in 2000 at the ACM Digital Libraries conference, at the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, and elsewhere to share the ideas from the Santa Fe Convention. The workshops revealed that the problems faced by the e-print community were shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information and the Digital Library Federation provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 aimed at improving the interface developed at the Santa Fe Convention. The specifications were then refined over e-mail.

OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington, D.C., and at another in February in Berlin, Germany. Subsequent modifications to the XML standard by the W3C required minor modifications to OAI-PMH, resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible. Since 2001, CERN, later in collaboration with the University of Geneva, has organized bi-annual OAI workshops, which over time have come to cover most aspects of open science.


Uses

Some commercial search engines use OAI-PMH to acquire more resources. Google initially included support for OAI-PMH when launching sitemaps, but decided in May 2008 to support only the standard XML Sitemaps format. In 2004, Yahoo! acquired content from OAIster (University of Michigan) that had been obtained through metadata harvesting with OAI-PMH. Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia and related site updates for search engines and other bulk analysis/republishing endeavors.

When thousands of files are harvested every day, OAI-PMH can reduce network traffic and other resource usage through incremental harvesting. NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day. The mod_oai project uses OAI-PMH to expose content that is accessible from Apache web servers to web crawlers. OAI-PMH has since also been applied to the sharing of scientific data.
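The idea behind incremental harvesting can be sketched without any network access. In the example below, a plain dictionary stands in for a repository and `list_identifiers` mimics a datestamp-restricted request; both names, the `state` dictionary, and the sample identifiers are hypothetical and exist only for this illustration. A real harvester would issue HTTP requests instead.

```python
# Hypothetical in-memory stand-in for a repository: identifier -> datestamp.
REPOSITORY = {
    "oai:example:1": "2020-01-05",
    "oai:example:2": "2020-02-10",
    "oai:example:3": "2020-03-15",
}


def list_identifiers(repository, from_date=None):
    """Mimic a selective request: return records with datestamp >= from_date.

    The lower bound is inclusive, as it is for the protocol's `from` argument.
    ISO 8601 dates of equal granularity compare correctly as plain strings.
    """
    return sorted(
        (identifier, stamp)
        for identifier, stamp in repository.items()
        if from_date is None or stamp >= from_date
    )


def harvest(repository, state):
    """Fetch only what changed since the last run and remember the newest datestamp."""
    changed = list_identifiers(repository, state.get("last_datestamp"))
    if changed:
        state["last_datestamp"] = max(stamp for _, stamp in changed)
    return changed


if __name__ == "__main__":
    repo = dict(REPOSITORY)
    state = {}
    print(len(harvest(repo, state)), "records on the first (full) harvest")
    repo["oai:example:4"] = "2020-04-01"  # a new record appears later
    print(len(harvest(repo, state)), "records on the second (incremental) harvest")
```

Because the lower bound is inclusive, the second run sees the most recent record from the first run again, so a harvester built this way would need to de-duplicate by identifier.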


Software

OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.

A number of software systems support OAI-PMH, including Fedora, EThOS from the British Library, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Digibib from Digibis, MyCoRe, Koha, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, ArchivalWare from PTFS, DOOR from the eLab in Lugano, Switzerland, panFMP from PANGAEA (data library), SimpleDL from Roaring Development, and jOAI from the National Center for Atmospheric Research.
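The datestamp-range and set restrictions described in this section correspond to the `from`, `until`, and `set` arguments of a `ListRecords` request, with `metadataPrefix=oai_dc` selecting Dublin Core. The following is a minimal sketch of how a harvester might assemble such a request URL; the helper name `build_list_records_url`, the base URL, and the set name `physics` are assumptions made for illustration and do not refer to any particular repository.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Assemble a ListRecords request URL with optional selective-harvesting arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date      # lower bound of the datestamp range
    if until_date is not None:
        params["until"] = until_date    # upper bound of the datestamp range
    if set_spec is not None:
        params["set"] = set_spec        # restrict to a set defined by the provider
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical repository endpoint and set name.
    print(build_list_records_url(
        "https://repository.example.org/oai",
        from_date="2020-01-01",
        until_date="2020-01-31",
        set_spec="physics",
    ))
```

Leaving out all three optional arguments produces a request for every record the repository exposes in the chosen metadata format.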


Archives

A number of large archives support the protocol, including arXiv and the CERN Document Server.


See also

* Data format management
* Digital curation
* Digital preservation
* File format
* Dublin Core, an ISO metadata standard
* National Digital Information Infrastructure and Preservation Program (NDIIPP)
* National Digital Library Program (NDLP)
* Metadata Encoding and Transmission Standard (METS), maintained by the Library of Congress
* Preservation Metadata: Implementation Strategies (PREMIS)
* LOCKSS
* Search as a service
* Web archiving
* Object Reuse and Exchange (OAI-ORE)


External links


* Suleyman Demirel University Open Archives Harvester
* Library of Congress, National Digital Information Infrastructure and Preservation Program: http://www.digitalpreservation.gov/
* Library of Congress, Web Capture