BagIt

BagIt is a set of hierarchical file system conventions designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" (the arbitrary content) and "tags," which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name "BagIt" is inspired by the "enclose and deposit" method, sometimes referred to as "bag it and tag it."

Bags are ideal for digital content normally kept as a collection of files. They are also well suited to the export, for archival purposes, of content normally kept in database structures that receiving parties are unlikely to support. Relying on cross-platform (Windows and Unix) filesystem naming conventions, a bag's payload may include any number of directories and sub-directories (folders and sub-folders). A bag can specify payload content indirectly via a "fetch.txt" file that lists URLs for content that can be fetched over the network to complete the bag; simple parallelization (e.g. running 10 instances of Wget) can exploit this feature to transfer large bags very quickly.

Benefits of bags include:
* Wide adoption in digital libraries (e.g. the Library of Congress).
* Easy to implement using ubiquitous and ordinary filesystem tools.
* Content that originates as files need only be copied to the payload directory.
* Compared to XML wrapping, content need not be encoded (e.g. Base64), which saves time and storage space.
* Received content is ready-to-go in a familiar filesystem tree.
* Easy to implement fast network transfer by running ordinary transfer tools in parallel.
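
The "fetch.txt" mechanism and the parallel-transfer idea described above can be sketched in Python. This is a minimal sketch rather than a reference implementation. It assumes each "fetch.txt" line has the form URL, LENGTH, FILENAME separated by whitespace, with "-" standing for an unknown length, which is the layout given in RFC 8493. The names FetchEntry, parse_fetch, fetch_all and the download callback are hypothetical, and percent-encoded characters in filenames are not decoded.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    """One line of fetch.txt (hypothetical helper type)."""
    url: str
    length: Optional[int]  # None when fetch.txt gives "-" (length unknown)
    path: str              # destination inside the bag, e.g. data/...


def parse_fetch(text: str) -> List[FetchEntry]:
    """Parse fetch.txt content, assuming 'URL LENGTH FILENAME' per line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on whitespace at most twice; the filename is the remainder.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append(FetchEntry(url, size, path.strip()))
    return entries


def fetch_all(entries: List[FetchEntry],
              download: Callable[[FetchEntry], None],
              workers: int = 10) -> int:
    """Run a caller-supplied download function over all entries in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every task and re-raises any worker exception.
        list(pool.map(download, entries))
    return len(entries)


if __name__ == "__main__":
    sample = (
        "http://example.org/a.png 1024 data/a.png\n"
        "http://example.org/b.txt - data/b.txt\n"
    )
    for entry in parse_fetch(sample):
        print(entry)
```

The download step is passed in as a function so that any ordinary transfer tool or library can be plugged in; the default of 10 workers simply mirrors the "10 instances" figure mentioned above.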


Specification

BagIt is currently defined in RFC 8493. The specification defines a simple file naming convention used by the digital curation community for packaging up arbitrary digital content, so that it can be reliably transported via both physical media (hard disk drive, CD-ROM, DVD) and network transfers (FTP, HTTP, rsync, etc.). BagIt is also used for managing the digital preservation of content over time. Discussion about the specification and its future directions takes place on the Digital Curation discussion list.

The BagIt specification is organized around the notion of a "bag." A bag is a named file system directory that minimally contains:
* a "data" directory that includes the payload, or data files that comprise the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported.
* at least one manifest file that itemizes the filenames present in the "data" directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance, a manifest file with MD5 checksums is named "manifest-md5.txt."
* a "bagit.txt" file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files.

On receipt of a bag, a piece of software can examine the manifest file to make sure that the payload files are present and that their checksums are correct. This allows accidentally removed or corrupted files to be identified.

Below is an example of a minimal bag "myfirstbag" that encloses two files of payload. The contents of the tag files are included below their filenames.
myfirstbag/
|-- data
|   \-- 27613-h
|       \-- images
|           \-- q172.png
|           \-- q172.txt
|-- manifest-md5.txt
|     49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
|     408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt
\-- bagit.txt
      BagIt-Version: 0.97
      Tag-File-Character-Encoding: UTF-8
In this example the payload happens to consist of a Portable Network Graphics image file and an Optical Character Recognition text file. In general the identification and definition of file formats is out of the scope of the BagIt specification; file attributes are likewise out of scope.

The specification allows for several optional tag files (in addition to the manifest). Their character encoding must be identified in "bagit.txt," which itself must always be encoded in UTF-8. The specification defines the following optional tag files:
* a "bag-info.txt" file which details metadata for the bag, using colon-separated key/value pairs (similar to HTTP headers)
* a tag manifest file which lists tag files and their associated checksums (e.g. "tagmanifest-md5.txt")
* a "fetch.txt" file that lists URLs from which payload files can be retrieved, in addition to or in place of payload files in the "data" directory

Until version 15, the draft also described how to serialize a bag in an archive file, such as ZIP or TAR. From version 15 on, serialization is no longer part of the specification, not for technical reasons but because of the scope and focus of the specification.
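
The receipt check described in this section (every payload file listed in the manifest is present and its checksum matches) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the behavior of any particular BagIt tool: it assumes each manifest line is a checksum followed by whitespace and a path relative to the bag directory, as in the example above. The function names read_manifest and verify_bag are hypothetical, and each file is read fully into memory for simplicity.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def read_manifest(manifest: Path) -> Dict[str, str]:
    """Map each payload path listed in the manifest to its expected checksum."""
    expected = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            # Assumed line layout: "<checksum> <path relative to the bag>"
            checksum, path = line.split(None, 1)
            expected[path.strip()] = checksum.lower()
    return expected


def verify_bag(bag_dir: Path, algorithm: str = "md5") -> List[str]:
    """Return a list of problems; an empty list means the payload checks out."""
    bag_dir = Path(bag_dir)
    # The algorithm name is part of the manifest filename, e.g. manifest-md5.txt.
    expected = read_manifest(bag_dir / f"manifest-{algorithm}.txt")
    problems = []
    for rel_path, checksum in expected.items():
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if digest != checksum:
            problems.append(f"checksum mismatch: {rel_path}")
    # Also report payload files on disk that the manifest does not list.
    for found in (bag_dir / "data").rglob("*"):
        rel = found.relative_to(bag_dir).as_posix()
        if found.is_file() and rel not in expected:
            problems.append(f"not in manifest: {rel}")
    return problems


if __name__ == "__main__":
    # "myfirstbag" is the example bag name used above; adjust the path as needed.
    for problem in verify_bag(Path("myfirstbag")):
        print(problem)
```

Returning a list of problems rather than stopping at the first failure lets a receiver see all removed or corrupted files in one pass. Tag files and "fetch.txt" entries are deliberately not checked here.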


History

The BagIt specification emerged from a collaboration between the Library of Congress and the California Digital Library while transferring digital content created as part of the National Digital Information Infrastructure and Preservation Program. The origins of the idea date back to work done at the University of Tsukuba on the "enclose and deposit" model, for mutually depositing archived resources to enable long-term digital preservation. Using manifests and checksums is a fairly common practice, as evidenced by their use in the ZIP file format, the deb package format, and on public FTP sites.

In 2007, the California Digital Library needed to transfer several terabytes of content (largely Web archiving data) to the Library of Congress. The BagIt specification allowed the content to be packaged up in "bags" with package metadata and a manifest that detailed file checksums, which were later verified on receipt of the bags. The specification was written up as an IETF draft by John Kunze in December 2008 and went through several revisions before being issued as an RFC. In 2009, the Library of Congress produced a video that describes the specification and the use cases around it. In 2018, version 1.0 was published as an RFC by the Internet Engineering Task Force.


See also

* Metadata Encoding and Transmission Standard (METS)


External links

* The canonical BagIt specification
* BagIt on GitHub: the latest working copy of the specification, with source files for publishing to IETF.
* Digital Curation Google Group: where most discussion about use of the specification and its continued development takes place.
* BagIt specification from the California Digital Library: CDL has found that it helps to have local documentation about the BagIt specification for development purposes.
* BagIt specification from the Library of Congress: similarly, the Library of Congress has made a snapshot of the specification available.