WARC (file format)

The WARC (Web ARChive) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC_IA File Format, which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events (see §7.6 "revisit"), and later-date transformations. The WARC format is inspired by HTTP/1.0 streams, with a similar header and the use of CRLFs as delimiters, which makes it well suited to crawler implementations. First specified in 2008, WARC is now recognised by most national library systems as the standard to follow for web archiving.
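The HTTP-like layout described above can be illustrated with a minimal Python sketch. It assumes the WARC/1.0 record layout: a version line, CRLF-terminated named fields, a blank line, then a content block of Content-Length bytes followed by two CRLFs. The helper names build_record and read_record, the example URI, the date, and the record ID are hypothetical. The parser ignores header line folding and compression, so it is not a substitute for a full WARC library.

```python
import io

CRLF = b"\r\n"


def build_record(record_type: str, target_uri: str, payload: bytes) -> bytes:
    """Serialize one minimal WARC/1.0 record (field values are illustrative)."""
    header_lines = [
        b"WARC/1.0",  # version line, analogous to an HTTP status line
        b"WARC-Type: " + record_type.encode("ascii"),
        b"WARC-Target-URI: " + target_uri.encode("utf-8"),
        b"WARC-Date: 2020-01-01T00:00:00Z",  # placeholder timestamp
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>",
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # Header lines, a blank line, the content block, then two CRLFs.
    return CRLF.join(header_lines) + CRLF + CRLF + payload + CRLF + CRLF


def read_record(stream):
    """Read one record from a binary stream; return None at end of input."""
    version = stream.readline().rstrip(b"\r\n").decode("ascii")
    if not version:
        return None
    headers = {}
    while True:
        line = stream.readline()
        if line in (CRLF, b""):  # a blank line ends the header
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    block = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the two CRLFs that terminate the record
    return version, headers, block


if __name__ == "__main__":
    archive = build_record("resource", "http://example.com/", b"hello")
    archive += build_record("metadata", "http://example.com/", b"note")
    stream = io.BytesIO(archive)
    while (record := read_record(stream)) is not None:
        version, headers, block = record
        print(version, headers["WARC-Type"], len(block))
```

Because each record declares its own Content-Length, a reader can move from one record to the next without parsing the content block, which is what makes simple concatenation of records into one archive file workable.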


Software

* Heritrix web archiver in Java
* wget (since version 1.14)
* Conifer, formerly Webrecorder
* StormCrawler
* Apache Nutch
* libarchive


See also

ZIM (file format), an open file format that stores wiki content for offline use



External links


WARC File Format specifications

The WARC File Format (ISO 28500) - Information, Maintenance, Drafts

WARC, Web ARChive file format

WARC implementation guidelines

Welcome



The WARC Ecosystem