Web Archiving

Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web. The largest web archiving organization based on a bulk crawling approach is the Wayback Machine, which strives to maintain an archive of the entire Web. The growing portion of human culture created and recorded on the web makes it inevitable that more and more libraries and archives will have to face the challenges of web archiving. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content. Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.


History and development

While curation and organization of the web has been prevalent since the mid- to late-1990s, one of the first large-scale web archiving projects was the Internet Archive, a non-profit organization created by Brewster Kahle in 1996. The Internet Archive released its own search engine for viewing archived web content, the Wayback Machine, in 2001. As of 2018, the Internet Archive was home to 40 petabytes of data. The Internet Archive also developed many of its own tools for collecting and storing its data, including PetaBox for storing the large amounts of data efficiently and safely, and Heritrix, a web crawler developed in conjunction with the Nordic national libraries. Other projects launched around the same time included a web archiving project by the National Library of Canada, Australia's Pandora, the Tasmanian web archives and Sweden's Kulturarw3. From 2001 the International Web Archiving Workshop (IWAW) provided a platform to share experiences and exchange ideas. The International Internet Preservation Consortium (IIPC), established in 2003, has facilitated international collaboration in developing standards and open source tools for the creation of web archives. The now-defunct Internet Memory Foundation was founded in 2004 by the European Commission in order to archive the web in Europe. The project developed and released many open source tools, such as those for "rich media capturing, temporal coherence analysis, spam assessment, and terminology evolution detection." The data from the foundation is now housed by the Internet Archive, but it is not currently publicly accessible. Despite the fact that there is no centralized responsibility for its preservation, web content is rapidly becoming the official record. For example, in 2017, the United States Department of Justice affirmed that the government treats the President's tweets as official statements.


Collecting the web

Web archivists generally archive various types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing authenticity and provenance of the archived collection.
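As an illustration of the kind of metadata involved, the following is a minimal sketch in Python, using only the standard library, of capturing a single resource together with its access time, MIME type, content length, and a checksum that can support later authenticity checks. The URL and field names are illustrative assumptions; a production archive would typically write WARC records with much richer provenance information.

    import hashlib
    import json
    from datetime import datetime, timezone
    from urllib.request import urlopen

    def capture(url):
        """Fetch a URL and return its payload plus basic archival metadata."""
        access_time = datetime.now(timezone.utc).isoformat()
        with urlopen(url) as response:
            payload = response.read()
            mime_type = response.headers.get("Content-Type")
        return {
            "url": url,
            "access_time": access_time,                     # when the snapshot was taken
            "mime_type": mime_type,                         # declared media type
            "content_length": len(payload),                 # bytes actually received
            "sha256": hashlib.sha256(payload).hexdigest(),  # fixity value for authenticity
            "payload": payload,
        }

    if __name__ == "__main__":
        record = capture("https://example.com/")
        print(json.dumps({k: v for k, v in record.items() if k != "payload"}, indent=2))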


Methods of collection


Remote harvesting

The most common web archiving technique uses web crawlers to automate the process of collecting web pages. Web crawlers typically access web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content. Examples of web crawlers used for web archiving include:
* Heritrix
* HTTrack
* Wget

Various free services may be used to archive web resources "on demand" using web crawling techniques; these include the Wayback Machine and WebCite.
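The sketch below illustrates the remote harvesting idea in Python with only the standard library: it fetches pages breadth-first from a seed URL, stays on the seed's host, and stores the raw responses to disk. The seed URL, page budget, and file-naming scheme are assumptions for illustration; real archival crawlers such as Heritrix add politeness delays, robots.txt handling, deduplication, and WARC output.

    import os
    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def harvest(seed, max_pages=10, out_dir="archive"):
        """Breadth-first crawl from a seed URL, saving each response to out_dir."""
        os.makedirs(out_dir, exist_ok=True)
        queue, seen, fetched = deque([seed]), {seed}, 0
        host = urlparse(seed).netloc
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urlopen(url) as resp:
                    body = resp.read()
            except OSError:
                continue  # skip unreachable resources
            fetched += 1
            # Store the raw bytes under a filesystem-safe name derived from the URL.
            name = re.sub(r"[^A-Za-z0-9]+", "_", url)[:100] + ".html"
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(body)
            # Follow same-host links only, as a simple crawl scope.
            for href in re.findall(rb'href="([^"]+)"', body):
                link = urljoin(url, href.decode("utf-8", "ignore"))
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)

    if __name__ == "__main__":
        harvest("https://example.com/")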


Database archiving

Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system. This approach is exemplified by the DeepArc and Xinq tools developed by the Bibliothèque Nationale de France and the National Library of Australia respectively. DeepArc enables the structure of a relational database to be mapped to an XML schema, and the content exported into an XML document. Xinq then allows that content to be delivered online. Although the original layout and behavior of the website cannot be preserved exactly, Xinq does allow the basic querying and retrieval functionality to be replicated.
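The following is a minimal sketch of the relational-to-XML mapping idea, written in Python with SQLite and the standard library. It is not DeepArc or Xinq; the database file, table name, column names, and element names are illustrative assumptions.

    import sqlite3
    import xml.etree.ElementTree as ET

    def export_table_to_xml(db_path, table, out_path):
        """Dump every row of one table as <record> elements under a root <archive>."""
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row
        root = ET.Element("archive", {"table": table})
        for row in conn.execute(f"SELECT * FROM {table}"):
            record = ET.SubElement(root, "record")
            for column in row.keys():
                field = ET.SubElement(record, column)   # one element per column
                field.text = "" if row[column] is None else str(row[column])
        conn.close()
        ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

    if __name__ == "__main__":
        # Build a toy database so the example is self-contained.
        conn = sqlite3.connect("site.db")
        conn.execute("CREATE TABLE IF NOT EXISTS articles (id INTEGER, title TEXT)")
        conn.execute("INSERT INTO articles VALUES (1, 'Hello, archive')")
        conn.commit()
        conn.close()
        export_table_to_xml("site.db", "articles", "articles.xml")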


Transactional archiving

Transactional archiving is an event-driven approach which collects the actual transactions that take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website on a given date. This may be particularly important for organizations which need to comply with legal or regulatory requirements for disclosing and retaining information. A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bitstreams.
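As a rough illustration of that interception-and-deduplication flow, the sketch below wraps a Python WSGI application so that every response body is hashed, and bodies not seen before are written to disk as bitstreams. The class name, storage layout, and in-memory hash set are assumptions for illustration, not features of any particular transactional archiving product.

    import hashlib
    import os
    import time

    class TransactionalArchiver:
        """WSGI middleware: intercept responses, skip duplicates by hash, store the bytes."""

        def __init__(self, app, store_dir="transactions"):
            self.app = app
            self.store_dir = store_dir
            self.seen_hashes = set()
            os.makedirs(store_dir, exist_ok=True)

        def __call__(self, environ, start_response):
            body = b"".join(self.app(environ, start_response))
            digest = hashlib.sha256(body).hexdigest()
            if digest not in self.seen_hashes:            # filter out duplicate content
                self.seen_hashes.add(digest)
                name = "%d_%s.bin" % (int(time.time()), digest[:16])
                with open(os.path.join(self.store_dir, name), "wb") as f:
                    f.write(body)                         # store the response as a bitstream
            return [body]

    # Usage: wrap any WSGI application, e.g. application = TransactionalArchiver(application)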


Difficulties and limitations


Crawlers

Web archives which rely on web crawling as their primary means of collecting the Web are influenced by the difficulties of web crawling:
* The robots exclusion protocol may request that crawlers not access portions of a website. Some web archivists may ignore the request and crawl those portions anyway.
* Large portions of a web site may be hidden in the Deep Web. For example, the results page behind a web form can lie in the Deep Web if crawlers cannot follow a link to the results page.
* Crawler traps (e.g., calendars) may cause a crawler to download an infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl (a robots.txt check and page budget are sketched below).
* Most archiving tools do not capture the page exactly as it is; ad banners and images are often missed while archiving.

However, it is important to note that a native format web archive, i.e., a fully browsable web archive with working links, media, etc., is only really possible using crawler technology. The Web is so large that crawling a significant portion of it takes a large number of technical resources, and it changes so fast that portions of a website may change before a crawler has even finished crawling it.
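A minimal sketch of two of those mitigations, assuming Python's standard library: consulting the robots exclusion protocol before fetching, and enforcing a hard page budget so a crawler trap cannot run indefinitely. The user-agent string and budget values are illustrative.

    from urllib import robotparser
    from urllib.parse import urljoin

    USER_AGENT = "example-archiver"   # illustrative crawler name
    MAX_PAGES = 500                   # hard page budget against crawler traps

    def allowed_by_robots(url):
        """Return True if the site's robots.txt permits fetching this URL."""
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            return True  # no robots.txt reachable; treat as allowed
        return rp.can_fetch(USER_AGENT, url)

    def should_fetch(url, pages_fetched):
        """Combine the robots check with the page budget."""
        return pages_fetched < MAX_PAGES and allowed_by_robots(url)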


General limitations

Some web servers are configured to return different pages to web archiver requests than they would in response to regular browser requests. This is typically done to fool search engines into directing more user traffic to a website, and is often done to avoid accountability, or to provide enhanced content only to those browsers that can display it.

Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web". However, national libraries in some countries have a legal right to copy portions of the web under an extension of a legal deposit.

Some private non-profit web archives that are made publicly accessible, like WebCite, the Internet Archive or the Internet Memory Foundation, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite cites a recent lawsuit against Google's caching, which Google won.


Laws

In 2017 the Financial Industry Regulatory Authority, Inc. (FINRA), a United States financial regulatory organization, released a notice stating that all businesses conducting digital communications are required to keep a record of them. This includes website data, social media posts, and messages. Some copyright laws may inhibit web archiving. For instance, academic archiving by Sci-Hub falls outside the bounds of contemporary copyright law. The site provides enduring access to academic works, including those that do not have an open access license, and thereby contributes to the archival of scientific research which may otherwise be lost.


See also

* Anna's Archive
* Archive site
* Archive Team
* archive.today (formerly archive.is)
* Collective memory
* Common Crawl
* Digital hoarding
* Digital preservation
* Digital library
* Google Cache
* List of Web archiving initiatives
* List of web archives on Wikipedia
* Memento Project
* Minerva Initiative
* Mirror website
* National Digital Information Infrastructure and Preservation Program (NDIIPP)
* National Digital Library Program (NDLP)
* PADICAT
* PageFreezer
* Pandora Archive
* UK Web Archive
* Virtual artifact
* Wayback Machine
* Web crawling
* WebCite



External links


* International Internet Preservation Consortium (IIPC): an international consortium whose mission is to acquire, preserve, and make accessible knowledge and information from the Internet for future generations
* International Web Archiving Workshop (IWAW): an annual workshop that focuses on web archiving
* Library of Congress, Web Archiving: https://www.loc.gov/webarchiving/
* Web archiving bibliography: a lengthy list of web-archiving resources
* Julien Masanès, Bibliothèque Nationale de France
* Comparison of web archiving services