Archive Site

In web archiving, an archive site is a website that stores information on webpages from the past for anyone to view.


Common techniques

Two common techniques for archiving websites are using a web crawler and soliciting user submissions:

1. Using a web crawler: A service that uses a web crawler (e.g., the Internet Archive) does not depend on an active community for its content, and can therefore build a larger database faster. However, web crawlers can only index and archive information that the public has chosen to post to the Internet and that is open to crawling, because website developers and system administrators can block web crawlers from accessing certain web pages (using a robots.txt file).
2. User submissions: While user submission services can be difficult to start because of potentially low submission rates, this system can yield some of the best results. Crawling web pages captures only the information the public has chosen to post online; potential content providers may not bother to post certain information because they assume no one would be interested in it, because they lack a proper venue in which to post it, or because of copyright concerns. Users who see that someone wants their information, however, may be more apt to submit it.
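The robots.txt restriction described for crawlers can be illustrated with a small sketch. It uses Python's standard `urllib.robotparser` module to decide which pages a crawler would be permitted to archive. The robots.txt content, the `example.org` URLs, the `example-archiver` user agent, and the helper names `build_parser` and `archivable` are all hypothetical and chosen only for illustration; a real archive crawler would fetch the site's actual robots.txt and may apply its own policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the root of the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def archivable(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.org/index.html",
    "https://example.org/private/notes.html",
]
allowed = archivable(parser, "example-archiver", urls)
print(allowed)
```

In this sketch the page under `/private/` is filtered out, which is why a crawler-based archive can miss content that a site operator has chosen to exclude.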


Examples


Google Groups

On 12 February 2001, Google acquired the Usenet discussion group archives from Deja.com and turned them into its Google Groups service. The service lets users search old discussions with Google's search technology, while still allowing them to post to the mailing lists.


Internet Archive

The Internet Archive is building a compendium of websites and digital media. Since 1996, the Archive has used a web crawler to build up its database. It is one of the best-known archive sites.


NBCUniversal Archives

NBCUniversal Archives offers access to exclusive content from NBCUniversal and its subsidiaries. Its website provides easy viewing of past and recent news clips, and it is a prime example of a news archive.


Nextpoint

Nextpoint offers an automated, cloud-based software-as-a-service (SaaS) product for marketing, compliance, and litigation-related needs, including electronic discovery.


PANDORA Archive

PANDORA (Pandora Archive), founded in 1996 by the National Library of Australia, stands for Preserving and Accessing Networked Documentary Resources of Australia, which encapsulates its mission. It provides a long-term catalog of selected online publications and websites that are authored by Australians or that cover Australian topics. The catalog is built with PANDAS (PANDORA Digital Archiving System).


textfiles.com

textfiles.com is a large library of old text files maintained by Jason Scott Sadofsky. Its mission is to archive the old documents that had floated around the bulletin board systems (BBS) of his youth and to document other people's experiences on the bulletin board systems.


See also

* Internet Archive
* Pandora Archive
* WebCite
* Web archiving

