The Australian Web Archive (AWA) is a publicly available online database of archived Australian websites, hosted by the National Library of Australia (NLA) on its Trove platform, an online library database aggregator. It comprises the NLA's own PANDORA archive, the Australian Government Web Archive (AGWA) and the NLA's ".au" domain collections. Access is through a single, publicly available interface in Trove. The Australian Web Archive was created in March 2019, and is one of the biggest web archives in the world. Its purpose is to provide a resource for historians and researchers, now and into the future.

History of the three components

The PANDORA service started archiving websites in October 1996. In 2005, the NLA started archiving annual snapshots of the entire Australian web domain (URLs with the suffix ".au"), collected via large crawl harvests. Later, the earliest websites from the .au web domain, dating back to 1996, were obtained from the Internet Archive. In 2019 this content was first made publicly accessible through Trove. The PANDORA infrastructure, which works well for selective, small-scale archiving, does not adapt to large-scale "bulk harvesting" of web content, so a new technical system had to be developed: a web archiving service that integrates the delivery of archived websites within a live website interface, so that the archived websites reach the user seamlessly. This is technically difficult to achieve.

AGWA

Australian Government websites are Commonwealth records, and are therefore publications to be managed in accordance with the Archives Act 1983. The Australian Government Web Archive (AGWA) consists of bulk archiving of Commonwealth Government websites. The NLA began regular harvests of these websites in June 2011, after a significant obstacle was overcome: an administrative agreement made in May 2010 allowed the NLA to collect, preserve and make accessible government websites without having to seek prior permission for each website or document, as had previously been required. The service uses the Heritrix web crawler for harvesting, WARC files for storage and OpenWayback for delivery. The government publishes a huge amount of material online, but preserving it poses many challenges, such as content disappearing suddenly. In March 2014, the AGWA was made publicly accessible. The AGWA meets the preservation and retention requirements for websites as "retain as national archives" (RNA) material under the Archives Act; however, videos and document files (such as PDFs or Word documents) are not always captured, so must be managed separately.

As of early 2015, the AGWA included content dating from 2005, which amounted to about 144 million files occupying 15 terabytes. It only included Commonwealth Government websites collected through bulk harvests of nearly 1000 seed URLs. The scheduling of the harvests was not yet routinely established, but harvests were being conducted roughly three times per year.
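The AGWA description above names WARC files as the storage format. As a rough illustration of what one such record looks like on disk, the following Python sketch builds and then re-parses a single simplified WARC/1.0 "response" record using only the standard library. The target URL, the payload, and the helper names build_warc_record and parse_warc_record are hypothetical and chosen for illustration; this is not how Heritrix or the NLA's systems are implemented, and a real harvest would rely on a dedicated WARC library.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Serialise one simplified WARC/1.0 'response' record (illustrative only)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(payload))),
    ]
    # Layout: version line, named fields, blank line, content block, two CRLFs.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def parse_warc_record(record: bytes) -> tuple[dict[str, str], bytes]:
    """Split a single record into its header fields and its content block."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if lines[0] != "WARC/1.0":
        raise ValueError("not a WARC/1.0 record")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length tells the reader where the block ends.
    length = int(fields["Content-Length"])
    return fields, rest[:length]


# Hypothetical captured HTTP response for a hypothetical government URL.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>archived page</html>"
record = build_warc_record("https://www.example.gov.au/", payload)
fields, body = parse_warc_record(record)
print(fields["WARC-Type"], fields["WARC-Target-URI"], len(body))
```

Storing the captured HTTP response verbatim alongside the capture date and target URL is what allows a replay tool to serve the page later as it appeared at harvest time.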

Amalgamation

In 2017, the AGWA and the PANDORA archive were amalgamated with the other web archive collections to form the Trove web archive collection. After further development and the creation of the Australian Web Archive, government websites archived via AGWA and now included in the AWA can still be searched separately using the "Advanced Search" option.

Description of AWA

A web archive is described by the NLA as a "collection of snapshots of websites captured while they are accessible on the web, and then preserved in a static copy". The collection archived in the AWA is "relevant to the cultural, social, political, research and commercial life and activities of Australia and Australians". It collects web material via scheduled archiving of selected websites and publications, as well as some ad hoc harvesting relating to significant events. When it launched in March 2019, the AWA already contained around 600 terabytes of data, with 9 billion records.

It offers more functionality than the Wayback Machine, hosted by the Internet Archive, allowing full-text searching using a search engine built in-house. The developers also devised techniques to filter out unwanted "noise". The data remains on the Library's servers, although a move to the cloud is envisaged in the future, as content grows.

Usability by a wide range of users, and in particular the search functionality, were major focuses during development. The archive is fully searchable, based on a combination of techniques. The developers created a unique and complex search algorithm by adapting a version of Google's page ranking algorithm (based on the frequency of clicks on a page), modified to surface better, higher-quality resources. Other technologies include a Bayesian filter (effectively a spam filter), a Not Safe For Work classifier from Yahoo, and machine learning.

There is a "Limit to the gov.au web domain" option before searching, and government websites archived via AGWA can still be searched separately using the "Advanced Search" option. Other options in Advanced Search limit results by timespan of the snapshots, domain and file type.

With many of the earlier websites from the 1990s now lost, mainly because of frequent changes of web platform, the Australian Web Archive is a significant initiative that will help to save current and future web pages, especially Australian content. Material will continue to be added to the Archive, and other online material collected, in accordance with the National Library Act 1960, the legal deposit provisions of the Copyright Act 1968 and the NLA's digital collections selection policy.
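The description above mentions a Bayesian filter, "effectively a spam filter", among the search technologies. The source does not say how that filter is built, so the following Python sketch is only a generic illustration of the idea: a small multinomial naive Bayes classifier with add-one smoothing that labels page text as "noise" or "keep". The class name NaiveBayesFilter, the two labels, and the training snippets are all hypothetical.

```python
import math
from collections import Counter

LABELS = ("noise", "keep")  # hypothetical labels for this sketch


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        """Record the words of one labelled example."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, words: list[str], label: str) -> float:
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        # Vocabulary size plus one slot for words never seen in training.
        vocab_size = len(set().union(*self.word_counts.values())) + 1
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior, then smoothed log likelihood of each word.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        return score

    def classify(self, text: str) -> str:
        """Return the label with the higher posterior log score."""
        words = text.lower().split()
        return max(LABELS, key=lambda label: self._log_score(words, label))


# Toy training data, invented for illustration only.
flt = NaiveBayesFilter()
flt.train("buy cheap pills now", "noise")
flt.train("cheap casino bonus click now", "noise")
flt.train("national library annual report", "keep")
flt.train("parliament budget report archive", "keep")

print(flt.classify("cheap pills casino"))
print(flt.classify("library budget report"))
```

Working in log space avoids numeric underflow on long pages, and the smoothing keeps a single unseen word from forcing a probability of zero.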

Asia/Pacific websites

Websites in the Asia Pacific region are not included in the AWA, but the NLA works in partnership to collect and preserve "selected Asia/Pacific websites related to specific events or socio-political groups".
