Australian Web Archive

The Australian Web Archive (AWA) is a publicly available online database of archived Australian websites, hosted by the National Library of Australia (NLA) on its Trove platform, an online library database aggregator. It comprises the NLA's own PANDORA archive, the Australian Government Web Archive (AGWA) and the NLA's ".au" domain collections. Access is through a single interface in Trove. [1] [2] [3] The Australian Web Archive was created in March 2019, [4] and is one of the biggest web archives in the world. [5] Its purpose is to provide a resource for historians and researchers, now and into the future. [5]

History of the three components

The PANDORA service started archiving websites in October 1996. [6]

In 2005, the NLA started archiving annual snapshots of the entire Australian web domain (URLs with the suffix ".au" [4]), [7] collected via large crawl harvests. [8] Later, the earliest websites from the .au web domain, dating back to 1996, were obtained from the Internet Archive. In 2019 this content was first made publicly accessible through Trove. [9]

The PANDORA infrastructure, which works well for selective, small-scale archiving, does not adapt to large-scale "bulk harvesting" of web content. A new technical system therefore had to be developed: a web archiving service that integrates the delivery of archived websites within a live website interface and presents them seamlessly to the user, which is technically difficult to achieve. [10]

AGWA

Australian Government websites are Commonwealth records, and are therefore publications to be managed in accordance with the Archives Act 1983. [11]

The Australian Government Web Archive (AGWA) consists of bulk archiving of Commonwealth Government websites. The NLA began regular harvests of the websites in June 2011, [12] after a significant obstacle had been overcome: an administrative agreement made in May 2010 allowed the NLA to collect, preserve and make accessible government websites without having to seek prior permission for each website or document, as had previously been required. The service uses the Heritrix web crawler for harvesting, WARC files for storage and OpenWayback for delivery. The government publishes a huge amount of material online, but preserving it poses many challenges, such as the sudden disappearance of content. In March 2014, the AGWA was made publicly accessible. [10]
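As a rough illustration of the WARC storage format mentioned above, the following Python sketch writes and re-reads a single uncompressed WARC 1.0 record using only the standard library. It is not the AGWA pipeline: the function names `build_warc_record` and `parse_warc_record` and the example URL are hypothetical, and the simple "resource" record type is chosen for brevity, whereas a crawler typically stores full HTTP responses. In practice a maintained WARC library would be used instead.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Serialise one WARC/1.0 'resource' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A blank line separates headers from the content block;
    # two CRLFs terminate the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def parse_warc_record(record: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split one record into its named headers and its content block."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length says exactly how many bytes belong to the block.
    length = int(headers["Content-Length"])
    return headers, rest[:length]


# Hypothetical example page, not real archive content.
record = build_warc_record("http://example.gov.au/", b"<html>hi</html>")
parsed_headers, parsed_body = parse_warc_record(record)
```

Because each record states its own length and target URI, many captures can be concatenated into one file and still be located individually at replay time.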

The AGWA meets the preservation and retention requirements for websites as "retain as national archives" (RNA) material under the Archives Act; however, videos and document files (such as PDFs or Word documents) are not always captured, so must be managed separately. [11]

As of early 2015, the AGWA included content dating from 2005, which amounted to about 144 million files occupying 15 terabytes. It only included Commonwealth Government websites collected through bulk harvests of nearly 1000 seed URLs. The scheduling of the harvests was not yet routinely established, but harvests were being conducted roughly three times per year. [10]

Amalgamation

In 2017, the AGWA and the PANDORA archive were amalgamated with the other web archive collections to form the Trove web archive collection. [9] After further development and the creation of the Australian Web Archive, government websites archived via AGWA and now included in AWA can still be searched separately using the "Advanced Search" option. [9]

Description of AWA

A web archive is described by the NLA as a "collection of snapshots of websites captured while they are accessible on the web, and then preserved in a static copy". The collection archived in the AWA is "relevant to the cultural, social, political, research and commercial life and activities of Australia and Australians". It collects web material via both scheduled archiving of selected websites and publications as well as some ad hoc harvesting relating to significant events. [9]

As of March 2019, when it began, AWA already contained around 600 terabytes of data, with 9 billion records. [5] [13] It contains more functionality than the Wayback Machine, hosted by the Internet Archive, allowing full-text searching using a search engine built in-house. The developers also devised techniques to filter out unwanted "noise". The data remains on the Library servers, although a move to the cloud is envisaged in the future, as content grows. [5] Usability by a wide range of users, and in particular the search functionality, were major focuses during development. [9]

The archive is fully searchable, based on a combination of techniques used by the developers. They created a unique and complex search algorithm by adapting a version of Google's page ranking algorithm (based on frequency of clicks on a page), modified to lead to better, high-quality resources. Other technologies include a Bayesian filter (effectively a spam filter), a Not Safe For Work classifier from Yahoo, and machine learning. [14]
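The NLA's modified ranking algorithm is not described in detail in the sources cited here. For orientation only, the sketch below shows the textbook, link-based form of PageRank computed by power iteration, which is presumably the kind of starting point being adapted. The function name `pagerank`, the damping factor of 0.85 and the four-page example graph are illustrative assumptions, not details of the AWA search engine.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Textbook PageRank by power iteration over a link graph."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph; these hostnames are placeholders.
example_links = {
    "home.example.au": ["news.example.au", "about.example.au"],
    "news.example.au": ["home.example.au"],
    "about.example.au": ["home.example.au"],
    "orphan.example.au": ["home.example.au"],
}
ranks = pagerank(example_links)
best = max(ranks, key=ranks.get)
```

In this toy graph the page that every other page links to ends up with the highest score. A production search engine would combine such a score with text relevance and the filtering steps listed above.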

There is a "Limit to the gov.au web domain" option before searching, [15] and government websites archived via AGWA can still be searched separately using the "Advanced Search" option. [9] Other options in Advanced Search are to limit by timespan of the snapshots, domain and file type. [16]

With many of the earlier websites from the 1990s now lost, mainly because of the frequent change of web platforms, the Australian Web Archive is a significant initiative that will help to save current and future web pages, especially Australian content. [4] Material will continue to be added to the Archive, and other online material collected in accordance with the National Library Act 1960, the legal deposit provisions of the Copyright Act 1968 and the NLA's digital collections selection policy. [9]

Asia/Pacific websites

Websites in the Asia Pacific region are not included in the AWA, but the NLA partners with the Internet Archive to collect and preserve "selected Asia/Pacific websites related to specific events or socio-political groups". [17]

References

  1. "Preserving and Accessing Networked DOcumentary Resources of Australia". Pandora Archive. Retrieved 30 April 2020.
  2. "Archived websites". National Library of Australia. 23 March 2020. Retrieved 30 April 2020.
  3. Koerbin, Paul (11 February 2015). "The Australian Government Web Archive". National Library of Australia. Archived from the original on 30 April 2020. Retrieved 30 April 2020.
  4. Bruns, Axel (14 March 2019). "The Australian Web Archive is a momentous achievement – but things will get harder from here". The Conversation. Retrieved 30 April 2020.
  5. Nott, George (11 March 2019). "National Library launches 'enormous' archive of Australia's Internet". Computerworld. Retrieved 6 May 2020.
  6. "History and Achievements". PANDORA. 18 February 2009. Retrieved 6 May 2020.
  7. McKenzie, Amelia (12 March 2019). "Preserving Australia's Web History: The beginning of the Australian Web Archive". National Library of Australia. Retrieved 6 May 2020.
  8. "Archived websites (1996 – now)". Trove. Retrieved 6 May 2020.
  9. "About the Australian Web Archive". Trove Help Centre. Archived from the original on 17 March 2020. Retrieved 8 May 2020.
  10. Koerbin, Paul (11 February 2015). "The Australian Government Web Archive: Collecting the government's online documentary heritage goes large scale". National Library of Australia. Archived from the original on 1 May 2020. Retrieved 6 May 2020.
  11. "Archiving Australian Government websites". National Archives of Australia. Retrieved 8 May 2020.
  12. "Archived websites". National Library of Australia. 7 December 2018. Retrieved 6 May 2020.
  13. Note: the AWA help page gives 400 TB and 8 billion records.
  14. "Check Out Australia's Web Archive". Southern Phone. 11 April 2019. Retrieved 8 May 2020.
  15. "Australian Web Archive". Trove. Retrieved 8 May 2020.
  16. "Australian Web Archive - Advanced Search". Trove. Retrieved 8 May 2020.
  17. "Archived websites". National Library of Australia. 23 March 2020. Retrieved 8 May 2020.