Webarchiv

Type of site	Digital library
Available in	Czech, English
Founded	2000
Headquarters	Prague, Czech Republic
Parent	National Library of the Czech Republic
URL	Webarchiv.cz
Launched	2001

Last updated November 01, 2023

Webarchiv is a digital archive of important Czech web resources (i.e. those published on the Internet), which are collected with the aim of long-term preservation.

Types of harvests

The main aim of the Webarchiv project is to implement a comprehensive solution for archiving the national web, i.e. born-digital Bohemica documents. This includes tools and methods for collecting, archiving and preserving web resources, as well as providing long-term access to them. Both large-scale automated harvesting of the entire national web and selective archiving are carried out, including thematic "event-based" collections. At present these methods are being tested and are the subject of further research. For all operations to run routinely, two conditions must be met: long-term funding has to be secured and the current legal issues have to be resolved (primarily the legal deposit legislation).^[2]

Webarchiv has two collections of archived websites. The first is available online; it is a limited dataset whose content is covered by agreements with the original publishers. The second collection can be accessed only in the library. Under Czech copyright law, online access to archived websites requires either an agreement with the website owner or a Creative Commons licence. Websites covered by neither are blocked in the online archive and are accessible only from the library terminals.^[3]
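The access rule described above can be sketched as a small decision function. This is only an illustration: the names `Access` and `access_level` and the two boolean flags are hypothetical and do not come from Webarchiv's own software.

```python
from enum import Enum


class Access(Enum):
    """Where an archived website can be viewed (illustrative labels)."""
    ONLINE = "online"
    LIBRARY_ONLY = "library terminals only"


def access_level(has_owner_agreement: bool, has_cc_licence: bool) -> Access:
    """Return the access level implied by the rule in the text.

    Assumption: an agreement with the website owner or a Creative Commons
    licence is each sufficient on its own for online access.
    """
    if has_owner_agreement or has_cc_licence:
        return Access.ONLINE
    return Access.LIBRARY_ONLY
```

For example, a site with no agreement and no Creative Commons licence would map to `Access.LIBRARY_ONLY`.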

Selective harvests

A collection of manually selected resources of historical, scientific or cultural value. The collection is accessible online thanks to contracts with the publishers.

Comprehensive harvests

The main focus of comprehensive crawls is to automatically harvest as many Czech web resources as possible. The list of URLs is provided by the organisation CZ.NIC. The requirements for comprehensive crawls are:

Domain – web resources in the Czech domain (.cz) are collected. Resources in other domains can also be harvested, but they have to meet the requirements listed below.

The other requirements are optional:^[4]

Format – which resource formats are harvested depends on the technical settings of the harvester^[4]

Access – only freely accessible resources are harvested^[4]

Number of files – a maximum of 5000 files from one domain^[4]
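As a rough illustration, the criteria for comprehensive crawls could be expressed as a scope check like the one below. It is a minimal sketch under stated assumptions, not Webarchiv's actual harvester configuration: the function name `accept` is hypothetical, the host name stands in for "domain", only .cz hosts are accepted, the access and file-count criteria are treated as hard limits, and the format criterion is left out because it depends on harvester settings.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_FILES_PER_DOMAIN = 5000  # limit quoted in the requirements list


def accept(url: str, freely_accessible: bool, files_per_host: Counter) -> bool:
    """Decide whether a URL falls within the (simplified) crawl scope.

    Assumptions of this sketch: the host name is used as the "domain",
    and non-.cz resources are simply rejected.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not host.endswith(".cz"):
        return False  # outside the Czech domain
    if not freely_accessible:
        return False  # only freely accessible resources are harvested
    if files_per_host[host] >= MAX_FILES_PER_DOMAIN:
        return False  # per-domain file budget already used up
    files_per_host[host] += 1  # count the accepted file against the budget
    return True
```

A caller would keep one `Counter` per crawl and pass it to every `accept` call so that the per-host budget accumulates across URLs.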

Topic harvests

Topic collections are collections of resources related to a certain event or topic, for example elections.

References

[1] "Overview of the WebArchiv project". WebArchiv. Retrieved 18 March 2014.
[2] "About Webarchiv | Webarchiv.cz".
[3] "Frequently Asked Questions | Webarchiv.cz".
[4] "Comprehensive Harvests". Retrieved 2023-10-31.

External links

Webarchiv homepage (Czech, English language option available)
Archiving the Czech Web: Issues and Challenges. Petr Žabička, 2003

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
