Archive site

In web archiving, an archive site is a website that stores snapshots of webpages from the past for anyone to view.

Common techniques

Two common techniques for archiving websites are using a web crawler and soliciting user submissions:

  1. Using a web crawler: A service that relies on a web crawler (e.g., the Internet Archive) does not depend on an active community for its content and can therefore build a larger database faster. However, web crawlers can only index and archive information that the public has chosen to post to the Internet and that is available to be crawled, since website developers and system administrators can block crawlers from accessing certain web pages (using a robots.txt file); a minimal sketch of this approach follows the list.
  2. User submissions: While a user-submission service can be difficult to start because of potentially low submission rates, this approach can yield some of the best results. Crawling captures only the information the public has chosen to post online, and potential content providers may not bother to post certain information because they assume no one would be interested in it, because they lack a proper venue in which to post it, or because of copyright concerns. [1] Users who see that someone wants their information, however, may be more willing to submit it.
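As an illustration of the first technique, below is a minimal crawler sketch in Python using only the standard library: it consults a site's robots.txt before fetching a page and saves a timestamped snapshot. The bot name and example URL are hypothetical placeholders; this is a sketch of the general approach, not any particular archive's implementation.

```python
# Minimal sketch of a polite archiving crawler (standard library only).
# "example-archive-bot" and the example URL are hypothetical placeholders.
import time
import urllib.request
import urllib.robotparser
from pathlib import Path
from urllib.parse import urlparse

USER_AGENT = "example-archive-bot/0.1"  # hypothetical bot name

def allowed_by_robots(url: str) -> bool:
    """Check the site's robots.txt to see if our user agent may fetch url."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return False  # be conservative if robots.txt cannot be retrieved
    return robots.can_fetch(USER_AGENT, url)

def archive_page(url: str, out_dir: Path) -> None:
    """Fetch url, if robots.txt permits it, and save a timestamped snapshot."""
    if not allowed_by_robots(url):
        print(f"skipping {url}: disallowed by robots.txt")
        return
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        body = response.read()
    stamp = time.strftime("%Y%m%d%H%M%S")
    snapshot = out_dir / f"{urlparse(url).netloc}-{stamp}.html"
    snapshot.write_bytes(body)
    print(f"archived {url} -> {snapshot}")

if __name__ == "__main__":
    archive_page("https://example.com/", Path("."))
```

A production crawler such as the Internet Archive's Heritrix adds link extraction, queue management, politeness delays, and WARC-format output; this sketch shows only the robots.txt gate and the snapshot step.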

Examples

Google Groups

On 12 February 2001, Google acquired the Usenet discussion group archives from Deja.com and turned them into its Google Groups service. [2] Google Groups lets users search old discussions with Google's search technology while still allowing them to post to the newsgroups.

Internet Archive

The Internet Archive is building a compendium of websites and digital media. Since 1996, the Archive has employed a web crawler to build up its database. It is one of the best-known archive sites.

NBCUniversal Archives

NBCUniversal Archives offers access to exclusive content from NBCUniversal and its subsidiaries. The website provides easy viewing of past and recent news clips and is a prime example of a news archive. [3]

Nextpoint

Nextpoint offers an automated, cloud-based SaaS platform for marketing, compliance, and litigation-related needs, including electronic discovery.

PANDORA Archive

PANDORA, founded in 1996 by the National Library of Australia, stands for Preserving and Accessing Networked Documentary Resources of Australia, a name that encapsulates its mission. It provides a long-term catalog of selected online publications and websites that are authored by Australians or are on Australian topics. The catalog is built with PANDAS (the PANDORA Digital Archiving System).

textfiles.com

textfiles.com is a large library of old text files maintained by Jason Scott Sadofsky. Its mission is to archive the documents that circulated on the bulletin board systems (BBSes) of his youth and to document other people's experiences of those systems.

See also

  - Web crawler
  - Robots exclusion standard (robots.txt)
  - Search engine optimization
  - Distributed web crawling
  - Googlebot
  - Discussion group
  - Deep web
  - User agent
  - Google Groups
  - Usenet
  - Email harvesting
  - Web scraping
  - Sitemaps
  - Search engine
  - Web archiving
  - Wayback Machine
  - DeepPeep
  - Australian Web Archive
  - Search engine cache

References

  1. Niu, Jinfang (March–April 2012). "An Overview of Web Archiving". D-Lib Magazine. 18 (3/4). doi:10.1045/march2012-niu1.
  2. "Google Acquires Usenet Discussion Service and Significant Assets from Deja.com" (press release). Google. 12 February 2001.
  3. "NBCUniversal Archives". NBCUniversal.