End of Term Web Archive

End of Term Web Archive (EOTArchive)
Image: North America Geological Tapestry. A version of this USGS map was archived by project partner UNT in the 2008 End of Term collection.
Mission statement: "The End of Term Web Archive captures and saves U.S. Government websites at the end of presidential administrations."
Commercial: No
Type of project: Collaborative government web archive
Established: 2008
Website: eotarchive.org

The End of Term Web Archive is an archival project that preserves U.S. federal government websites during administration changes. [1]

Background

The End of Term Web Archive was set up following a 2008 announcement from the National Archives and Records Administration (NARA) that it would not be archiving government websites during the presidential transition, as it had done in 2000 and 2004. [2] The 2004 federal web harvest can be accessed alongside congressional web harvests, beginning with the 109th United States Congress, on the National Archives website. [3]

The first project partners were the Library of Congress, George Washington University, Stanford University, the University of North Texas, the U.S. Government Publishing Office, the California Digital Library, and the Internet Archive, all members of the International Internet Preservation Consortium (IIPC). The project was initially sketched out after an IIPC General Assembly in 2008. [4] NARA and the Environmental Data & Governance Initiative (EDGI) joined the 2020/21 project. [5]

The project

Image: White House.gov 404 error, January 20, 2009. Custom error page used to direct whitehouse.gov visitors as the website changed in 2009.

The project archives websites and documents for public access and research use. [6] A UNT study into the risk to document files found that 83% of PDFs present on the .gov domain in 2008 were missing four years later. [7] Such turnover is consistent with agencies' routine management of their websites, but the official status of government content means that changes may be of interest to the public and to watchdog groups. [8] Demand for continued access to historical web material was evident in 2017, when the EPA responded to concerns about changes to its website by announcing that pages from the previous administration would be carefully archived. [9] These snapshot pages were clearly marked to distinguish them from contemporary content. [10]
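
The 83% figure describes link rot, which can be measured directly by checking whether URLs harvested in an earlier crawl still resolve. The following is a minimal sketch of such a check, not the UNT study's actual methodology; the sample URLs are hypothetical placeholders.

```python
# A minimal sketch (not the UNT study's methodology) of a link-rot check:
# given URLs of .gov PDFs harvested in an earlier crawl, test which still
# resolve today. The sample URLs below are hypothetical placeholders.
import urllib.error
import urllib.request


def is_still_online(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL still answers with a 2xx/3xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True  # urlopen raises HTTPError for 4xx/5xx responses
    except (urllib.error.URLError, TimeoutError):
        return False


if __name__ == "__main__":
    harvested_urls = [
        "https://www.example.gov/reports/2008/annual-report.pdf",  # placeholder
        "https://www.example.gov/data/2008/figures.pdf",           # placeholder
    ]
    missing = [u for u in harvested_urls if not is_still_online(u)]
    print(f"{len(missing)} of {len(harvested_urls)} harvested PDFs no longer resolve")
```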

The archive prioritizes sites covering areas regarded as likely to be updated or removed over the period of transition. [11] The public is encouraged to nominate important sites, and these nominations are combined with broad crawls of government domains to create the collection. [12] [13] Although the collection is extensive (the 2016 crawl preserved 11,382 sites), it stops short of being comprehensive. [14] [15] Researchers have used these collections to examine the history of climate change policy and the reuse of suspended U.S. government Twitter accounts. [16] [17]
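
End of Term crawls are preserved with project partner the Internet Archive and replayed through Wayback-style interfaces, so captures of a given page can be enumerated programmatically. The sketch below is illustrative only, using the Internet Archive's public Wayback CDX API; the target URL and date window are assumptions chosen for the example, not part of the project's own tooling.

```python
# A minimal sketch, not the project's own tooling: listing archived captures
# of a government page around the 2017 transition via the Internet Archive's
# public Wayback CDX API. The target URL and date window are illustrative
# assumptions chosen for this example.
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def list_captures(url: str, start: str, end: str, limit: int = 20):
    """Return (timestamp, original_url, status) tuples for archived captures."""
    query = urllib.parse.urlencode({
        "url": url,
        "from": start,      # yyyyMMdd
        "to": end,          # yyyyMMdd
        "output": "json",
        "limit": str(limit),
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{query}", timeout=30) as response:
        rows = json.load(response)
    if not rows:
        return []
    header, entries = rows[0], rows[1:]  # first row holds the field names
    ts = header.index("timestamp")
    orig = header.index("original")
    status = header.index("statuscode")
    return [(row[ts], row[orig], row[status]) for row in entries]


if __name__ == "__main__":
    for timestamp, original, status in list_captures(
        "epa.gov/climatechange", start="20161101", end="20170601"
    ):
        print(timestamp, status, original)
```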

See also

Data rescue
Wayback Machine

References

  1. Dwyer, Jim (2016-12-02). "Harvesting Government History, One Web Page at a Time". The New York Times. ISSN 0362-4331. Archived from the original on 2020-01-18. Retrieved 2020-12-07.
  2. Webster, Peter (2017). Brügger, Niels (ed.). "Users, technologies, organisations: Towards a cultural history of world web archiving". Web 25. Histories from 25 Years of the World Wide Web: 179–190. doi:10.3726/b11492. hdl:2318/1770557. ISBN 9781433140655. Archived from the original on 2020-10-21.
  3. "Congressional & Federal Government Web Harvests". National Archives. Archived from the original on 2017-09-18. Retrieved 2021-01-18.
  4. Seneca, Tracy; Grotke, Abbie; Hartman, Cathy Nelson; Carpenter, Kris (2012). "It Takes a Village to Save the Web: The End of Term Web Archive" (PDF). DTTP: Documents to the People. 40: 16. ISSN 0091-2085. Archived from the original (PDF) on 2015-09-08.
  5. "GitHub - end-of-term/eot2020". GitHub. Archived from the original on 2020-12-05. Retrieved 2020-12-14.
  6. "End of Term Web Archive: U.S. Government Websites". 2020-12-06. Archived from the original on 2020-12-06. Retrieved 2020-12-15.
  7. Gilmore, Courtney (2020-12-04). "UNT Part of Team Archiving Obama Administration Web Content". NBC 5 Dallas-Fort Worth. Archived from the original on 2020-12-07. Retrieved 2020-12-04.
  8. "Website Monitoring". Environmental Data and Governance Initiative. Archived from the original on 2020-12-06. Retrieved 2021-02-24.
  9. Mooney, Chris; Eilperin, Juliet. "EPA website removes climate science site from public view after two decades". The Washington Post. ISSN 0190-8286. Archived from the original on 2017-04-29. Retrieved 2021-02-18.
  10. "Climate Change | US EPA". 2017-04-29. Archived from the original on 2017-04-29. Retrieved 2021-04-08.
  11. "Guerrilla Archiving". The Politics of Evidence. 2016-12-05. Archived from the original on 2020-08-04. Retrieved 2020-12-07.
  12. Jacobs, James R. (2020-08-10). "Nominations sought for the U.S. Federal Government Domain End of Term 2020 Web Archive". Free Government Information (FGI). Archived from the original on 2020-10-04. Retrieved 2020-12-07.
  13. "And so it begins. We have officially started crawling the websites nominated for the End of Term 2020 web archive! But don't worry, you still have time to nominate more! What are your favorite government sites? #WebArchiveWednesday #WebArchives #GovDocs". End of Term Archive on Twitter. 2020-10-07. Archived from the original on 2020-10-07. Retrieved 2020-11-06.
  14. O'Keefe, Ed (2015-10-08). "How many .gov sites exist? Thousands." The Washington Post. Archived from the original on 2015-10-08. Retrieved 2020-12-04.
  15. Young, Lauren J. "The Librarians Saving The Internet". Science Friday. Archived from the original on 2020-11-09. Retrieved 2020-12-04.
  16. Rinberg, Toly; Anjur-Dietrich, Maya; Beck, Marcy; et al. (EDGI). "Changing the Digital Climate". 100days.envirodatagov.org. Archived from the original on 2018-04-04. Retrieved 2021-01-14.
  17. Littman, Justin (2017-11-04). "Suspended U.S. government Twitter accounts". Social Feed Manager. Archived from the original on 2017-11-07. Retrieved 2020-12-07.