Link rot

A rotten link usually leads to an error message, such as HTTP 404 "Not Found"

Link rot (also called link death, link breaking, or reference rot) is the phenomenon of hyperlinks tending over time to cease to point to their originally targeted file, web page, or server due to that resource being relocated to a new address or becoming permanently unavailable. A link that no longer points to its target, often called a broken, dead, or orphaned link, is a specific form of dangling pointer.

The rate of link rot is a subject of study and research due to its significance to the internet's ability to preserve information. Estimates of that rate vary dramatically between studies. Information professionals have warned that link rot could make important archival data disappear, potentially impacting the legal system and scholarship.

Prevalence

A number of studies have examined the prevalence of link rot within the World Wide Web, in academic literature that uses URLs to cite web content, and within digital libraries.

A 2002 study suggested that link rot within digital libraries is considerably slower than on the web, finding that about 3% of the objects were no longer accessible after one year [1] (equating to a half-life of nearly 23 years).

A 2003 study found that on the Web, about one link out of every 200 broke each week, [2] suggesting a half-life of 138 weeks. This rate was largely confirmed by a 2016–2017 study of links in Yahoo! Directory (which had stopped updating in 2014 after 21 years of development) that found the half-life of the directory's links to be two years. [3]
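These half-lives follow from treating link survival as exponential decay at a constant rate. As an illustrative sketch of the arithmetic (a modelling assumption, not notation used by the cited studies):

```latex
t_{1/2} \;=\; T \,\frac{\ln(1/2)}{\ln p}
```

where p is the fraction of links still reachable after each period of length T. With p = 0.97 per year (the 2002 digital-library figure) this gives ln(0.5)/ln(0.97) ≈ 22.8 years; with p = 199/200 per week (the 2003 figure) it gives ln(0.5)/ln(0.995) ≈ 138 weeks, matching the half-lives quoted above.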

A 2004 study showed that subsets of Web links (such as those targeting specific file types or those hosted by academic institutions) could have dramatically different half-lives. [4] The URLs selected for publication appear to have greater longevity than the average URL. A 2015 study by Weblock analyzed more than 180,000 links from references in the full-text corpora of three major open access publishers and found a half-life of about 14 years, [5] generally confirming a 2005 study that found that half of the URLs cited in D-Lib Magazine articles were active 10 years after publication. [6] Other studies have found higher rates of link rot in academic literature but typically suggest a half-life of four years or greater. [7] [8] A 2013 study in BMC Bioinformatics analyzed nearly 15,000 links in abstracts from Thomson Reuters's Web of Science citation index and found that the median lifespan of web pages was 9.3 years, and just 62% were archived. [9] A 2021 study of external links in New York Times articles published between 1996 and 2019 found a half-life of about 15 years (with significant variance among content topics) but noted that 13% of functional links no longer lead to the original content—a phenomenon called content drift. [10]

A 2013 study found that 49% of the links cited in U.S. Supreme Court opinions are dead. [11]

A 2023 study of United States COVID-19 dashboards found that 23% of the state dashboards available in February 2021 were no longer reachable at the same URLs in April 2023. [12]

Causes

Link rot can have several causes. A target web page may be removed. The server that hosts the target page could fail, be removed from service, or relocate to a new domain name. As far back as 1999, it was noted that, with the amount of material that can be stored on a hard drive, "a single disk failure could be like the burning of the library at Alexandria." [13] A domain name's registration may lapse or be transferred to another party. Some causes result in the link failing to find any target and returning an error such as HTTP 404; others leave the link pointing to content other than what its author intended.

Other reasons for broken links include:

Prevention and detection

Strategies for preventing link rot can focus on placing content where its likelihood of persisting is higher, authoring links that are less likely to be broken, taking steps to preserve existing links, or repairing links whose targets have been relocated or removed. [citation needed]

The creation of URLs that will not change with time is the fundamental method of preventing link rot. Preventive planning has been championed by Tim Berners-Lee and other web pioneers. [14]
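When a resource does have to move, one complementary measure is to keep the old URL answering with a permanent (HTTP 301) redirect to the new location, so that existing links continue to resolve rather than rot. The sketch below is a minimal, hypothetical illustration using Python's standard http.server module; the paths and port are invented for the example, and a production site would normally configure such redirects in the web server itself.

```python
# Minimal sketch: preserve inbound links to a relocated page by answering
# requests for the old URL with a permanent (301) redirect to the new one.
# Hypothetical paths; not taken from the article's cited sources.
from http.server import BaseHTTPRequestHandler, HTTPServer

MOVED = {
    "/articles/2004/linkrot.html": "/archive/linkrot",  # old path -> new path
}

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in MOVED:
            self.send_response(301)                      # Moved Permanently
            self.send_header("Location", MOVED[self.path])
            self.end_headers()
        else:
            self.send_response(404)                      # Not Found
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), RedirectingHandler).serve_forever()
```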

Strategies pertaining to the authorship of links include:

Strategies pertaining to the protection of existing links include:

The detection of broken links may be done manually or automatically. Automated methods include plug-ins for content management systems as well as standalone broken-link checkers such as Xenu's Link Sleuth. Automatic checking may not detect links that return a soft 404 or links that return a 200 OK response but point to content that has changed. [24]
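As a rough illustration of automated checking (not one of the tools cited above), the sketch below uses only Python's standard urllib to classify links by their HTTP response. Consistent with the caveat just noted, it cannot detect soft 404s or content drift, because those pages still return 200 OK.

```python
# Minimal link-rot checker (illustrative sketch): reports links that fail to
# resolve or that return an HTTP error status.
import urllib.request
import urllib.error

def check_link(url: str, timeout: float = 10.0) -> str:
    """Return a short status string for a single URL."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-checker-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # 200 OK does not guarantee the original content is still there:
            # soft 404s and content drift are invisible to this check.
            return f"OK ({response.status})"
    except urllib.error.HTTPError as err:   # e.g. 404 Not Found, 410 Gone
        return f"broken (HTTP {err.code})"
    except urllib.error.URLError as err:    # DNS failure, connection refused, etc.
        return f"broken ({err.reason})"

if __name__ == "__main__":
    for url in ["https://example.com/", "https://example.com/no-such-page"]:
        print(url, "->", check_link(url))
```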


References

  1. Nelson, Michael L.; Allen, B. Danette (2002). "Object Persistence and Availability in Digital Libraries". D-Lib Magazine. 8 (1). doi: 10.1045/january2002-nelson . Archived from the original on 2020-07-19. Retrieved 2019-09-24.
  2. Fetterly, Dennis; Manasse, Mark; Najork, Marc; Wiener, Janet (2003). "A large-scale study of the evolution of web pages". Proceedings of the 12th international conference on World Wide Web. Archived from the original on 9 July 2011. Retrieved 14 September 2010.
  3. van der Graaf, Hans. "The half-life of a link is two year". ZOMDir's blog. Archived from the original on 2017-10-17. Retrieved 2019-01-31.
  4. Koehler, Wallace (2004). "A longitudinal study of web pages continued: a consideration of document persistence". Information Research. 9 (2). Archived from the original on 2017-09-11. Retrieved 2019-01-31.
  5. "All-Time Weblock Report". August 2015. Archived from the original on 4 March 2016. Retrieved 12 January 2016.
  6. McCown, Frank; Chan, Sheffan; Nelson, Michael L.; Bollen, Johan (2005). "The Availability and Persistence of Web References in D-Lib Magazine" (PDF). Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05). Archived from the original (PDF) on 2012-07-17. Retrieved 2005-10-12.
  7. Spinellis, Diomidis (2003). "The Decay and Failures of Web References". Communications of the ACM. 46 (1): 71–77. CiteSeerX   10.1.1.12.9599 . doi:10.1145/602421.602422. S2CID   17750450. Archived from the original on 2020-07-23. Retrieved 2007-09-29.
  8. Steve Lawrence; David M. Pennock; Gary William Flake; et al. (March 2001). "Persistence of Web References in Scientific Research". Computer . 34 (3): 26–31. CiteSeerX   10.1.1.97.9695 . doi:10.1109/2.901164. ISSN   0018-9162. Wikidata   Q21012586.
  9. Hennessey, Jason; Xijin Ge, Steven (2013). "A Cross Disciplinary Study of Link Decay and the Effectiveness of Mitigation Techniques". BMC Bioinformatics. 14 (Suppl 14): S5. doi: 10.1186/1471-2105-14-S14-S5 . PMC   3851533 . PMID   24266891.
  10. "What the ephemerality of the Web means for your hyperlinks". Columbia Journalism Review. Archived from the original on 2021-08-02. Retrieved 2021-08-02.
  11. Garber, Megan (2013-09-23). "49% of the Links Cited in Supreme Court Decisions Are Broken". The Atlantic. Retrieved 2024-01-10.
  12. Adams, Aaron M.; Chen, Xiang; Li, Weidong; Chuanrong, Zhang (27 July 2023). "Normalizing the pandemic: exploring the cartographic issues in state government COVID-19 dashboards". Journal of Maps. 19 (5): 1–9. doi: 10.1080/17445647.2023.2235385 .
  13. McGranaghan, Matthew (1999). "The Web, Cartography and Trust". Cartographic Perspectives (32): 3–5. doi: 10.14714/CP32.624 .
  14. Berners-Lee, Tim (1998). "Cool URIs Don't Change". Archived from the original on 2000-03-02. Retrieved 2019-01-31.
  15. Kille, Leighton Walter (8 November 2014). "The Growing Problem of Internet "Link Rot" and Best Practices for Media and Online Publishers". Journalist's Resource, Harvard Kennedy School. Archived from the original on 12 January 2015. Retrieved 16 January 2015.
  16. Sicilia, Miguel-Angel, et al. "Decentralized Persistent Identifiers: a basic model for immutable handlers Archived 2023-05-10 at the Wayback Machine ." Procedia computer science 146 (2019): 123-130.
  17. "Internet Archive: Digital Library of Free Books, Movies, Music & Wayback Machine". 2001-03-10. Archived from the original on 26 January 1997. Retrieved 7 October 2013.
  18. Eysenbach, Gunther; Trudel, Mathieu (2005). "Going, going, still there: Using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research. 7 (5): e60. doi: 10.2196/jmir.7.5.e60 . PMC   1550686 . PMID   16403724.
  19. Zittrain, Jonathan; Albert, Kendra; Lessig, Lawrence (12 June 2014). "Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations" (PDF). Legal Information Management. 14 (2): 88–99. doi:10.1017/S1472669614000255. S2CID   232390360. Archived (PDF) from the original on 1 November 2020. Retrieved 10 June 2020.
  20. "Harvard University's Berkman Center Releases Amber, a "Mutual Aid" Tool for Bloggers & Website Owners to Help Keep the Web Available | Berkman Center". cyber.law.harvard.edu. Archived from the original on 2016-02-02. Retrieved 2016-01-28.
  21. "Arweave - A community-driven ecosystem". arweave.org. Archived from the original on 2023-03-15. Retrieved 2023-03-15.
  22. Rønn-Jensen, Jesper (2007-10-05). "Software Eliminates User Errors And Linkrot". Justaddwater.dk. Archived from the original on 11 October 2007. Retrieved 5 October 2007.
  23. Mueller, John (2007-12-14). "FYI on Google Toolbar's Latest Features". Google Webmaster Central Blog. Archived from the original on 13 September 2008. Retrieved 9 July 2008.
  24. Bar-Yossef, Ziv; Broder, Andrei Z.; Kumar, Ravi; Tomkins, Andrew (2004). "Sic transit gloria telae: towards an understanding of the Web's decay". Proceedings of the 13th international conference on World Wide Web – WWW '04. pp. 328–337. CiteSeerX   10.1.1.1.9406 . doi:10.1145/988672.988716. ISBN   978-1581138443.