WebCite

Available in: English
Owner: University of Toronto [1]
Created by: Gunther Eysenbach
URL: www.webcitation.org
Commercial: No
Launched: 1997
Current status: Online, read-only

WebCite is an on-demand archive site, designed to digitally preserve scientific and educationally important material on the web by making snapshots of Internet content as it existed at the time a blogger or scholar cited or quoted it. The preservation service keeps claims verifiable against their cited sources even after the original web pages have been revised, removed, or otherwise disappeared, a problem known as link rot.


Sometime between 9 and 17 July 2019, WebCite stopped accepting new archiving requests. [2] As of February 2021, it did not accept new archiving requests but continued to serve existing archives. On 19 August 2021, the site displayed a maintenance announcement and no archived content was accessible; WebCite came back online sometime between then and 8 October 2021.

Service features

All types of web content, including HTML web pages, PDF files, style sheets, JavaScript, and digital images, can be preserved. WebCite also archives metadata about the collected resources, such as access time, MIME type, and content length.

WebCite is a non-profit consortium supported by publishers and editors, and individuals can use it without charge. It was one of the first services to offer on-demand archiving of pages, a feature later adopted by many other archiving services, such as archive.today and the Wayback Machine. It does not do web page crawling.

History

Conceived in 1997 by Gunther Eysenbach, WebCite was publicly described the following year, when an article on Internet quality control suggested that such a service could also measure the citation impact of web pages. [3] The next year, a pilot service was set up at the address webcite.net. Although demand for WebCite seemed to fall once Google Cache began offering short-term copies of web pages and the Internet Archive expanded its crawling (which had started in 1996), [4] WebCite remained the only service allowing "on-demand" archiving by users. WebCite also offered interfaces to scholarly journals and publishers to automate the archiving of cited links. By 2008, over 200 journals had begun routinely using WebCite. [5]

WebCite used to be, but is no longer, a member of the International Internet Preservation Consortium. [1] In a 2012 message on Twitter, Eysenbach commented that "WebCite has no funding, and IIPC charges €4000 per year in annual membership fees." [6]

WebCite "feeds its content" to other digital preservation projects, including the Internet Archive. [1] Lawrence Lessig, an American academic who writes extensively on copyright and technology, used WebCite in his amicus brief in the Supreme Court of the United States case of MGM Studios, Inc. v. Grokster, Ltd. [7]

Fundraising

WebCite ran a fund-raising campaign on FundRazr from January 2013 with a target of $22,500, a sum its operators stated was needed to maintain and modernize the service beyond the end of 2013. [8] The funds were to cover relocating the service to Amazon EC2 cloud hosting as well as legal support. As of 2013, it remained undecided whether WebCite would continue as a non-profit or a for-profit entity. [9]

Usage

WebCite allows on-demand prospective archiving. It is not crawler-based; pages are archived only if a citing author or publisher requests it. No cached copy will appear in a WebCite search unless the author or another person has specifically cached the page beforehand.

To initiate the caching and archiving of a page, an author may use WebCite's "archive" menu option or a WebCite bookmarklet, which lets web surfers cache pages just by clicking a button in their bookmarks folder. [10]

One can retrieve or cite archived pages through a transparent format such as

http://webcitation.org/query?url=URL&date=DATE

where URL is the URL that was archived, and DATE indicates the caching date. For example,

http://webcitation.org/query?url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FMain_Page&date=2008-03-04

or the alternate short form http://webcitation.org/5W56XTY5h retrieves an archived copy of the URL http://en.wikipedia.org/wiki/Main_Page that is closest to the date of March 4, 2008. The ID (5W56XTY5h) is the UNIX time in base 62.
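As a sketch of the formats described above, the following Python snippet builds a query URL and decodes a short-form ID as a base-62 integer. The digit order of the base-62 alphabet and the exact unit of the decoded timestamp are assumptions; this article does not specify them.

```python
from urllib.parse import quote

def webcite_query_url(url: str, date: str) -> str:
    """Build a WebCite query URL for a given page and caching date."""
    return f"http://webcitation.org/query?url={quote(url, safe='')}&date={date}"

# Assumed base-62 alphabet; WebCite does not document the digit order.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_decode(s: str) -> int:
    """Decode a base-62 string, such as a WebCite snapshot ID, into an integer."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(webcite_query_url("http://en.wikipedia.org/wiki/Main_Page", "2008-03-04"))
print(base62_decode("5W56XTY5h"))  # an integer timestamp; the unit is not confirmed here
```

The first call reproduces the percent-encoded example URL shown above.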

WebCite does not archive pages that contain a no-cache tag, respecting the author's request not to have the page cached.

One can archive a page by simply navigating in their browser to a link formatted like this:

http://webcitation.org/archive?url=urltoarchive&email=youremail

replacing urltoarchive with the full URL of the page to be archived, and youremail with their e-mail address. This is how the WebCite bookmarklet works. [11] For comparison, the Wayback Machine uses addresses of the form:

https://web.archive.org/urltoarchive
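The WebCite archive-request format above can be sketched in a few lines of Python; this only constructs the URL under the format shown in the text and does not actually send a request.

```python
from urllib.parse import urlencode

def webcite_archive_url(page_url: str, email: str) -> str:
    """Build a WebCite archive-request URL from a page URL and an e-mail address."""
    return "http://webcitation.org/archive?" + urlencode({"url": page_url, "email": email})

print(webcite_archive_url("http://example.com/article", "reader@example.com"))
```

Reserved characters in the page URL and e-mail address are percent-encoded by urlencode, as in the example query URL earlier in this section.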

Once a page is archived on WebCite, users can create an independent second-level backup by submitting the resulting WebCite URL to web.archive.org and to archive.is. A browser add-on for archiving makes this more convenient. [12]
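A minimal sketch of that second-level backup step, assuming the Wayback Machine's "Save Page Now" URL prefix (https://web.archive.org/save/) as the submission interface; again, no request is sent here.

```python
def wayback_save_url(snapshot_url: str) -> str:
    """Build a Wayback Machine 'Save Page Now' URL for a WebCite snapshot."""
    return "https://web.archive.org/save/" + snapshot_url

print(wayback_save_url("http://webcitation.org/5W56XTY5h"))
```

Opening the resulting address in a browser asks the Wayback Machine to take its own copy of the WebCite snapshot.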

Business model

The term "WebCite" is a registered trademark. [13] WebCite does not charge individual users, journal editors, or publishers [14] any fee for its service. It earns revenue from publishers who want to "have their publications analyzed and cited webreferences archived", [1] and accepts donations. Early support came from the University of Toronto. [1]

WebCite maintains the legal position that its archiving activities [5] are permitted by the copyright doctrines of fair use and implied license. [1] To support the fair-use argument, WebCite notes that its archived copies are transformative, socially valuable for academic research, and not harmful to the market value of any copyrighted work. [1] WebCite argues that caching and archiving web pages is not copyright infringement when the archiver offers the copyright owner an opportunity to "opt out" of the archive system, thus creating an implied license. [1] To that end, WebCite will not archive pages bearing "do-not-cache" or "no-archive" metadata, or pages excluded under the robot exclusion standard; the absence of such directives, it argues, creates an "implied license" for web archive services to preserve the content. [1]

In a similar case involving Google's web caching activities, on January 19, 2006, the United States District Court for the District of Nevada agreed with that argument in the case of Field v. Google (CV-S-04-0413-RCJ-LRL), holding that fair use and an "implied license" meant that Google's caching of Web pages did not constitute copyright violation. [1] The "implied license" referred to general Internet standards. [1]

DMCA requests

According to its policy, after receiving a legitimate DMCA request from a copyright holder, WebCite removes the saved pages from public access; until such a request, archived pages remain under the safe harbor of being citations. Removed pages are moved to a "dark archive", and in cases of legal controversy or evidence requests, pay-per-view access to the copyrighted content is available at "$200 (up to 5 snapshots) plus $100 for each further 10 snapshots". [15]
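Under one reading of that fee schedule (an assumption; the policy text does not spell out how partial blocks are rounded), the access fee could be computed as:

```python
import math

def dark_archive_fee(snapshots: int) -> int:
    """Fee in USD for pay-per-view dark-archive access: $200 covers up to 5
    snapshots, plus $100 per further block of 10 (assumed interpretation)."""
    if snapshots <= 0:
        raise ValueError("snapshot count must be positive")
    if snapshots <= 5:
        return 200
    return 200 + 100 * math.ceil((snapshots - 5) / 10)

print(dark_archive_fee(3))   # 200
print(dark_archive_fee(15))  # 300
```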


References

  1. "WebCite Consortium FAQ". WebCitation.org. WebCite, via archive.org.
  2. "WebCite 17th July 2019". July 17, 2019. Archived from the original on July 17, 2019. Retrieved January 17, 2021.
  3. Eysenbach, Gunther; Diepgen, Thomas L. (November 28, 1998). "Towards quality management of medical information on the internet: evaluation, labelling, and filtering of information". The BMJ. 317 (7171): 1496–1502. doi:10.1136/bmj.317.7171.1496. ISSN 0959-8146. OCLC 206118688. PMC 1114339. PMID 9831581. BL Shelfmark 2330.000000.
  4. "Fixing Broken Links on the Internet". Internet Archive blog, October 25, 2013.
  5. Eysenbach, Gunther; Trudel, Mathieu (2005). "Going, Going, Still There: Using the WebCite Service to Permanently Archive Cited Web Pages". Journal of Medical Internet Research. 7 (5): e60. doi:10.2196/jmir.7.5.e60. ISSN 1438-8871. OCLC 107198227. PMC 1550686. PMID 16403724.
  6. "Twitter post". June 11, 2012. Archived from the original on March 5, 2016. Retrieved March 10, 2013.
  7. Cohen, Norm (January 29, 2007). "Courts Turn to Wikipedia, but Selectively". The New York Times.
  8. "Fund WebCite". Wikimedia Foundation. Retrieved December 6, 2013.
  9. "Conversation between GiveWell and Webcite on 4/10/13" (PDF). GiveWell. Retrieved October 18, 2009. "Dr. Eysenbach is trying to decide whether Webcite should continue as a non-profit project or a business with revenue streams built into the system."
  10. "WebCite Best Practices Guide" (PDF).
  11. "WebCite Bookmarklet". WebCitation.org. WebCite. Retrieved May 14, 2017.
  12. "GitHub – rahiel/archiveror: Archiveror will help you preserve the webpages you love". GitHub. Retrieved December 12, 2018.
  13. "WebCite Legal and Copyright Information". WebCitation.org. WebCite. Retrieved June 16, 2009.
  14. "WebCite Member List". WebCitation.org. WebCite Consortium. Retrieved June 16, 2009. "Membership is currently free".
  15. "WebCite takedown requests policy". WebCitation.org. WebCite. Archived from the original on April 22, 2021. Retrieved May 14, 2017.