Search engine cache

The link for the cached version of a web page in search results from Google (top), Bing (middle) and Yandex (bottom)

A search engine cache is a cache of web pages that shows each page as it was when it was indexed by a web crawler. Cached versions of web pages can be used to view the contents of a page when the live version cannot be reached, has been altered, or has been taken down.[1]

A web crawler collects the contents of a web page, which is then indexed by a web search engine. The search engine might make this copy accessible to users. Crawlers that obey restrictions placed by the site's webmaster in robots.txt[2] or meta tags[3] will not make a cached copy available to search engine users when instructed not to.
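The check itself is mechanical. As an illustrative sketch in Python (the URL is a placeholder, and real crawlers also honor crawler-specific forms such as a googlebot meta name), a crawler might look for a noarchive directive in both the X-Robots-Tag HTTP header and the robots meta tag before offering a cached copy:

    # Sketch: decide whether a page may be offered as a cached copy by
    # looking for a "noarchive" directive. The regex scan is crude; a
    # production crawler would use a real HTML parser.
    import re
    from urllib.request import urlopen

    def may_cache(url):
        with urlopen(url) as resp:
            header = resp.headers.get("X-Robots-Tag", "")
            html = resp.read().decode("utf-8", errors="replace")
        # crude scan for <meta name="robots" content="... noarchive ...">
        meta = re.findall(
            r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
            html, flags=re.IGNORECASE)
        directives = ",".join(meta + [header]).lower()
        return "noarchive" not in directives

    print(may_cache("https://example.com/"))  # placeholder URL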

Search engine caches can be used for crime investigation,[4] legal proceedings[5] and journalism.[6][1] Examples of search engines that offer their users cached versions of web pages are Bing, Yandex Search, and Baidu.

Search engine caches may not be fully protected by the usual laws that protect technology providers from copyright infringement claims.[7]

Google retired its web caching service in 2024. The service was designed to let users reach websites that appeared in a Google search result but were temporarily offline; it was not designed for long-term or even medium-term archiving. Google said that the Internet of 2024 is far more reliable than it was in its early days, so the cache was no longer an important service to maintain. Google pointed to the Wayback Machine as a better alternative and suggested that it might work with the Internet Archive in the future.[8]


Related Research Articles

Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head elements and attributes.

In computing, a search engine is an information retrieval software system designed to help find information stored on one or more computer systems. Search engines discover, crawl, transform, and store information for retrieval and presentation in response to user queries. The search results are usually presented in a list and are commonly called hits. The most widely used type of search engine is a web search engine, which searches for information on the World Wide Web.

Web crawler: Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.
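As a rough sketch of that systematic browsing, the core of a crawler is a fetch-extract-enqueue loop; politeness rules, robots.txt checks, and robust HTML parsing are omitted here, and the seed URL is a placeholder:

    # Sketch of the crawl loop: fetch a page, extract its links, and
    # queue unseen ones until a page limit is reached.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed, limit=20):
        queue, seen = deque([seed]), {seed}
        pages = {}
        while queue and len(pages) < limit:
            url = queue.popleft()
            try:
                with urlopen(url) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            pages[url] = html  # the copy a search engine would index (and cache)
            for href in re.findall(r'href=["\']([^"\']+)["\']', html):
                absolute = urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

    pages = crawl("https://example.com/")  # placeholder seed URL
    print(len(pages), "pages fetched")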

Spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system.

robots.txt: Internet protocol

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
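Python's standard library includes a parser for this protocol, so the pre-fetch check a crawler performs can be sketched in a few lines (the URLs and user-agent name are placeholders):

    # Sketch of a Robots Exclusion Protocol check before fetching a page.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetches and parses the robots.txt file

    # can_fetch(useragent, url) applies the rules that match this crawler
    print(parser.can_fetch("MyCrawler", "https://example.com/private/page.html"))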

Search engine optimization (SEO) is the process of improving the quality and quantity of traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

Googlebot: Web crawler used by Google

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

The deep web, invisible web, or hidden web are parts of the World Wide Web whose contents are not indexed by standard web search-engine programs. This is in contrast to the "surface web", which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with coining the term in 2001 as a search-indexing term.

Dogpile: Metasearch engine

Dogpile is a metasearch engine for information on the World Wide Web that fetches results from Google, Yahoo!, Yandex, Bing, and other popular search engines, including audio and video content providers.

Metasearch engine: Online information retrieval tool

A metasearch engine is an online information retrieval tool that uses the data of other web search engines to produce its own results. Metasearch engines take input from a user and immediately query several search engines; the returned data is gathered, ranked, and presented to the user.
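One simple way to perform that ranking step is reciprocal rank fusion, sketched here with invented result lists:

    # Sketch: merge ranked result lists from several engines with
    # reciprocal rank fusion (RRF). The result lists are invented.
    from collections import defaultdict

    def fuse(result_lists, k=60):
        scores = defaultdict(float)
        for results in result_lists:
            for rank, url in enumerate(results, start=1):
                scores[url] += 1.0 / (k + rank)  # higher ranks score more
        return sorted(scores, key=scores.get, reverse=True)

    engine_a = ["https://a.example/", "https://b.example/", "https://c.example/"]
    engine_b = ["https://b.example/", "https://a.example/", "https://d.example/"]
    print(fuse([engine_a, engine_b]))  # a and b, found by both engines, rank first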

The noindex value of an HTML robots meta tag requests that automated Internet bots avoid indexing a web page. Reasons to use this meta tag include advising robots not to index a very large database, pages that are very transitory, pages that are under development, pages that one wishes to keep slightly more private, or the printer- and mobile-friendly versions of pages. Since the burden of honoring a website's noindex tag lies with the author of the search robot, these tags are sometimes ignored. The interpretation of the noindex tag also differs slightly from one search engine company to the next.

Yahoo! Search is a search engine owned and operated by Yahoo!, using Microsoft Bing to power results.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
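A sketch of how a crawler might read the URL and last-modified entries out of a sitemap with Python's standard library (the sitemap location is a placeholder):

    # Sketch: list the URL and lastmod entries of a sitemap.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urlopen("https://example.com/sitemap.xml") as resp:
        tree = ET.parse(resp)

    for entry in tree.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", default="?", namespaces=NS)
        print(lastmod, loc)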

nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow setting allows website authors to indicate that the presence of a link is not an endorsement of the target site's importance.
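A sketch of how a link extractor might honor the setting, keeping only the links that may pass ranking signal (the HTML snippet is invented):

    # Sketch: collect links, skipping those marked rel="nofollow".
    from html.parser import HTMLParser

    class FollowableLinks(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and "href" in attrs:
                rels = (attrs.get("rel") or "").lower().split()
                if "nofollow" not in rels:
                    self.links.append(attrs["href"])

    parser = FollowableLinks()
    parser.feed('<a href="/about">About</a>'
                '<a rel="nofollow" href="https://ads.example/">Ad</a>')
    print(parser.links)  # ['/about'] -- the nofollow link is excluded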

Search engine: Software system for finding relevant information on the Web

A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user inputs a query within a web browser or a mobile app, and the search results are often a list of hyperlinks, accompanied by textual summaries and images. Users also have the option of limiting the search to a specific type of results, such as images, videos, or news.

A search engine results page (SERP) is a webpage that is displayed by a search engine in response to a query by a user. The main component of a SERP is the listing of results that are returned by the search engine in response to a keyword query.

Bing Webmaster Tools: Tool to provide better indexing and search performance on Bing

Bing Webmaster Tools is a free service, part of Microsoft's Bing search engine, that allows webmasters to add their websites to the Bing index crawler and monitor their site's performance in Bing. The service also offers tools to troubleshoot the crawling and indexing of a website, submit new URLs and Sitemaps, view website statistics, and consolidate content submission, along with content and community resources.

Field v. Google, Inc., 412 F.Supp. 2d 1106 is a case where Google Inc. successfully defended a lawsuit for copyright infringement. Field argued that Google infringed his exclusive right to reproduce his copyrighted works when it "cached" his website and made a copy of it available on its search engine. Google raised multiple defenses: fair use, implied license, estoppel, and Digital Millennium Copyright Act safe harbor protection. The court granted Google's motion for summary judgment and denied Field's motion for summary judgment.

Wayback Machine: Digital archive by the Internet Archive

The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, an American nonprofit organization based in San Francisco, California. Created in 1996 and launched to the public in 2001, it allows the user to go "back in time" to see how websites looked in the past. Its founders, Brewster Kahle and Bruce Gilliat, developed the Wayback Machine to provide "universal access to all knowledge" by preserving archived copies of defunct web pages.

MetaCrawler is a metasearch engine. It is a registered trademark of InfoSpace and was created by Erik Selberg.

References

  1. Wilfried Ruetten (2012). The Data Journalism Handbook. O'Reilly Media, Inc. ISBN 9781449330064. "When a page becomes controversial, the publishers may take it down or alter it without acknowledgment. If you suspect you're running into the problem, the first place to turn is Google's cache of the page as it was when it did its last crawl."
  2. "Robots meta tag, data-nosnippet, and X-Robots-Tag specifications". "noarchive: Do not show a cached link in search results."
  3. "Special tags that Google understands - Search Console Help". "noarchive - Don't show a Cached link for a page in search results."
  4. Todd G. Shipley, Art Bowker (2013). Investigating Internet Crimes: An Introduction to Solving Crimes in Cyberspace. Newnes. ISBN 9780124079298. "For the investigator this can be a valuable piece of information. Depending on when Google crawled the site, the last page may contain information different from the current page. Documenting and capturing Google's cached page of a webpage can therefore be an important step to ensure this time snapshot is preserved."
  5. Steven Mark Levy (2011). Regulation of Securities: SEC Answer Book. Aspen Publishers Online. ISBN 9781454805434. "The World Wide Web is not as ephemeral as one might think. An increasing number of older web pages are available online through such services as the Wayback Machine, Yahoo Cache, or Bing Cache. Some plaintiffs' lawyers and corporate gadflies use these services as a matter of routine."
  6. Cleland Thom (2014-10-23). "Google's caches and .com search engine provide 'right to be forgotten' solutions". Press Gazette. "Journalists can also access delisted content via the Google cache."
  7. Herman De Bauw, Valerie Vandenweghe (June 2011). "Brussels Court of Appeal upholds judgment against Google News and Google Cache". Archived from the original on 2015-04-26. "For the cache function, the Court rejected the exception of a 'technically necessary copy'. This exception exempts temporary reproduction which is a necessary part of a technical process applied by an intermediary for transmission in a network between third parties. According to the Court, the cache copy that Google stores on its server is not technically necessary for efficient transmission."
  8. "Google Search's cache links are officially being retired". 2 February 2024.