Googlebot

Original author(s): Google
Type: Web crawler
Website: Googlebot FAQ

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. The name actually refers to two different types of web crawler: a desktop crawler, which simulates a user on a desktop computer, and a mobile crawler, which simulates a user on a mobile device. [1]

Behavior

A website will probably be crawled by both Googlebot Desktop and Googlebot Mobile. However, since September 2020 all sites have been switched to mobile-first indexing, meaning Google crawls the web primarily with the smartphone Googlebot. [2] The subtype of Googlebot can be identified from the user agent string in the request. However, both crawler types obey the same product token (user agent token) in robots.txt, so a developer cannot selectively target Googlebot Mobile or Googlebot Desktop using robots.txt.
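
Because the two crawlers differ only in their user agent strings, telling them apart means inspecting those strings. A minimal sketch in Python, assuming the general pattern of Google's published user agents (the exact substrings vary by version and should be checked against Google's documentation):

    # Classify a request by its user agent string (illustrative only).
    # "Googlebot" appears in both crawler types; the smartphone crawler's
    # string also advertises a mobile browser ("Android" / "Mobile").
    def classify_googlebot(user_agent: str) -> str:
        if "Googlebot" not in user_agent:
            return "not Googlebot"
        if "Android" in user_agent or "Mobile" in user_agent:
            return "Googlebot Smartphone"
        return "Googlebot Desktop"

    print(classify_googlebot(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    ))  # -> Googlebot Desktop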

Google provides various methods that enable website owners to manage the content displayed in Google's search results. If webmasters choose to restrict the information on their site available to Googlebot or another spider, they can do so with the appropriate directives in a robots.txt file, [3] or by adding the meta tag <meta name="Googlebot" content="nofollow" /> to the web page. [4] Googlebot requests to web servers are identifiable by a user agent string containing "Googlebot" and a host address containing "googlebot.com". [5]
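
Because any client can forge a Googlebot user agent string, Google recommends confirming the host address with a reverse DNS lookup followed by a forward lookup. A minimal sketch, assuming a standard DNS setup:

    # Verify that an IP claiming to be Googlebot really resolves into
    # Google's crawler domains: reverse-resolve the IP, check the domain,
    # then forward-resolve the name and compare it with the original IP.
    import socket

    def is_genuine_googlebot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)    # reverse lookup
        except OSError:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return socket.gethostbyname(hostname) == ip  # forward lookup
        except OSError:
            return False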

Currently, Googlebot follows HREF links and SRC links. [3] There is increasing evidence that Googlebot can execute JavaScript and parse content generated by Ajax calls as well. [6] There have been many theories about how advanced Googlebot's JavaScript processing is, with some suggesting only a minimal ability derived from custom interpreters. [7] Currently, Googlebot uses a web rendering service (WRS) based on the Chromium rendering engine (version 74 as of 7 May 2019). [8] Googlebot discovers pages by harvesting every link on every page it can find. Unless prohibited by a nofollow tag, it then follows these links to other web pages. New web pages must be linked from other known pages on the web in order to be crawled and indexed, or be manually submitted by the webmaster.
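
The harvesting step amounts to extracting every HREF and SRC reference from each fetched page and queueing the results for later crawling. A minimal sketch using Python's standard HTML parser (a real crawler would also resolve relative URLs, deduplicate, and honour robots.txt and nofollow):

    # Collect HREF and SRC references from a page so they can be queued.
    from html.parser import HTMLParser

    class LinkHarvester(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    harvester = LinkHarvester()
    harvester.feed('<a href="/about">About</a><img src="/logo.png">')
    print(harvester.links)  # ['/about', '/logo.png']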

A problem that webmasters with low-bandwidth web hosting plans[citation needed] have often noted with Googlebot is that it takes up an enormous amount of bandwidth.[citation needed] This can cause websites to exceed their bandwidth limit and be taken down temporarily. This is especially troublesome for mirror sites, which host many gigabytes of data. Google provides Search Console, which allows website owners to throttle the crawl rate. [9]
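
Beyond Search Console, Google's crawler documentation notes that 500, 503, or 429 responses cause Googlebot to slow its crawling. The sketch below is an illustration of that idea only, with a hypothetical bandwidth-budget flag: the server answers 503 with a Retry-After hint while the budget is exhausted.

    # Ask crawlers to back off while a (hypothetical) bandwidth budget
    # is exhausted, by returning HTTP 503 with a Retry-After header.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    OVER_BUDGET = True  # would be driven by real bandwidth accounting

    class ThrottlingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if OVER_BUDGET:
                self.send_response(503)                  # service unavailable
                self.send_header("Retry-After", "3600")  # try again in an hour
                self.end_headers()
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Hello")

    # HTTPServer(("", 8080), ThrottlingHandler).serve_forever()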

How often Googlebot crawls a site depends on the crawl budget. Crawl budget is an estimation of how often a website is updated.[citation needed] Technically, Googlebot's development team (the Crawling and Indexing team) uses several internally defined terms to capture what "crawl budget" stands for. [10] Since May 2019, Googlebot has used the latest Chromium rendering engine, which supports ECMAScript 6 features. This makes the bot more "evergreen" and ensures that it does not rely on a rendering engine that is outdated compared to browser capabilities. [8]

Mediabot

Mediabot is the web crawler Google uses to analyze page content so that Google AdSense can serve contextually relevant advertising on a web page. Mediabot identifies itself with the user agent string "Mediapartners-Google/2.1".

Unlike other crawlers, Mediabot does not follow links to discover new crawlable URLs; it only visits URLs that include the AdSense code. [11] Where that content resides behind a login, the crawler can be given credentials so that it is able to crawl protected content. [12]
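
Because Mediabot obeys its own product token, a site can admit it while restricting other crawlers. A minimal sketch using Python's standard robots.txt parser; the rules are illustrative, not a recommendation:

    # Parse illustrative robots.txt rules and test them against two agents.
    from urllib import robotparser

    rules = """
    User-agent: Mediapartners-Google
    Allow: /

    User-agent: *
    Disallow: /private/
    """

    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    print(rp.can_fetch("Mediapartners-Google", "/private/page.html"))  # True
    print(rp.can_fetch("SomeOtherBot", "/private/page.html"))          # False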

Inspection Tool Crawlers

InspectionTool is the crawler used by Google's search testing tools, such as the Rich Results Test and URL Inspection in Google Search Console. Apart from the user agent and user agent token, it mimics Googlebot. [13]

A guide to the crawlers was independently published. [14] Based on Web server directory index data, it details four distinct crawler agents: one non-Chrome and three Chrome-based crawlers.

Related Research Articles

Web crawler – Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

robots.txt – Internet protocol

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow for users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.

In computing, the User-Agent header is an HTTP header intended to identify the user agent responsible for making a given HTTP request. Whereas the character sequence User-Agent is the name of the header itself, the value a given user agent uses to identify itself is colloquially known as its user agent string. A user agent has encoded within its rules of behavior the knowledge of how to negotiate its half of a request–response transaction; it thus plays the role of the client in a client–server system. Because it is often useful to identify and distinguish the software facilitating a network session, the User-Agent header exists to identify the client software to the responding server.
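
For example, a client typically sets this header when issuing a request. A minimal sketch in Python; the agent name is hypothetical:

    # Issue a request that identifies the client via the User-Agent header.
    import urllib.request

    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "ExampleCrawler/1.0 (+https://example.com/bot)"},
    )
    with urllib.request.urlopen(req) as resp:   # network call
        print(resp.headers.get("Content-Type"))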

Doorway pages are web pages that are created for the deliberate manipulation of search engine indexes (spamdexing). A doorway page will affect the index of a search engine by inserting results for particular phrases while sending visitors to a different page. Doorway pages that redirect visitors without their knowledge use some form of cloaking. This usually falls under Black Hat SEO.

Mobile browser – Web browser designed for use on mobile devices

A mobile browser is a web browser designed for use on a mobile device such as a mobile phone, PDA, smartphone, or tablet. Mobile browsers are optimized to display web content most effectively on small screens on portable devices. Some mobile browsers, especially older versions, are designed to be small and efficient to accommodate the low memory capacity and low bandwidth of certain wireless handheld devices. Traditional smaller feature phones use stripped-down mobile web browsers; however, most current smartphones have full-fledged browsers that can handle the latest web technologies, such as CSS 3, JavaScript, and Ajax.

The noindex value of an HTML robots meta tag requests that automated Internet bots avoid indexing a web page. Reasons to use this meta tag include advising robots not to index a very large database, web pages that are very transitory, web pages that are under development, web pages that one wishes to keep slightly more private, or the printer- and mobile-friendly versions of pages. Since the burden of honoring a website's noindex tag lies with the author of the search robot, these tags are sometimes ignored. Also, the interpretation of the noindex tag sometimes differs slightly from one search engine company to the next.

Link prefetching allows web browsers to pre-load resources. This speeds up both the loading and rendering of web pages. The prefetch link relation was standardized in HTML5.

A sitemap is a list of pages of a web site within a domain.

In web archiving, an archive site is a website that stores information on webpages from the past for anyone to view.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
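
As an illustration of the format, the sketch below emits a one-URL Sitemaps file with Python's standard XML library; the URL and hint values are placeholders:

    # Build a minimal Sitemaps-protocol document: one <url> entry carrying
    # the optional lastmod / changefreq / priority hints described above.
    import xml.etree.ElementTree as ET

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = "https://example.com/"
    ET.SubElement(url, "lastmod").text = "2024-01-01"
    ET.SubElement(url, "changefreq").text = "weekly"
    ET.SubElement(url, "priority").text = "0.8"
    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)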

A spider trap is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.

nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow setting allows website authors to indicate that the presence of a link is not an endorsement of the target site's importance.

Search engine – Software system for finding relevant information on the Web

A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user inputs a query within a web browser or a mobile app, and the search results are often a list of hyperlinks, accompanied by textual summaries and images. Users also have the option of limiting the search to a specific type of results, such as images, videos, or news.

Google Search Console is a web service by Google which allows webmasters to check indexing status, search queries, and crawling errors, and to optimize the visibility of their websites.

Bing Webmaster Tools

Bing Webmaster Tools is a free service, part of Microsoft's Bing search engine, which allows webmasters to add their websites to the Bing index crawler and see their site's performance in Bing. The service also offers tools for webmasters to troubleshoot the crawling and indexing of their website, submit new URLs, create and submit Sitemaps, use ping tools, view website statistics, consolidate content submission, and access new content and community resources.

A single-page application (SPA) is a web application or website that interacts with the user by dynamically rewriting the current web page with new data from the web server, instead of the default method of a web browser loading entire new pages. The goal is faster transitions that make the website feel more like a native app.

Bingbot – Web-crawling robot

Bingbot is a web-crawling robot deployed by Microsoft in October 2010 to supply Bing. It collects documents from the web to build a searchable index for the Bing search engine. It performs the same function as Google's Googlebot.

Search engine cache

Search engine cache is a cache of web pages that shows the page as it was when it was indexed by a web crawler. Cached versions of web pages can be used to view the contents of a page when the live version cannot be reached, has been altered or taken down.

References

  1. "Googlebot". Google. 2019-03-11. Retrieved 2019-03-11.
  2. "Announcing mobile first indexing for the whole web". Google Developers. Retrieved 2021-03-17.
  3. 1 2 "Google Search Console". Google.com.
  4. "Google Search Console". search.google.com. Retrieved 2019-03-11.
  5. "What is Googlebot | Google Search Central | Documentation". May 2022.
  6. "Understand the JavaScript SEO basics | Search for Developers". Google Developers. Retrieved 2020-07-26.
  7. Splitt, Martin. "How Google Search indexes JavaScript sites - JavaScript SEO". YouTube. Archived from the original on 2021-12-12.
  8. 1 2 "The new evergreen Googlebot". Official Google Webmaster Central Blog. Retrieved 2019-06-07.
  9. "Google - Webmasters" . Retrieved 2012-12-15.
  10. "What Crawl Budget Means for Googlebot". Official Google Webmaster Central Blog. Retrieved 2018-07-04.
  11. "About the AdSense Crawler".
  12. "Display ads on login-protected pages".
  13. "Google Crawler (User Agent) Overview".
  14. "The Ultimate Guide to the New InspectionTool Crawlers".