Noindex

The noindex value of an HTML robots meta tag requests that automated Internet bots avoid indexing a web page. [1] [2] Reasons to use this tag include keeping a very large database out of a search index, excluding pages that are transitory or still under development, keeping pages slightly more private, or excluding printer- and mobile-friendly versions of pages. Because honoring a website's noindex tag is left to the author of each search robot, the tag is sometimes ignored, and its interpretation can differ slightly from one search engine company to the next.
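To illustrate the robot's side of this contract, below is a minimal, hypothetical sketch in Python of how a crawler author might check for the directive before indexing a fetched page. It uses only the standard library; the names RobotsMetaParser and may_index are illustrative, and a real crawler would handle more cases (bot-specific tags, HTTP headers, malformed markup):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directive values of <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # A bot-specific crawler would also match its own name here,
        # e.g. "googlebot", in addition to the generic "robots" value.
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for value in content.split(","):
                self.directives.add(value.strip().lower())

def may_index(html):
    """Return False if the page carries a noindex (or none) directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(may_index(page))  # prints False: the page asked not to be indexed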

Noindexing entire pages

The meta tag is placed in the head section of the HTML document, for example:

<html>
<head>
<meta name="robots" content="noindex">
<title>Don't index this page</title>
</head>

Possible values for the meta tag's content attribute are "none", "all", "index", "noindex", "nofollow", and "follow". A combination of these values is also possible, [1] for example:

<meta name="robots" content="noindex, follow">

Bot-specific directives

The noindex directive can be restricted to specific bots by specifying a different "name" value in the meta tag. For example, to block only Google's bot, [3] specify:

<meta name="googlebot" content="noindex">

Or, to block Bing's bot, specify:

<meta name="bingbot" content="noindex">

Or, to block Baidu's bot, specify:

<meta name="baiduspider" content="noindex">

robots.txt file

A robots.txt file can be used to block crawling. Note that robots.txt controls crawling rather than indexing: a crawler that is blocked from fetching a page cannot see a noindex tag on it, and some search engines may still index a blocked URL based on links from other pages.
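For example, a minimal robots.txt that asks all bots not to crawl a directory (the /private/ path is illustrative):

User-agent: *
Disallow: /private/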

Noindexing part of a page

It is also possible to exclude part of a web page, for example navigation text, from being indexed, rather than the whole page. There are various techniques for doing this, and several can be used in combination. Google's main indexing spider, Googlebot, is not known to recognize any of these techniques.

<noindex> tag

The Russian search engine Yandex introduced a <noindex> tag, which prevents indexing of the content between the opening and closing tags. Because <noindex> is not standard HTML, the comment form <!--noindex--> can alternatively be used so that the source code still validates: [4]

<p>
  Do index this text.
  <noindex>Don't index this text.</noindex>
  <!--noindex-->Don't index this text.<!--/noindex-->
</p>

Other indexing spiders, such as Atomz, also recognize the <noindex> tag. [5]

Microformat

There is a 2005 draft microformats specification, the Robot Exclusion Profile, with the same functionality. It looks for the attribute and value class="robots-noindex" on HTML tags: [6]

<p>Do index this text.</p>
<div class="robots-noindex">Don't index this text.</div>
<span class="robots-noindex">Don't index this text.</span>
<p class="robots-noindex">Don't index this text.</p>

A combination of values is also possible, [6] for example:

<div class="robots-noindex robots-follow">Text.</div>

Yahoo!

In 2007, Yahoo! introduced similar functionality to the microformat into its spider. However, Yahoo!'s spider is incompatible with the microformat in that it looks for the value class="robots-nocontent", and only this value: [7]

<p>Do index this text.</p>
<div class="robots-nocontent">Don't index this text.</div>
<span class="robots-nocontent">Don't index this text.</span>
<p class="robots-nocontent">Don't index this text.</p>

SharePoint

SharePoint 2010’s iFilter excludes content inside a <div> tag with the attribute and value class="noindex". Inner <div>s were initially not excluded, but this may have changed. It is also unknown whether the attribute can be applied to tags other than <div>. [8]

<p>Do index this text.</p>
<div class="noindex">Don't index this text.</div>

Structured comments

Google Search Appliance

The Google Search Appliance uses structured comments: [9]

<p>
  Do index this text.
  <!--googleoff: all-->Don't index this text.<!--googleon: all-->
</p>

Other indexing spiders also use their own structured comments.

References

  1. Robots and the META element, official W3C specification.
  2. About the Robots <META> tag
  3. Using meta tags to block access to your site, Google Webmasters Tools Help
  4. "Using HTML tags". webmaster → help. Yandex. Section: <noindex> tag. Retrieved March 25, 2013.
  5. "General Search FAQ". Help. Atomz. 2013. Section: How do I exclude parts of my site from being searched?. Archived from the original on December 8, 2021. Retrieved March 23, 2013. Need to prevent parts of individual pages from being searched? If you want to exclude portions of a page from indexing, surround the text with <noindex> and </noindex> tags. This is useful, for example, if you want to exclude navigation text from searches.(registration required)
  6. Janes, Peter (June 18, 2005). "Robot Exclusion Profile". Microformats. Retrieved March 24, 2013.
  7. Garg, Priyank (May 2, 2007). "Introducing Robots-Nocontent for Page Sections". Yahoo! Search Blog. Yahoo!. Archived from the original on August 20, 2014. Retrieved March 23, 2013.
  8. "Control Search Indexing (Crawling) Within a Page with Noindex". Microsoft Developer. Microsoft. June 7, 2010. Archived from the original on November 4, 2017. Retrieved November 4, 2017.
  9. "Administering Crawl: Preparing for a Crawl". Google Search Appliance . Google Inc. August 23, 2012. Section: Excluding Unwanted Text from the Index. Archived from the original on November 23, 2012. Retrieved March 23, 2013.