Sitemaps

Sitemaps is an XML-based protocol that allows a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is relative to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

History

Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. [1] Google, Yahoo! and Microsoft announced joint support for the Sitemaps protocol in November 2006. [2] The schema version was changed to "Sitemap 0.90", but no other changes were made.

In April 2007, Ask.com and IBM announced support for Sitemaps. [3] Google, Yahoo!, and MSN also announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites. [4]

The Sitemaps protocol is based on ideas [5] from "Crawler-friendly Web Servers," [6] with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages.

Purpose

Sitemaps are particularly beneficial on websites where:

  • some areas of the website are not available through the browsable interface
  • webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines
  • the site is very large, so web crawlers might overlook some of the new or recently updated content
  • pages are isolated or not well linked together
  • the site has few external links

File format

The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. A Sitemap can also be just a plain text list of URLs, and files in either format can be compressed with gzip (.gz).

A sample Sitemap that contains just one URL and uses all optional tags is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. The maximum Sitemap size of 50 MiB or 50,000 URLs [8] makes this necessary for large sites.

An example of a Sitemap index referencing one separate sitemap follows.

<?xml version="1.0" encoding="UTF-8"?><sitemapindexxmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><sitemap><loc>https://www.example.com/sitemap1.xml.gz</loc><lastmod>2014-10-01T18:23:17+00:00</lastmod></sitemap></sitemapindex>

Element definitions

The definitions for the elements are shown below: [8]

<urlset> (required): The document-level (root) element for the Sitemap. The rest of the document after the '<?xml version>' declaration must be contained within it.

<url> (required): Parent element for each URL entry.

<sitemapindex> (required): The document-level (root) element for the Sitemap index. The rest of the document after the '<?xml version>' declaration must be contained within it.

<sitemap> (required): Parent element for each entry in the index.

<loc> (required): The full URL of the page or sitemap, including the protocol (e.g. http, https) and a trailing slash if required by the site's hosting server. The value must be shorter than 2,048 characters, and ampersands in the URL must be escaped as &amp;.

<lastmod> (optional): The date the file was last modified, in ISO 8601 format. It may include the full date and time or, if desired, just the date in YYYY-MM-DD format.
<changefreq> (optional): How frequently the page may change:
  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

"Always" is used to denote documents that change each time that they are accessed. "Never" is used to denote archived URLs (i.e. files that will not be changed again).

This is used only as a guide for crawlers, and is not used to determine how frequently pages are indexed.

Does not apply to <sitemap> elements.

<priority> (optional): The priority of that URL relative to other URLs on the site. This allows webmasters to suggest to crawlers which pages are considered more important.

The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5.

Rating all pages on a site with a high priority does not affect search listings, as it is only used to suggest to the crawlers how important pages of the site are to one another.

Does not apply to <sitemap> elements.

Support for the elements that are not required can vary from one search engine to another. [8]

Other formats

Text file

The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well; the file must be UTF-8 encoded, and cannot be more than 50 MiB (uncompressed) or contain more than 50,000 URLs. Sitemaps that exceed these limits should be broken up into multiple sitemaps with a sitemap index file (a file that points to multiple sitemaps). [9]
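
A minimal text Sitemap is simply one fully qualified URL per line; the URLs below are placeholders:

https://www.example.com/
https://www.example.com/about
https://www.example.com/contact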

Syndication feed

A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method may only provide crawlers with recently created URLs, but other URLs can still be discovered during normal crawling. [8]

It can be beneficial to have a syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap.
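
As a sketch, a feed used this way might contain only the newest pages. The following minimal RSS 2.0 feed (all URLs, titles, and dates are placeholders) advertises a single recently added URL:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example.com updates</title>
    <link>https://www.example.com/</link>
    <description>Recently added and updated pages</description>
    <item>
      <title>New page</title>
      <link>https://www.example.com/new-page</link>
      <pubDate>Wed, 01 Oct 2014 18:23:17 +0000</pubDate>
    </item>
  </channel>
</rss>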

Search engine submission

If a Sitemap is submitted directly to a search engine (pinged), the search engine returns status information and any processing errors. The details involved with submission vary between search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:

Sitemap: <sitemap_location>

The <sitemap_location> should be the complete URL to the sitemap, such as:

https://www.example.org/sitemap.xml

This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt, or the URL can simply point to the main sitemap index file.
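
For example, a robots.txt file advertising two sitemaps (hypothetical URLs) could look like this:

User-agent: *
Disallow:

Sitemap: https://www.example.org/sitemap.xml
Sitemap: https://www.example.org/news-sitemap.xml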

The following table lists the sitemap submission URLs for a few major search engines:

Search engine    | Submission URL                                    | Help page                  | Market
Baidu            | https://zhanzhang.baidu.com/dashboard/index       | Baidu Webmaster Dashboard  | China, Singapore
Bing (and Yahoo!)| https://www.bing.com/webmaster/ping.aspx?siteMap= | Bing Webmaster Tools       | Global
Google           | https://www.google.com/ping?sitemap=              | Build and Submit a Sitemap | Global
Yandex           | https://webmaster.yandex.com/site/map.xml         | Sitemaps files             | Russia, Belarus, Kazakhstan, Turkey

Sitemap URLs submitted via these submission URLs need to be URL-encoded; for example, : (colon) becomes %3A and / (slash) becomes %2F. [8]
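
A minimal sketch of this encoding using Python's standard library (the sitemap URL is a placeholder; the Bing endpoint from the table above is used for illustration):

from urllib.parse import quote

sitemap_url = "https://www.example.org/sitemap.xml"  # placeholder sitemap location

# Percent-encode the reserved characters (':' becomes %3A, '/' becomes %2F)
# before appending the sitemap URL to the submission endpoint.
ping_url = "https://www.bing.com/webmaster/ping.aspx?siteMap=" + quote(sitemap_url, safe="")
print(ping_url)
# https://www.bing.com/webmaster/ping.aspx?siteMap=https%3A%2F%2Fwww.example.org%2Fsitemap.xml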

Limitations for search engine indexing

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results. Specific examples are provided below.

Sitemap limits

Sitemap files are limited to 50,000 URLs and 50 MiB (52,428,800 bytes) each. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. A Sitemap index file may not list more than 50,000 Sitemaps, must be no larger than 50 MiB, and can likewise be compressed. More than one Sitemap index file is allowed. [8]

As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).
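
A minimal sketch of how a large site might stay within these limits: split the URL list into gzip-compressed sitemaps of at most 50,000 entries each and reference them from an index. The file names and base URL below are assumptions.

import gzip
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file URL limit from the protocol

def write_sitemaps(urls, base_url="https://www.example.com/"):
    """Write gzip-compressed sitemaps plus a sitemap index referencing them."""
    names = []
    for start in range(0, len(urls), MAX_URLS):
        name = f"sitemap{start // MAX_URLS + 1}.xml.gz"
        # escape() applies the required entity escaping (&, <, >) to each URL.
        entries = "".join(
            f"<url><loc>{escape(u)}</loc></url>"
            for u in urls[start:start + MAX_URLS]
        )
        with gzip.open(name, "wt", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>'
                    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                    f"{entries}</urlset>")
        names.append(name)
    # The index refers only to sitemaps, never to other sitemap indexes.
    index = "".join(
        f"<sitemap><loc>{escape(base_url + n)}</loc></sitemap>" for n in names
    )
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>'
                '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                f"{index}</sitemapindex>")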

Best practice for optimising a sitemap index for search engine crawlability is to ensure the index refers only to sitemaps as opposed to other sitemap indexes. Nesting a sitemap index within a sitemap index is invalid according to Google. [11]

Additional sitemap types

A number of additional XML sitemap types outside of the scope of the Sitemaps protocol are supported by Google to allow webmasters to provide additional data on the content of their websites. Video and image sitemaps are intended to improve the capability of websites to rank in image and video searches. [12] [13]

Video sitemaps

Video sitemaps indicate data related to embedding and autoplaying, preferred thumbnails to show in search results, publication date, video duration, and other metadata. [13] Video sitemaps are also used to allow search engines to index videos that are embedded on a website, but that are hosted externally, such as on Vimeo or YouTube.
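
A minimal sketch of a video sitemap entry, using Google's video sitemap namespace; all URLs and values are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://www.example.com/videos/intro</loc>
    <video:video>
      <video:thumbnail_loc>https://www.example.com/thumbs/intro.jpg</video:thumbnail_loc>
      <video:title>Introductory video</video:title>
      <video:description>Placeholder description of the video.</video:description>
      <video:content_loc>https://www.example.com/media/intro.mp4</video:content_loc>
      <video:duration>120</video:duration>
      <video:publication_date>2014-10-01T18:23:17+00:00</video:publication_date>
    </video:video>
  </url>
</urlset>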

Image sitemaps

Image sitemaps are used to indicate image metadata, such as licensing information, geographic location, and an image's caption. [12]
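
A minimal sketch of an image sitemap entry, using Google's image sitemap namespace; all URLs and values are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.example.com/gallery</loc>
    <image:image>
      <image:loc>https://www.example.com/photos/sunset.jpg</image:loc>
      <image:caption>Placeholder caption for the image</image:caption>
      <image:license>https://www.example.com/image-license</image:license>
    </image:image>
  </url>
</urlset>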

Google News Sitemaps

Google supports a Google News sitemap type for facilitating quick indexing of time-sensitive news subjects. [14] [15]
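
A minimal sketch of a Google News sitemap entry, using Google's news sitemap namespace; the publication name, URLs, and dates are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/news/placeholder-article</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2014-10-01T18:23:17+00:00</news:publication_date>
      <news:title>Placeholder headline</news:title>
    </news:news>
  </url>
</urlset>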

Multilingual and multinational sitemaps

In December 2011, Google announced annotations for sites that want to target users in many languages and, optionally, countries. A few months later, Google announced on its official blog [16] that it was adding support for specifying the rel="alternate" and hreflang annotations in Sitemaps. Compared with HTML link elements, until then the only option, the Sitemaps option offers advantages including a smaller page size and easier deployment for some websites.

An example of a multilingual sitemap follows.

Suppose a site targets English-language users through https://www.example.com/en and Greek-language users through https://www.example.com/gr (el is the ISO 639-1 code for Greek). Until this change, the only option was to add the hreflang annotation either in the HTTP header or as HTML link elements on both URLs, like this:

<linkrel="alternate"hreflang="en"href="https://www.example.com/en"/><linkrel="alternate"hreflang="gr"href="https://www.example.com/gr"/>

But now, one can alternatively use the following equivalent markup in Sitemaps (the xhtml namespace, http://www.w3.org/1999/xhtml, must be declared on the enclosing <urlset> element):

<url>
  <loc>https://www.example.com/en</loc>
  <xhtml:link rel="alternate" hreflang="el" href="https://www.example.com/gr" />
  <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en" />
</url>
<url>
  <loc>https://www.example.com/gr</loc>
  <xhtml:link rel="alternate" hreflang="el" href="https://www.example.com/gr" />
  <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en" />
</url>

See also

  • robots.txt
  • Atom (web standard)
  • Site map
  • Google Search Console
  • Bing Webmaster Tools
  • Canonical link element
  • hreflang

References

  1. Shivakumar, Shiva (2005-06-02). "Google Blog: Webmaster-friendly". Archived from the original on 2005-06-08. Retrieved 2021-12-31.
  2. "Major Search Engines Unite to Support a Common Mechanism for Website Submission". News from Google. November 16, 2006. Retrieved 2021-12-31.
  3. Pathak, Vivek (2007-05-11). "The Ask.com Blog: Sitemaps Autodiscovery". Ask's Official Blog. Archived from the original on 2007-05-18. Retrieved 2021-12-31.
  4. "Information for Public Sector Organizations". Archived from the original on 2007-04-30.
  5. Nelson, M.L.; Smith, J.A.; del Campo; Van de Sompel, H.; Liu, X. (2006). "Efficient, Automated Web Resource Harvesting" (PDF). WIDM'06.
  6. Brandman, O.; Cho, J.; Garcia-Molina, Hector; Shivakumar, Narayanan (2000). "Crawler-friendly web servers". ACM SIGMETRICS Performance Evaluation Review, Volume 28, Issue 2. doi:10.1145/362883.362894.
  7. "Learn about sitemaps | Search Central". Google Developers. Retrieved 2021-06-01.
  8. "Sitemaps XML format". Sitemaps.org. 2016-11-21. Retrieved 2016-12-01.
  9. "Build and submit a sitemap - Search Console Help". Support.google.com. Retrieved 30 November 2020.
  10. "About Google Sitemaps". 2016-12-01. Retrieved 2016-12-01.
  11. "Sitemaps report - Search Console Help". support.google.com. Retrieved 2020-04-15.
  12. "Image Sitemaps". Google Search Console. Retrieved 28 December 2018.
  13. "Video Sitemaps". Google Search Console. Retrieved 28 December 2018.
  14. Bigby, Garenne. "Why You should be using a Google News Sitemap". Dyno Mapper. Retrieved 28 December 2018.
  15. "Google News Sitemaps". Google Search Console. Retrieved 28 December 2018.
  16. Far, Pierre (May 24, 2012). "Multilingual and multinational site annotations in Sitemaps". Google Webmaster Central Blog.