Site map


A sitemap is a list of the pages of a website within a domain.


There are three primary kinds of sitemap: sitemaps used during the planning of a website by its designers; human-visible listings, typically hierarchical, of the pages on a site; and structured listings intended for web crawlers such as search engines.

Types of sitemaps

[Image: A sitemap of what links from the English Wikipedia's Main Page]

[Image: Sitemap of Google in 2006]

Sitemaps may be addressed to users or to software.

Many sites have user-visible sitemaps which present a systematic view, typically hierarchical, of the site. These are intended to help visitors find specific pages, and can also be used by crawlers. They also act as a navigation aid [1] by providing an overview of a site's content at a single glance. Alphabetically organized sitemaps, sometimes called site indexes, are a different approach.

For use by search engines and other crawlers, there is a structured format, the XML Sitemap, which lists the pages in a site, their relative importance, and how often they are updated. [2] This is pointed to from the robots.txt file and is typically called sitemap.xml. The structured format is particularly important for websites which include pages that are not accessible through links from other pages, but only through the site's search tools or by dynamic construction of URLs in JavaScript.
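For example, a single Sitemap line in robots.txt is enough to announce the sitemap's location to any crawler that reads the file; the URL below is illustrative and reuses the example.net domain from the sample later in this article:

    Sitemap: http://www.example.net/sitemap.xml

The Sitemap directive is not tied to any User-agent group, so it applies to all crawlers regardless of the exclusion rules elsewhere in the file.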

XML sitemaps

Google introduced the Sitemaps protocol so that web developers can publish lists of links from across their sites. The basic premise is that some sites have a large number of dynamic pages that are reachable only through forms and user input. Sitemap files list the URLs of these pages so that web crawlers can find them. Bing, Google, Yahoo and Ask now jointly support the Sitemaps protocol.

Since the major search engines use the same protocol, [3] a single Sitemap gives all of them up-to-date information about a site's pages. Sitemaps do not guarantee that all links will be crawled, and being crawled does not guarantee indexing. [4] Google Webmaster Tools allows a website owner to upload a sitemap for Google to crawl, or the owner can point to the sitemap from the robots.txt file instead. [5]

Sample

Below is an example of a validated XML sitemap for a simple three-page website. Sitemaps are a useful tool for making sites searchable, particularly sites built with non-HTML technologies whose pages crawlers cannot otherwise discover through links.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.net/?id=who</loc>
    <lastmod>2009-09-22</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.net/?id=what</loc>
    <lastmod>2009-09-22</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.example.net/?id=how</loc>
    <lastmod>2009-09-22</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>


Related Research Articles

Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head elements and attributes.
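As a brief illustration (the values here are invented, not taken from any particular page), a description and a set of keywords can be declared in the head section like this:

    <head>
      <meta charset="utf-8">
      <meta name="description" content="A short summary shown in search results.">
      <meta name="keywords" content="sitemap, web crawler, indexing">
      <title>Example page</title>
    </head>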


A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.


robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
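A minimal sketch of such a file, using a hypothetical /private/ directory, tells every crawler to stay out of that path while leaving the rest of the site open:

    User-agent: *
    Disallow: /private/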

Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

In the context of the World Wide Web, deep linking is the use of a hyperlink that links to a specific, generally searchable or indexed, piece of web content on a website, rather than the website's home page. The URL contains all the information needed to point to a particular item. Deep linking is different from mobile deep linking, which refers to directly linking to in-app content using a non-HTTP URI.

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow for users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.


Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

The deep web, invisible web, or hidden web are parts of the World Wide Web whose contents are not indexed by standard web search-engine programs. This is in contrast to the "surface web", which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with coining the term in 2001 as a search-indexing term.

The anchor text, link label or link text is the visible, clickable text in an HTML hyperlink. The term "anchor" was used in older versions of the HTML specification for what is currently referred to as the a element, or <a>. The HTML specification does not have a specific term for anchor text, but refers to it as "text that the a element wraps around". In XML terms, the anchor text is the content of the element, provided that the content is text.
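For example, in the following link (the URL is illustrative) the anchor text is "Example Domain":

    <a href="https://www.example.net/">Example Domain</a>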

In computer hypertext, a URI fragment is a string of characters that refers to a resource that is subordinate to another, primary resource. The primary resource is identified by a Uniform Resource Identifier (URI), and the fragment identifier points to the subordinate resource.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow setting allows website authors to indicate that the presence of a link is not an endorsement of the target site's importance.


A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user inputs a query within a web browser or a mobile app, and the search results are often a list of hyperlinks, accompanied by textual summaries and images. Users also have the option of limiting the search to a specific type of results, such as images, videos, or news.

Google Search Console is a web service by Google which allows webmasters to check indexing status, search queries, crawling errors and optimize visibility of their websites.


Bing Webmaster Tools is a free service, part of Microsoft's Bing search engine, which allows webmasters to add their websites to the Bing index crawler and see their site's performance in Bing. The service also offers tools for webmasters to troubleshoot the crawling and indexing of their website, submission of new URLs, Sitemap creation, submission and ping tools, website statistics, consolidation of content submission, and new content and community resources.

Yahoo! Site Explorer (YSE) was a Yahoo! service which allowed users to view information on websites in Yahoo!'s search index. The service was closed on November 21, 2011 and merged with Bing Webmaster Tools, a tool similar to Google Search Console. In particular, it was useful for finding information on backlinks pointing to a given webpage or domain because YSE offered full, timely backlink reports for any site. After merging with Bing Webmaster Tools, the service only offers full backlink reports to sites owned by the webmaster. Reports for sites not owned by the webmaster are limited to 1,000 links.

A single-page application (SPA) is a web application or website that interacts with the user by dynamically rewriting the current web page with new data from the web server, instead of the default method of a web browser loading entire new pages. The goal is faster transitions that make the website feel more like a native app.

PowerMapper is a web crawler that automatically creates a site map of a website using thumbnails from each web page.

The rel="alternate" hreflang="x" link attribute is an HTML meta element described in RFC 8288. Hreflang specifies the language and, optionally, the geographic restrictions for a document. Hreflang is interpreted by search engines and can be used by webmasters to clarify the language and geographical targeting of a website.
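A typical use is a pair of link elements in a page's head that point to alternate language versions of the same document; the URLs below are illustrative:

    <link rel="alternate" hreflang="en" href="http://www.example.net/en/">
    <link rel="alternate" hreflang="de" href="http://www.example.net/de/">

Search engines that honour the annotations can then serve the variant matching the user's language or region.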


A search engine cache is a cache of web pages that shows the page as it was when it was indexed by a web crawler. Cached versions of web pages can be used to view the contents of a page when the live version cannot be reached, has been altered or taken down.

References

  1. Nielsen, Jakob (August 12, 2008). "Sitemap Usability". Jakob Nielsen's Alertbox.
  2. Nadik, Tessa (2023-02-09). "What Is A Sitemap? Do I Need One?". Search Engine Journal. Retrieved 2023-09-16.
  3. "Google, Yahoo!, Microsoft Standardize Against Google Sitemap Protocol". Oreilly . Retrieved 2012-07-24.
  4. Joint announcement from Google, Yahoo, and Bing supporting Sitemaps
  5. "Submitting Sitemaps". Google Inc. Retrieved 2012-07-06.