The Sitemaps protocol allows a webmaster to inform search engines
about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site.
Sitemaps are particularly beneficial on websites where:
* some areas of the website are not available through the browsable interface
* webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines
* the site is very large and there is a chance for the web crawlers to overlook some of the new or recently updated content
* the website has a huge number of pages that are isolated or not well linked together
* the website has few external links
In April 2007, Ask.com and IBM announced support for Sitemaps. Google, Yahoo, and Microsoft also announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.
The Sitemaps protocol is based on ideas from "Crawler-friendly Web Servers," with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages.
A sample Sitemap that contains just one URL and uses all optional tags is shown below.
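    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/</loc>
        <lastmod>2006-11-18</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>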
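As a rough sketch of how a one-URL Sitemap like this might be produced programmatically, the snippet below uses Python's standard `xml.etree.ElementTree`. The function name `build_urlset` and the dict-based entry format are illustrative assumptions, not part of the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemap files.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Return Sitemap XML (as a str) for a list of entry dicts (hypothetical format)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is required; the other three tags are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # ElementTree escapes characters such as "&" in text automatically.
    return ET.tostring(urlset, encoding="unicode")


sample = build_urlset([{
    "loc": "http://example.com/",
    "lastmod": "2006-11-18",
    "changefreq": "daily",
    "priority": 0.8,
}])
print(sample)
```

Note that `ET.tostring(..., encoding="unicode")` omits the XML declaration, so a caller writing the result to a file would need to add it.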
An example of a Sitemap index referencing one separate sitemap follows.
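A Sitemap index with a single entry can be sketched in Python as follows. The helper name `build_sitemap_index`, the sitemap location, and the date are hypothetical values chosen for illustration; only the element names come from the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemaps):
    """Return Sitemap index XML for (loc, lastmod) pairs; lastmod may be None."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="unicode")


# Hypothetical sitemap location and modification date.
index_xml = build_sitemap_index([
    ("http://www.example.com/sitemap1.xml.gz", "2014-10-01"),
])
print(index_xml)
```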
The definitions for the elements are shown below:
| Element | Required? | Description |
|---|---|---|
| `<urlset>` | Yes | The document-level element for the Sitemap. The rest of the document after the `<?xml version>` element must be contained in this. |
| `<url>` | Yes | Parent element for each entry. |
| `<sitemapindex>` | Yes | The document-level element for the Sitemap index. The rest of the document after the `<?xml version>` element must be contained in this. |
| `<sitemap>` | Yes | Parent element for each entry in the index. |
| `<loc>` | Yes | Provides the full URL of the page or sitemap, including the protocol (e.g. http, https) and a trailing slash, if required by the site's hosting server. This value must be shorter than 2,048 characters. Note that ampersands in the URL need to be escaped as `&amp;`. |
| `<lastmod>` | No | The date that the file was last modified, in ISO 8601 format. This can display the full date and time or, if desired, may simply be the date in the format YYYY-MM-DD. |
| `<changefreq>` | No | How frequently the page may change: always, hourly, daily, weekly, monthly, yearly, or never. "Always" is used to denote documents that change each time that they are accessed. "Never" is used to denote archived URLs (i.e. files that will not be changed again). This is used only as a guide for crawlers, and is not used to determine how frequently pages are indexed. Does not apply to `<sitemap>` elements. |
| `<priority>` | No | The priority of that URL relative to other URLs on the site. This allows webmasters to suggest to crawlers which pages are considered more important. The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5. Rating all pages on a site with a high priority does not affect search listings, as it is only used to suggest to the crawlers how important pages in the site are to one another. Does not apply to `<sitemap>` elements. |
Support for the elements that are not required can vary from one search engine to another.
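The constraints in the element definitions can be checked before a Sitemap is written. The following is a minimal sketch, assuming entries are plain dicts and that `priority`, when present, is numeric; `validate_entry` is a hypothetical helper, and `datetime.fromisoformat` is used only as an approximation of ISO 8601 parsing.

```python
from datetime import datetime
from urllib.parse import urlsplit

CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def validate_entry(entry):
    """Return a list of problems found in one Sitemap entry dict."""
    problems = []
    loc = entry.get("loc")
    if not loc:
        problems.append("loc is required")
    else:
        if not urlsplit(loc).scheme:
            problems.append("loc must include the protocol")
        if len(loc) >= 2048:
            problems.append("loc must be shorter than 2,048 characters")
    if "lastmod" in entry:
        try:
            # Accepts YYYY-MM-DD as well as a full date and time.
            datetime.fromisoformat(entry["lastmod"])
        except ValueError:
            problems.append("lastmod is not an ISO 8601 date")
    if "changefreq" in entry and entry["changefreq"] not in CHANGEFREQ_VALUES:
        problems.append("changefreq is not one of the allowed values")
    # The default priority is 0.5 when the tag is omitted.
    if not 0.0 <= float(entry.get("priority", 0.5)) <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems


good = validate_entry({
    "loc": "http://example.com/",
    "lastmod": "2006-11-18",
    "changefreq": "daily",
    "priority": 0.8,
})
bad = validate_entry({
    "loc": "example.com/",
    "changefreq": "sometimes",
    "priority": 1.5,
})
```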
The Sitemaps protocol allows the Sitemap to be a simple list of URLs
in a text file. The file specifications of
A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method might only provide crawlers with more recently created URLs, but other URLs can still be discovered during normal crawling.
It can be beneficial to have a syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap.
SEARCH ENGINE SUBMISSION
If Sitemaps are submitted directly to a search engine (pinged), it will return status information and any processing errors. The details involved with submission will vary with the different search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:
    Sitemap: <sitemap_location>
The <sitemap_location> should be the complete URL to the sitemap, such as: _http://www.example.org/sitemap.xml_
This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt, or the URL can simply point to the main sitemap index file.
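To illustrate auto-discovery from the consumer's side, here is a small sketch that collects every "Sitemap:" record from the text of a robots.txt file, wherever it appears. The function name `sitemap_urls` is hypothetical, and matching the field name case-insensitively is an assumption made for robustness.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named by Sitemap: records in robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the first colon so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


robots = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Sitemap: http://www.example.org/sitemap.xml\n"
    "sitemap: http://www.example.org/news-sitemap.xml\n"
)
found = sitemap_urls(robots)
```

Because the records are gathered regardless of position, the result is the same whether they appear before or after any user-agent group.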
The following table lists the sitemap submission URLs for several major search engines:
| Search engine | Submission URL | Help page | Market |
|---|---|---|---|
| Bing (and Yahoo!) | http://www.bing.com/webmaster/ping.aspx?sitemap= | Bing Webmaster Tools | Global |
Sitemap URLs submitted using the sitemap submission URLs need to be URL-encoded, for example: replacing : (colon) with %3A, / (slash) with %2F.
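The encoding step can be sketched with Python's standard `urllib.parse.quote`. Passing `safe=""` makes the slash get encoded as well, since `quote` leaves it untouched by default. The helper names and the `ping_prefix` parameter are illustrative assumptions; no request is actually sent.

```python
from urllib.parse import quote


def encode_sitemap_url(sitemap_url):
    """Percent-encode a sitemap URL, including ':' and '/'."""
    return quote(sitemap_url, safe="")


def build_ping_url(ping_prefix, sitemap_url):
    """Append the encoded sitemap URL to a search engine's submission URL."""
    return ping_prefix + encode_sitemap_url(sitemap_url)


encoded = encode_sitemap_url("http://www.example.org/sitemap.xml")
print(encoded)  # http%3A%2F%2Fwww.example.org%2Fsitemap.xml
```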
LIMITATIONS FOR SEARCH ENGINE INDEXING
Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results. Specific examples are provided below.
Sitemap files have a limit of 50,000 URLs and 50 MiB per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. Sitemap index files may not list more than 50,000 Sitemaps, must be no larger than 50 MiB (52,428,800 bytes), and can be compressed. A site can have more than one Sitemap index file.
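For a site that exceeds these limits, the URL list has to be divided across several sitemap files. A minimal sketch of that split is shown below; the function names are hypothetical, and the size check assumes the byte limit is measured on the sitemap's encoded content as passed in.

```python
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAP_BYTES = 52_428_800  # 50 MiB


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into chunks that each respect the URL limit."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def within_size_limit(sitemap_bytes):
    """Check an encoded sitemap against the 50 MiB size limit."""
    return len(sitemap_bytes) <= MAX_SITEMAP_BYTES


# Hypothetical site with 120,001 pages.
urls = [f"http://example.com/page/{n}" for n in range(120_001)]
chunks = split_into_sitemaps(urls)
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20001]
```

Each chunk would then be written to its own sitemap file and listed in a Sitemap index.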
As with all
* M.L. Nelson; J.A. Smith; del Campo; H. Van de Sompel; X. Liu (2006). "Efficient, Automated Web Resource Harvesting" (PDF).
* O. Brandman; J. Cho; Hector Garcia-Molina; Narayanan Shivakumar (2000). "Crawler-friendly web servers". _Proceedings of ACM SIGMETRICS Performance Evaluation Review, Volume 28, Issue 2_. doi:10.1145/362883.362894.