The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
History
Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. Google, Yahoo! and Microsoft announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made.
In April 2007, Ask.com and IBM announced support for Sitemaps, and Google, Yahoo!, and MSN announced auto-discovery of Sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their websites.
The Sitemaps protocol is based on ideas from "Crawler-friendly Web Servers," with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages.
Purpose
Sitemaps are particularly beneficial on websites where:
* Some areas of the website are not available through the browsable interface
* Webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines
* The site is very large, and web crawlers may overlook some of the new or recently updated content
* The site has a huge number of pages that are isolated or not well linked together
* The site has few external links
File format
The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. Sitemaps can also be just a plain text list of URLs. They can also be compressed in .gz (gzip) format.
A sample Sitemap that contains just one URL and uses all optional tags is shown below.
 <?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>http://example.com/</loc>
     <lastmod>2006-11-18</lastmod>
     <changefreq>daily</changefreq>
     <priority>0.8</priority>
   </url>
 </urlset>
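A file of this shape can also be generated programmatically. The following is a minimal sketch using Python's standard library; the URL and metadata values are illustrative, not part of any real site:

```python
# Minimal sketch: build a one-URL Sitemap with Python's standard library.
# The URL and metadata values are illustrative placeholders.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # serialize with a default xmlns, no prefix

urlset = ET.Element(f"{{{NS}}}urlset")
url = ET.SubElement(urlset, f"{{{NS}}}url")
ET.SubElement(url, f"{{{NS}}}loc").text = "http://example.com/"
ET.SubElement(url, f"{{{NS}}}lastmod").text = "2006-11-18"
ET.SubElement(url, f"{{{NS}}}changefreq").text = "daily"
ET.SubElement(url, f"{{{NS}}}priority").text = "0.8"

# Produce UTF-8 bytes with an XML declaration, as the protocol requires
xml_bytes = ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

Using ElementTree rather than string concatenation has the side benefit that special characters in URLs are entity-escaped automatically.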
The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. Because a single Sitemap is limited to 50 MiB (uncompressed) or 50,000 URLs, an index file is necessary for large sites.
An example of Sitemap index referencing one separate sitemap follows.
 <?xml version="1.0" encoding="UTF-8"?>
 <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
     <loc>http://www.example.com/sitemap1.xml.gz</loc>
     <lastmod>2014-10-01T18:23:17+00:00</lastmod>
   </sitemap>
 </sitemapindex>
Element definitions
The definitions for the elements, per the sitemaps.org specification, are shown below:

 Element       Required   Description
 <urlset>      yes        Encapsulates the file and references the protocol standard
 <url>         yes        Parent element for each URL entry; the remaining elements are its children
 <loc>         yes        The absolute URL of the page; must be less than 2,048 characters
 <lastmod>     no         Date of last modification, in W3C Datetime format (e.g. YYYY-MM-DD)
 <changefreq>  no         How frequently the page is likely to change: always, hourly, daily, weekly, monthly, yearly, or never
 <priority>    no         Priority of this URL relative to other URLs on the site, from 0.0 to 1.0 (default 0.5)
Support for the elements that are not required can vary from one search engine to another.
Other formats
Text file
The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well: the file must be UTF-8 encoded, and cannot be more than 50 MB (uncompressed) or contain more than 50,000 URLs. Sitemaps that exceed these limits should be broken up into multiple Sitemaps with a Sitemap index file (a file that points to multiple Sitemaps).
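The splitting step can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation; the file names and base URL are assumptions:

```python
# Sketch: split a list of page URLs into plain-text Sitemaps of at most
# 50,000 URLs each, plus a Sitemap index that points at them.
# File names (sitemapN.txt) and the base URL are illustrative assumptions.
MAX_URLS = 50_000

def split_text_sitemaps(urls, base="https://www.example.org/"):
    """Return (sitemap_files, index_xml) for a list of page URLs."""
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    # One text file per chunk, one URL per line, UTF-8 when written out
    files = {f"sitemap{n}.txt": "\n".join(chunk) + "\n"
             for n, chunk in enumerate(chunks, start=1)}
    entries = "".join(
        f"  <sitemap><loc>{base}{name}</loc></sitemap>\n" for name in files)
    index = ('<?xml version="1.0" encoding="UTF-8"?>\n'
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
             f"{entries}</sitemapindex>\n")
    return files, index

# 60,000 URLs exceed the per-file limit, so two text Sitemaps result
files, index = split_text_sitemaps(
    [f"https://www.example.org/page{i}" for i in range(60_000)])
```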
Syndication feed
A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method might provide crawlers with only the more recently created URLs, but other URLs can still be discovered during normal crawling.
It can be beneficial to have a syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap.
Search engine submission
If Sitemaps are submitted directly to a search engine ("pinged"), it will return status information and any processing errors. The details involved with submission vary between search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:
 Sitemap: <sitemap_location>
The <sitemap_location> should be the complete URL to the sitemap, such as:
 https://www.example.org/sitemap.xml
This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt, or the URL can simply point to the main sitemap index file.
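Because the directive is position-independent, a crawler performing auto-discovery can simply scan every line of robots.txt for it. A minimal sketch in Python (the robots.txt content shown is illustrative):

```python
# Sketch: extract Sitemap URLs from robots.txt content. Because the
# "Sitemap:" directive is independent of user-agent groups, every line
# of the file is scanned. The sample content below is illustrative.
def find_sitemaps(robots_txt: str) -> list[str]:
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs like "https://..." survive
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps

robots = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Sitemap: https://www.example.org/sitemap.xml\n"
)
```

Matching the directive name case-insensitively is deliberate: robots.txt field names are conventionally treated as case-insensitive.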
The major search engines historically accepted direct sitemap submission through a ping endpoint of the following form:
 Google   http://www.google.com/ping?sitemap=<sitemap_url>
 Bing     http://www.bing.com/ping?sitemap=<sitemap_url>
Sitemap URLs submitted using the sitemap submission URLs need to be URL-encoded, for example:
* replace : (colon) with %3A
* replace / (slash) with %2F
Limitations for search engine indexing
Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results. Specific examples are provided below.
* Google - Webmaster Support on Sitemaps: "Using a sitemap doesn't guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling. However, in most cases, your site will benefit from having a sitemap, and you'll never be penalized for having one."
* Bing - Bing uses the standard sitemaps.org protocol, and its handling of Sitemaps is very similar to that of Google, mentioned above.
* Yahoo - After the search deal between Yahoo! Inc. and Microsoft commenced, Yahoo! Site Explorer was merged with Bing Webmaster Tools.
Sitemap limits
Sitemap files have a limit of 50,000 URLs and 50 MB (uncompressed) per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. A Sitemap index file may not list more than 50,000 Sitemaps, must be no larger than 50 MiB (52,428,800 bytes), and can be compressed. A site may have more than one Sitemap index file.
As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).
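Python's standard library can perform this escaping; a small sketch (the URL is a contrived illustration containing all five characters that need escaping):

```python
# Sketch: entity-escape the XML special characters in a URL before
# placing it inside a <loc> element. xml.sax.saxutils.escape handles
# & < > by default; the two quote characters are added via the
# entities argument. The URL below is a contrived illustration.
from xml.sax.saxutils import escape

raw = 'http://example.com/view?title="A&B"&cat=<new>'
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
```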
Best practice for optimising a sitemap index for search engine crawlability is to ensure the index refers only to sitemaps as opposed to other sitemap indexes. Nesting a sitemap index within a sitemap index is invalid according to Google.
Additional sitemap types
A number of additional XML sitemap types outside of the scope of the Sitemaps protocol are supported by Google to allow webmasters to provide additional data on the content of their websites. Video and image sitemaps are intended to improve the capability of websites to rank in image and video searches.
Video sitemaps
Video sitemaps indicate data related to embedding and autoplaying, preferred thumbnails to show in search results, publication date, video duration, and other metadata. Video sitemaps are also used to allow search engines to index videos that are embedded on a website but hosted externally, such as on Vimeo or YouTube.
Image sitemaps
Image sitemaps are used to indicate image metadata, such as licensing information, geographic location, and an image's caption.
Google News Sitemaps
Google supports a Google News sitemap type for facilitating quick indexing of time-sensitive news subjects.
Multilingual and multinational sitemaps
In December 2011, Google announced annotations for sites that want to target users in many languages and, optionally, countries. A few months later Google announced, on its official blog, that it was adding support for specifying the rel="alternate" and hreflang annotations in Sitemaps. Compared with HTML link elements, until then the only option, the Sitemaps option offered advantages including a smaller page size and easier deployment for some websites.
For example, suppose a site targets English-language users through http://www.example.com/en and Greek-language users through http://www.example.com/gr. Until then, the only option was to add the hreflang annotation either in the HTTP header or as HTML link elements on both URLs, like this:

 <link rel="alternate" hreflang="en" href="http://www.example.com/en" />
 <link rel="alternate" hreflang="el" href="http://www.example.com/gr" />

Now, one can alternatively use the following equivalent markup in Sitemaps:

 <?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <url>
     <loc>http://www.example.com/en</loc>
     <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/en" />
     <xhtml:link rel="alternate" hreflang="el" href="http://www.example.com/gr" />
   </url>
   <url>
     <loc>http://www.example.com/gr</loc>
     <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/en" />
     <xhtml:link rel="alternate" hreflang="el" href="http://www.example.com/gr" />
   </url>
 </urlset>

Note that the hreflang values are ISO 639-1 language codes, so "el" denotes Greek even though the example path uses /gr.
See also
* Biositemap
* Metadata
* Resources of a Resource
* Yahoo! Site Explorer
* Google Webmaster Tools