BACK TO BASICS: XML Sitemaps Defined
Part One: Pairing Traditional Site Maps with XML Sitemaps
By Bradley Leese, April 15, 2009

At Bruce Clay, Inc., one of the things we strive to do in our everyday client work is get the pages we consider important into the search engines. To give the search engines the most complete look at what is in a site, and to give those pages the best chance of being indexed, it's necessary to create maps of your site's pages. There are two ways of doing this. The traditional way is to create an HTML site map, but in recent years the search engines have developed the XML Sitemap protocol to assist in spidering. Google in particular has embraced the format and created several variations that enable it to easily discover all manner of content.

The first part of this two-part series outlines the basic differences between HTML site maps and XML Sitemaps and gets you started on creating your own XML Sitemaps. The second will tackle the variations that Google has created to help you serve it content.

HTML Site Map Page(s)
A traditional HTML site map communicates to site users and search engine spiders how the site's information is organized. Essentially, the purpose of the site map is to document the relevance of the site's content. If a site owner finds errors when attempting to match the site map to the structure of the site, and the exercise reveals that the information is confusing or poorly organized, the site needs to be reorganized in order to demonstrate clear subject expertise.

HTML site maps are extremely important for usability. Your visitors will turn to your site map when they can't use or don't understand your navigation. Take a moment to view Google's site map for a clearer understanding of the recommendations in its Webmaster Guidelines section: http://www.google.com/sitemap.html
Snippet from Google HTML Site Map
Site Map Restrictions
The purpose of HTML site maps was always to lead the search engines to identify and (hopefully) conclude that the site's navigation and content were proof enough that the site was worthy of high keyword rankings. The HTML format has many limitations, not least of which is the somewhat restrictive format that Google outlines in its Webmaster Guidelines:

Design and Content Guidelines
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769

Large sites will have trouble fitting their entire site within the confines of the Webmaster Guidelines linked above, and Bruce Clay writers have long offered solutions for combating these issues. Following the design guidelines exactly means either not listing every page on your site in your site map or creating nested site maps that may not be crawled entirely. Either way, most site maps are just pages of links with good anchor text but very little in the way of content.

The most important thing to know about using site maps successfully is that Google does not expect you to list every page on your site within your site map. Owners of large sites may scoff that this is obvious and that every Internet marketer should know it. However, this recommendation might give you pause. After all, wouldn't it be wise to document all the pages on your site, whatever their volume, to verify that the search engines have access to them? The answer is yes, and that is where the Sitemaps protocol comes into play.

Sitemaps (XML)
XML Sitemaps - usually called just Sitemaps - are a way for you to give the search engines information about your site. There are three ways to point search engines to your XML Sitemap: you can reference it in your robots.txt file, you can submit it directly to the engines using their submission forms, or you can issue an HTTP request to the URL provided by the search engine.
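The robots.txt and HTTP request options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sitemap:` line is the robots.txt directive described in the Sitemaps protocol, but the ping endpoint used here (`http://searchengine.example/ping`) is a placeholder for whatever URL a given search engine publishes, `www.example.com` stands in for your own site, and both function names are hypothetical.

```python
from urllib.parse import quote


def robots_txt_sitemap_line(sitemap_url: str) -> str:
    """Return the robots.txt directive that advertises a Sitemap location."""
    return f"Sitemap: {sitemap_url}"


def build_ping_url(ping_endpoint: str, sitemap_url: str) -> str:
    """Build the URL for an HTTP ping request.

    The Sitemap location is passed as a fully URL-encoded `sitemap`
    query parameter. `ping_endpoint` is an assumption: use the address
    the search engine itself documents.
    """
    return f"{ping_endpoint}?sitemap={quote(sitemap_url, safe='')}"


if __name__ == "__main__":
    sitemap = "http://www.example.com/sitemap.xml"
    # Line to add to robots.txt at the root of the site.
    print(robots_txt_sitemap_line(sitemap))
    # URL to request (for example with a browser) to notify an engine.
    print(build_ping_url("http://searchengine.example/ping", sitemap))
```

The sketch only builds the strings; actually sending the request is left out so that nothing is contacted by accident.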
You can find out how to build your Sitemap, along with the full Sitemaps protocol, at http://sitemaps.org.

Sitemaps (XML) Page(s)
In some cases, Sitemaps are helpful if your site has dynamic content or pages featuring technologies like AJAX or Flash that might not be easily found and crawled during a normal spidering process. While the search engines have improved by leaps and bounds in their ability to follow links in Flash, supplementing those links with an XML Sitemap can make your life easier.

An XML Sitemap is also useful if your site is new and has few links to it, or if your site has a large archive of content pages that are not well linked to each other or are not linked at all. Because Google and the other search engines discover new pages by following links from page to page, poorly linked or under-linked pages may have a harder time getting spidered and indexed. An XML Sitemap provides those URLs to the search engines directly so that they can spider them and consider them for indexing.

Using a Sitemap provides additional site information to Google, which complements Google's normal methods of crawling. Sitemaps allow Google to crawl a Web site in a much more timely fashion. Google does state, however, that there is no guarantee that URLs from a Web site's Sitemap will end up in the Google index. Web sites are also never penalized for submitting Sitemaps.

Dynamic XML Sitemaps Do Not Replace Static Site Maps
With the differences between the two formats laid out, it may seem that the traditional format is obsolete. That conclusion is incorrect. Traditional site maps and XML Sitemaps work best when paired together, creating a stronger and fuller picture of your site for the search engines. In addition, the HTML version is useful for your human visitors, who might use the site map pages to navigate their way around the site if your global navigation is confusing or broken.

Sitemaps are for Spiders and Site Maps are for Silos
It is vital that this seemingly minor distinction is taken into consideration.
XML Sitemaps do help prioritize which sections of the site are silos and which sections are supporting or sublevel content sections; however, they do not explicitly silo. Only site maps (HTML) succeed in this task when properly nested throughout the Web site. Consider that Google requires link text in order to clearly identify silos and subdirectory content. Site maps (HTML) are cached by search engines, which means they'll show up in the search results, something that could be useful to your site in the long run.
XML Sitemaps, on the other hand, are only a batch of links to be followed. They're not human readable (unless the human is particularly fond of reading code) and they won't pass any link equity. They are strictly for search engines. The upside is that because you don't have to worry about any pesky human eyes, the code can be extremely efficient, and things like font size and text content are not a concern.
http://www.google.com/hostednews/sitemap_index.xml
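To show how little such a file needs to contain, here is a minimal sketch that writes a bare-bones Sitemap using Python's standard library. The `urlset`, `url` and `loc` elements and the namespace come from the Sitemaps protocol at sitemaps.org; the function name `build_sitemap` and the example URLs are hypothetical, and optional elements are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap document, as a string, listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(url_el, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        "http://www.example.com/",
        "http://www.example.com/catalog?item=12&desc=jars",
    ]))
```

Nothing but locations goes into the file: no anchor text, no styling, no surrounding copy, which is why it serves spiders and not visitors.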
It's clear that the best way to build up your site is to make both a traditional HTML site map and an XML Sitemap. You can add your XML Sitemap through Google Webmaster Tools.
Sitemap Guidelines
Sitemaps all adhere to the same general guidelines: a Sitemap may contain a list of URLs or a list of other Sitemaps. If a Sitemap contains a list of other Sitemaps, it can be saved as a Sitemap index file using the XML format provided for that file type. For those with larger sites, be aware that an XML Sitemap index file cannot contain more than 1,000 Sitemaps.

There are also restrictions on the number of URLs and the file size of a Sitemap file. A Sitemap file cannot contain more than 50,000 URLs and cannot be larger than 10MB when uncompressed. If a Sitemap has more than 50,000 URLs or is too large, it can be broken into several smaller Sitemaps. These limits make sure that the Web server is not overloaded by large files.

Just as with the best practice for linking within your site, all URLs in your XML Sitemap must be referred to the same way every time. If a site specifies its site location as http://www.peanutbutterville.com/, the URL list should not contain URLs that begin with the non-www version, http://peanutbutterville.com/. Likewise, if the site location is named as http://peanutbutterville.com/, the URL list should not contain URLs that begin with http://www.peanutbutterville.com/.

Direct image URLs should also not be included in the Sitemaps, as Google indexes the page the image appears on, not the image itself. If your URLs include session IDs, make sure that you strip those out for the XML Sitemap. The Sitemap URL must also be readable by the Web server where the Sitemap is located, and may only contain ASCII characters. An XML Sitemap containing upper ASCII characters, certain control codes or special characters such as * and {} will receive an error and can't be added.

It is possible to create a specialized Sitemap for certain types of content. However, certain Sitemaps are only accepted by specific search engines.
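Several of these guidelines can be checked before a Sitemap is ever written. The following is a rough sketch, not a complete validator: the limits are the figures quoted above, the session parameter names in `SESSION_PARAMS` are hypothetical examples (use whatever your own site appends), the function names are made up for illustration, and the 10MB file size check and the special-character rules are left out.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

MAX_URLS_PER_SITEMAP = 50_000    # per-file URL limit quoted above
MAX_SITEMAPS_PER_INDEX = 1_000   # per-index limit quoted above
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # hypothetical names


def clean_url(url, canonical_host):
    """Return the URL without session parameters, or None to skip it."""
    if not url.isascii():
        return None  # only ASCII characters are allowed
    parts = urlsplit(url)
    if parts.netloc.lower() != canonical_host.lower():
        return None  # e.g. a non-www URL on a site that uses www
    # Drop session parameters; note that the remaining query string is
    # re-encoded, which may change how some characters are escaped.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(query)))


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into groups small enough for one Sitemap each."""
    chunks = [urls[i:i + size] for i in range(0, len(urls), size)]
    if len(chunks) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many Sitemaps for a single index file")
    return chunks


if __name__ == "__main__":
    host = "www.peanutbutterville.com"
    raw = [
        "http://www.peanutbutterville.com/jars?sid=abc123&page=2",
        "http://peanutbutterville.com/jars",  # wrong host, skipped
    ]
    cleaned = [u for u in (clean_url(r, host) for r in raw) if u]
    for number, group in enumerate(chunk_urls(cleaned), start=1):
        print(f"sitemap{number}.xml would hold {len(group)} URL(s)")
```

Each group returned by `chunk_urls` would become one Sitemap file, and the Sitemap index file would then list those files.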
Next month, the second article in this series will cover the types of Sitemaps that are specific to Google, which means Yahoo!, Microsoft Live Search and Ask won't be able to read them.

Copyright 2009 Bruce Clay, Inc.