http://googleblog.blogspot.com/2005/06/webmaster-friendly.html
Google came up with something new (nothing new about that, is there?) - a method by which webmasters tell the search engines which files should be indexed, where they can be found, and how important each one is. Until now, search engines have indexed a site by scanning its HTML files for links to other files. Upon finding a link, the bot adds it to its list and scans it next. The problems with this approach are many...
- Indexes many files that may be private. This is exploited by hackers in a practice known as Google hacking. If the webmaster did not include a private file in the robots.txt exclusion list, it will be indexed by the Googlebot. And if the webmaster does list the private file's URL in robots.txt, that becomes an invitation to hackers - it tells them exactly where the private files can be found.
- Can be wrong in estimating the importance of a page. For example, I would consider my JavaScript Tutorial page to be more important than my CGI-Perl Tutorial page - but a bot may not be able to guess that.
- Orphan files won't be listed - if a page doesn't have any other pages linking to it, it will never be found and listed.
- And much more...
The new method lets the webmaster submit the location of an XML file that lists the location of every page on the website.
The format of the XML file is fairly simple.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
                            http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
  <url>
    <loc>http://www.geocities.com/binnyva/code/javascript</loc>
    <lastmod>2004-06-08T09:28:34Z</lastmod>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.geocities.com/binnyva/code/javascript/basic_tutorial</loc>
    <lastmod>2005-01-01T15:49:32Z</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.geocities.com/binnyva/code/javascript/basic_tutorial/contents.html</loc>
    <lastmod>2005-04-03T18:14:20Z</lastmod>
    <priority>0.6</priority>
  </url>
  ...
</urlset>
Let's have a look at one block.
<url>
  <loc>http://www.geocities.com/binnyva/code/javascript</loc> - The location (URL) of the page.
  <lastmod>2004-06-08T09:28:34Z</lastmod> - The 'Last Modified' date of that file.
  <priority>0.8</priority> - The importance of the page - this can be anywhere between 0.0 and 1.0.
</url>
For more details, go to Google's page on the Sitemap protocol.
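Just to make the format concrete, here is a tiny Python sketch of my own that writes a two-page sitemap by hand (the example.com URLs and dates are placeholders, not real pages):

pages = [
    # (location, last-modified timestamp, priority)
    ("http://www.example.com/", "2005-06-01T12:00:00Z", "1.0"),
    ("http://www.example.com/tutorial.html", "2005-05-20T08:30:00Z", "0.8"),
]

entries = []
for loc, lastmod, priority in pages:
    # build one <url> block per page
    entries.append(
        "  <url>\n"
        "    <loc>%s</loc>\n"
        "    <lastmod>%s</lastmod>\n"
        "    <priority>%s</priority>\n"
        "  </url>" % (loc, lastmod, priority)
    )

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n'
    + "\n".join(entries) +
    "\n</urlset>\n"
)

f = open("sitemap.xml", "w")
f.write(sitemap)
f.close()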
But you don't have to worry about making an XML file with the list of all your pages by hand - good ol' Google has provided a Python script that will do the job for you. This script (called Sitemap Generator) can be downloaded from its SourceForge page at http://sourceforge.net/project/showfiles.php?group_id=137793&package_id=153422
Basic Info...
Name: sitemap_gen
Version: 1.0
Summary: Sitemap Generator
Home-page: http://sourceforge.net/projects/goog-sitemap_gen/
Author: Google Inc.
Author-email: opensource@google.com
License: BSD
From the README file...
The sitemap_gen.py script analyzes your web server and generates one or more Sitemap files. These files are XML listings of content you make available on your web server. The files can be directly submitted to search engines as hints for the search engine web crawlers as they index your web site. This can result in better coverage of your web content in search engine indices, and less of your bandwidth spent doing it.

The sitemap_gen.py script is written in Python 2.2 and released to the open source community for continuous improvements under the BSD 2.0 new license, which can be found at: http://www.opensource.org/licenses/bsd-license.php

The original release notes for the script, including a walk-through for webmasters on how to use it, can be found at the following site: http://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html
How to use the script.
First, you have to create a configuration XML file that has the details of your site. Just copy the 'example_config.xml' file that comes with the sitemap_gen script and edit it. The file has enough explanations in it, so it should be easy to create a config file that matches your site. The config file that I used for one of my sites is given below...
<?xml version="1.0" encoding="UTF-8"?>
<!--
sitemap_gen.py configuration script - Bin-Co
-->
<site
base_url="http://www.geocities.com/binnyva/code/"
store_into="D:/code/sitemap.xml"
verbose="1"
suppress_search_engine_notify="1"
>
<directory path="D:/code" url="http://www.geocities.com/binnyva/code/" />
<!-- Exclude URLs that point to UNIX-style hidden files -->
<filter action="drop" type="regexp" pattern="/\.[^/]*$" />
<!-- Exclude URLs that end with a '~' (IE: emacs backup files) -->
<filter action="drop" type="wildcard" pattern="*~" />
<!-- Exclude URLs that point to default index.html files.
URLs for directories get included, so these files are redundant. -->
<filter action="drop" type="wildcard" pattern="*index.htm*" />
<!-- Custom Drops -->
<!-- Downloads -->
<filter action="drop" type="wildcard" pattern="*.zip" />
<filter action="drop" type="wildcard" pattern="*.gz" />
<filter action="drop" type="wildcard" pattern="*.bz" />
<filter action="drop" type="wildcard" pattern="*.exe" />
<!-- Images -->
<filter action="drop" type="wildcard" pattern="*.gif" />
<filter action="drop" type="wildcard" pattern="*.jpg" />
<filter action="drop" type="wildcard" pattern="*.png" />
<!-- Code Files -->
<filter action="drop" type="wildcard" pattern="*.tcl" />
<filter action="drop" type="wildcard" pattern="*.pl" />
<filter action="drop" type="wildcard" pattern="*.cgi" />
<!-- Script files -->
<filter action="drop" type="wildcard" pattern="*.js" />
<filter action="drop" type="wildcard" pattern="*.css" />
<!-- Some Folders must not get in. -->
<filter action="drop" type="wildcard" pattern="*Temp*" />
<filter action="drop" type="wildcard" pattern="*cgi-bin*" />
</site>
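In case you are not sure what a filter will catch: 'wildcard' patterns are shell-style matches and 'regexp' patterns are regular expressions, both applied to the URL. Here is a rough sketch of my own (not the script's actual code, and the script's matching may differ in details like case sensitivity) for testing a pattern before putting it in the config:

import fnmatch
import re

def dropped_by_wildcard(url, pattern):
    # shell-style wildcard match, e.g. "*.zip" or "*cgi-bin*"
    return fnmatch.fnmatch(url, pattern)

def dropped_by_regexp(url, pattern):
    # regular-expression search, e.g. "/\.[^/]*$" for UNIX hidden files
    return re.search(pattern, url) is not None

print(dropped_by_wildcard("http://www.geocities.com/binnyva/code/foo.zip", "*.zip"))      # True
print(dropped_by_regexp("http://www.geocities.com/binnyva/code/.hidden", r"/\.[^/]*$"))   # True

Anything these return True for would be dropped from the sitemap by the corresponding filter.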
The sitemap_gen.py script is actually meant to be run from your web server - but that is not the way I did it. I ran the script on my local machine and then uploaded the XML file it created to my server. Just make sure that the locations you give in the config file correctly point to your pages. Now run the sitemap_gen.py script using Python. I used this command.
python sitemap_gen.py --config=binco.xml
Just type this command in the terminal - you do have Python installed, don't you?
If everything goes well, a message will be shown which in my case was...
Reading configuration file: binco.xml
Walking DIRECTORY "D:/code\"
Sorting and normalizing collected URLs.
Writing Sitemap file "D:\code\sitemap.xml" with 183 URLs
Search engine notification is suppressed.
Count of file extensions on URLs:
 45  (no extension)
123  .html
 12  .txt
  3  .xml
Number of errors: 0
Number of warnings: 0
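If you want a quick sanity check before going any further, a few lines of Python (my own, not part of the script) can parse the generated file back and count the URLs - the number should match the "183 URLs" line above:

import xml.dom.minidom

# path is the store_into location from the config file
doc = xml.dom.minidom.parse("D:/code/sitemap.xml")
locs = doc.getElementsByTagName("loc")
print("URLs in the sitemap: %d" % len(locs))
for node in locs[:5]:
    # show the first few locations
    print(node.firstChild.data)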
Have a look at the sitemap.xml file that was created - all the pages of your site in one file. Now we have to submit this file to Google. Before doing that, upload the XML file to your server and note its location. Then log in to the Google Sitemaps site at Webmasters using your Gmail account. If you don't have one, create it now. Click on the 'Add a Sitemap' link and enter the location of the XML file we just created in the input field. This will add the XML file to the list of sitemaps that Google will download and parse - this will take some time, so be patient.
That's all there is to it. The whole process can be automated - this is how Google wants it to be done. Just upload the Python script and its config file to your server and set a cron job to run it every week. Make sure that the config file does NOT have the ' suppress_search_engine_notify="1" ' option. Now whenever the script runs, it will create the XML file and notify Google that a new Sitemap was made.
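If you regenerate and upload the sitemap by hand instead, you can also send that notification yourself. A small sketch of my own, using plain Python 2 (same vintage as the script) - the ping URL is the one documented for the Sitemaps program, so treat it as an assumption and let the script do the notifying if in doubt:

import urllib

# wherever you uploaded the generated file
sitemap_url = "http://www.geocities.com/binnyva/code/sitemap.xml"
ping_url = ("http://www.google.com/webmasters/sitemaps/ping?sitemap="
            + urllib.quote(sitemap_url, ""))
# Google replies with a short confirmation page
print(urllib.urlopen(ping_url).read())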
This is another good idea from Google. But even they are not confident that it will work. The following text is taken from the Google Blog.
'We're undertaking an experiment called Google Sitemaps that will either fail miserably, or succeed beyond our wildest dreams.'
Anything that can be used to improve the position of a site in Google search will be abused by some webmasters. And this technology gives a lot of control to webmasters - so undoubtedly someone will find a way to cheat using Sitemaps. Till then, let us enjoy this technology.
4 Comments:
thanks for the great tutorial, this really helped out, knowing how to run it locally
Check out the sitemap xml generator software at fnoware - fnoware SiteMap XML
Do you have any idea how I can modify the configuration file to give all the pages in a specific folder a priority of 1?
Hi, I am very new to the XML sitemap generator. In my website (a PHP website) I have one main URL, and under the main URL there are many sub-URLs (thousands of URLs, based on IDs). How can I list those sub-URLs? Can I use wildcards (*)?