Sitemaps - A New Technology from Google

http://googleblog.blogspot.com/2005/06/webmaster-friendly.html

Google has come up with something new (nothing new about that, is there?) - a method by which webmasters tell search engines which files should be indexed, where they can be found, and how important they are. Until now, search engines have indexed a site by scanning its HTML files for links to other files. Upon finding a link, the crawler adds it to its list and scans it next. The problems with this approach are many...

  • Indexes many files that may be private. This is exploited by attackers in a process called Google hacking. If the webmaster did not include a private file in the robots.txt exclusion list, it will be indexed by Googlebot. And if the webmaster does include the private file's URL in robots.txt, that is an open invitation to hackers - it tells them exactly where the private files can be found.
  • Can be wrong in estimating the importance of a page. For example, I would consider my JavaScript Tutorial page to be more important than my CGI-Perl Tutorial page - but a bot may not be able to guess that.
  • Orphan files won't be listed - if a page doesn't have any other pages linking to it, it will not be listed.
  • And much more...

The new method lets the webmaster submit the location of an XML file that lists all the pages of the website.

The format of the XML file is fairly simple.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
                            http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">

  <url>
    <loc>http://www.geocities.com/binnyva/code/javascript</loc>
    <lastmod>2004-06-08T09:28:34Z</lastmod>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.geocities.com/binnyva/code/javascript/basic_tutorial</loc>
    <lastmod>2005-01-01T15:49:32Z</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.geocities.com/binnyva/code/javascript/basic_tutorial/contents.html</loc>
    <lastmod>2005-04-03T18:14:20Z</lastmod>
    <priority>0.6</priority>
  </url>

  ...

</urlset>

Let's have a look at one block.

<url>
<loc>http://www.geocities.com/binnyva/code/javascript</loc> - The location (URL) of the page.
<lastmod>2004-06-08T09:28:34Z</lastmod> - The 'Last Modified' date of that file.
<priority>0.8</priority> - The importance of the page - anywhere between 0.0 and 1.0.
</url>
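
If you would rather generate such a file by hand, a few lines of Python will do it. Here is a minimal sketch using modern Python's standard library (the page list is just illustrative data, not anything the protocol requires):

# Minimal sketch: build a sitemap XML file by hand with ElementTree.
# The pages list below is made-up illustration data.
import xml.etree.ElementTree as ET

NS = "http://www.google.com/schemas/sitemap/0.84"

pages = [
    ("http://www.geocities.com/binnyva/code/javascript", "2004-06-08T09:28:34Z", "0.8"),
    ("http://www.geocities.com/binnyva/code/javascript/basic_tutorial", "2005-01-01T15:49:32Z", "1.0"),
]

urlset = ET.Element("urlset", xmlns=NS)
for loc, lastmod, priority in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "priority").text = priority

# Write the file with an XML declaration, as in the example above.
ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8", xml_declaration=True)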

For more details, see Google's page on the Google Sitemaps protocol.

But you don't have to worry about making an XML file listing all your pages by hand - good ol' Google has provided a Python script that will do the job for you. This script (called Sitemap Generator) can be downloaded from its SourceForge page at http://sourceforge.net/project/showfiles.php?group_id=137793&package_id=153422

Basic Info...

Name: sitemap_gen
Version: 1.0
Summary: Sitemap Generator
Home-page: http://sourceforge.net/projects/goog-sitemap_gen/
Author: Google Inc.
Author-email: opensource@google.com
License: BSD

From the README file...

The sitemap_gen.py script analyzes your web server and generates one or more
Sitemap files.  These files are XML listings of content you make available on
your web server.  The files can be directly submitted to search engines as
hints for the search engine web crawlers as they index your web site.  This
can result in better coverage of your web content in search engine indices,
and less of your bandwidth spent doing it.

The sitemap_gen.py script is written in Python 2.2 and released to the open
source community for continuous improvements under the BSD 2.0 new license,
which can be found at:

http://www.opensource.org/licenses/bsd-license.php

The original release notes for the script, including a walk-through for
webmasters on how to use it, can be found at the following site:

http://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html

How to use the script.

First, you have to create a configuration XML file with the details of your site. Just copy the 'example_config.xml' file that comes with the sitemap_gen script and edit it. The file has enough explanations in it, so it should be easy to create a config file that matches your site. The config file I used for one of my sites is given below...

<?xml version="1.0" encoding="UTF-8"?>
<!--
sitemap_gen.py configuration script - Bin-Co
-->

<site
base_url="http://www.geocities.com/binnyva/code/"
store_into="D:/code/sitemap.xml"
verbose="1"
suppress_search_engine_notify="1"
>

<directory  path="D:/code"    url="http://www.geocities.com/binnyva/code/" />


<!-- Exclude URLs that point to UNIX-style hidden files               -->
<filter  action="drop"  type="regexp"    pattern="/\.[^/]*$"    />

<!-- Exclude URLs that end with a '~'   (IE: emacs backup files)      -->
<filter  action="drop"  type="wildcard"  pattern="*~"           />

<!-- Exclude URLs that point to default index.html files.
  URLs for directories get included, so these files are redundant. -->
<filter  action="drop"  type="wildcard"  pattern="*index.htm*"  />

<!-- Custom Drops -->
<!-- Downloads -->
<filter  action="drop"  type="wildcard"  pattern="*.zip"  />
<filter  action="drop"  type="wildcard"  pattern="*.gz"  />
<filter  action="drop"  type="wildcard"  pattern="*.bz"  />
<filter  action="drop"  type="wildcard"  pattern="*.exe"  />

<!-- Images -->
<filter  action="drop"  type="wildcard"  pattern="*.gif"  />
<filter  action="drop"  type="wildcard"  pattern="*.jpg"  />
<filter  action="drop"  type="wildcard"  pattern="*.png"  />

<!-- Code Files -->
<filter  action="drop"  type="wildcard"  pattern="*.tcl"  />
<filter  action="drop"  type="wildcard"  pattern="*.pl"  />
<filter  action="drop"  type="wildcard"  pattern="*.cgi"  />

<!-- Script files  -->
<filter  action="drop"  type="wildcard"  pattern="*.js"  />
<filter  action="drop"  type="wildcard"  pattern="*.css"  />

<!-- Some Folders must not get in. -->
<filter  action="drop"  type="wildcard"  pattern="*Temp*"  />
<filter  action="drop"  type="wildcard"  pattern="*cgi-bin*"  />

</site>
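
To get a feel for what those filters do, here is a rough Python approximation of the matching rules - wildcard patterns behave like shell globs and regexp patterns like ordinary regular expressions. This is a sketch of the idea, not the script's actual code, so the real matching details may differ:

# Rough approximation of sitemap_gen's drop filters (a sketch, not the
# script's actual code). A URL is dropped if any filter pattern matches.
import fnmatch
import re

filters = [
    ("regexp",   r"/\.[^/]*$"),   # UNIX-style hidden files
    ("wildcard", "*~"),           # emacs backup files
    ("wildcard", "*.zip"),        # downloads
]

def is_dropped(url):
    for ftype, pattern in filters:
        if ftype == "wildcard" and fnmatch.fnmatch(url, pattern):
            return True
        if ftype == "regexp" and re.search(pattern, url):
            return True
    return False

print(is_dropped("http://www.geocities.com/binnyva/code/.hidden"))   # True
print(is_dropped("http://www.geocities.com/binnyva/code/page.html")) # False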

The sitemap_gen.py script is actually meant to be run on your web server - but that is not the way I did it. I ran the script on my local machine and then uploaded the XML file it created to my server. Just make sure that the locations you give in the config file point correctly to your pages. Now run the sitemap_gen.py script using Python. I used this command.

python sitemap_gen.py --config=binco.xml

Just type this command in the terminal - you do have Python installed, don't you?

If everything goes well, a message will be shown - in my case it was...

Reading configuration file: binco.xml
Walking DIRECTORY "D:/code\"
Sorting and normalizing collected URLs.
Writing Sitemap file "D:\code\sitemap.xml" with 183 URLs
Search engine notification is suppressed.
Count of file extensions on URLs:
    45  (no extension)
   123  .html
    12  .txt
     3  .xml
Number of errors: 0
Number of warnings: 0

Have a look at the sitemap.xml file that was created (a quick way to do that is sketched below) - all the pages of your site in one file. Now we have to submit this file to Google. Before doing that, upload the XML file to your server and note its location. Then log in to the Google Sitemaps site using your Gmail account - if you don't have one, create it now. Click on the 'Add a Sitemap' link and enter the location of the XML file we just created in the input field. This adds the XML file to the list of sitemaps that Google must download and parse - this will take some time, so be patient.
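Here is that quick look: a few lines of Python that parse the generated file and list its contents. A sketch only - note that the namespace string must match the one declared in your sitemap file:

# Quick sanity check of the generated sitemap (a sketch). Counts the
# <url> entries and prints the first few locations.
import xml.etree.ElementTree as ET

NS = "{http://www.google.com/schemas/sitemap/0.84}"

tree = ET.parse("sitemap.xml")
urls = tree.getroot().findall(NS + "url")
print("%d URLs in the sitemap" % len(urls))
for url in urls[:5]:
    print(url.find(NS + "loc").text)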

That's all there is to it. The whole process can be automated - this is how Google wants it to be. Just upload the Python script and its config file to your server and set a cron job to run it every week. Make sure that the config file does NOT have the ' suppress_search_engine_notify="1" ' option. Now, whenever the script runs, it will create the XML file and notify Google that a new sitemap is available.
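
If you would rather send that notification yourself, the idea boils down to fetching a 'ping' URL that carries your sitemap's address. A minimal sketch in Python follows - the exact ping URL used here is my assumption based on how the protocol was described at the time, so verify it against Google's documentation before relying on it:

# Notify Google that the sitemap has changed (a sketch). The ping URL
# below is an assumption based on the Sitemaps documentation of the
# time; verify it against Google's docs before relying on it.
import urllib.parse
import urllib.request

SITEMAP_URL = "http://www.geocities.com/binnyva/code/sitemap.xml"
PING = "http://www.google.com/webmasters/sitemaps/ping?sitemap="

response = urllib.request.urlopen(PING + urllib.parse.quote(SITEMAP_URL, safe=""))
print(response.status)  # 200 means the ping was received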

This is another good idea from Google. But even they are not confident that it will work. The following text is taken from the Google Blog:

'We're undertaking an experiment called Google Sitemaps that will either fail miserably, or succeed beyond our wildest dreams.'

Anything that can be used to improve a site's position in Google search will be abused by some webmasters. And this technology gives webmasters a lot of control - so no doubt someone will find a way to cheat using Sitemaps. Until then, let us enjoy this technology.


4 Comments:

M said...

thanks for the great tutorial, this really helped out, knowing how to run it locally

Anonymous said...

Check out the sitemap xml generator software at fnoware

fnoware SiteMap XML

Unknown said...

Do you have any idea how I can modify the configuration file to give all the pages in a specific folder a priority of 1?

Anonymous said...

Hi, I am very new to the XML sitemap generator. My website (a PHP website) has one main URL, and under it there are many sub-URLs (thousands of them, based on IDs). How can I list those URLs so users can find them? Can I use wildcards (*)?