GoogleSitemaps

Summary: How to submit a complete list of web pages to Google
Version:
Prerequisites:
Status:
Maintainer:
Categories: RSS, Integration, Robots, Google

Search engines, and especially Google, are a major source of visitors for many, if not most, websites.

Optimal indexing of a web page by the search engine spider (for example Googlebot) is a key factor in achieving good search engine results.

A spider visits a web page, indexes it, and then crawls on by following the links on the page. PmWiki ensures proper linkage between the different wiki pages and makes it easy to generate a sitemap with the (:pagelist:) directive. Still, because a spider indexes a website step by step, it can take a while before a site is fully indexed, and it also takes a while before added or changed pages are re-spidered.

Recently Google introduced a new method to have a website indexed: Google Sitemaps, as usual as a beta program.

Google Sitemaps allows a webmaster to submit a complete list of web pages to Google. Several content management systems provide a method to use Google Sitemaps. I think it's time for PmWiki as well.

Using RSS

One method to provide a (partial) index to Google Sitemaps is to use the RSS feed provided by PmWiki, based on, for example, Main.AllRecentChanges:

http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss
NOTE: With recent changes to PmWiki you should now use: http://yoursite.com/pmwiki.php?n=Site.AllRecentChanges&action=rss

  • the RSS module must be enabled: include_once("scripts/rss.php");

Do not use a syntax like .../pmwiki.php/Main/AllRecentChanges?action=rss. Why? From Google:

The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://yoursite.com/catalog/sitemap.xml can include any URLs starting with http://yoursite.com/catalog/ but cannot include URLs starting with http://yoursite.com/images/.

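Thus a feed submitted with the path syntax above could only cover URLs under .../pmwiki.php/Main/, and pages such as .../pmwiki.php/Cookbook/... would not be added to the index.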
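The scope rule quoted above can be checked with a few lines of PHP. This is only a sketch of the rule as quoted, not Google's actual validation code: sitemap_can_include() is a hypothetical helper (it is not part of PmWiki), and it ignores ports and letter case for simplicity.

```php
<?php
// Hypothetical helper (not part of PmWiki): returns true when $pageUrl
// lies under the directory that contains the sitemap at $sitemapUrl.
// Simplification: ports and upper/lower case differences are ignored.
function sitemap_can_include($sitemapUrl, $pageUrl) {
    $parts = parse_url($sitemapUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    // Keep everything up to and including the last slash of the path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    $prefix = $parts['scheme'] . '://' . $parts['host'] . $dir;
    return strncmp($pageUrl, $prefix, strlen($prefix)) === 0;
}

$page = 'http://yoursite.com/pmwiki.php/Cookbook/GoogleSitemaps';

// Path syntax: the feed "lives" in /pmwiki.php/Main/, so Cookbook pages are out of scope.
var_dump(sitemap_can_include('http://yoursite.com/pmwiki.php/Main/AllRecentChanges?action=rss', $page)); // false

// Query syntax: the feed "lives" in /, so every page of the wiki is in scope.
var_dump(sitemap_can_include('http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss', $page)); // true
```

With the ?n= syntax the feed sits in the top directory, which is why that form is the one to submit.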

Set parameters for a more complete list

It might be useful to tweak the RSS feed a little, because by default it only displays the last 20 changes:

  if ( $action=="sitemap" ) {
    $RssMaxItems=50000;                           # maximum items to display
    $RssSourceSize=0;                        # max size to build desc from
    $RssDescSize=0;                          # max desc size
    $action="rss";
  }
  include_once("scripts/rss.php");

The code above didn't work.

I used these lines instead:

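  if ( $action=="sitemap" ) {
    # SDVA only sets keys that are not already set, so an explicit
    # count= in the request still takes precedence
    SDVA($_REQUEST, array('count' => 50000));
    $action="rss";
  }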

Set .htaccess to overcome directory layout restrictions

Google is quite strict about the directory layout: the sitemap URL must be in the top directory of your website. However, redirects are accepted, so a little tweak in the .htaccess file can overcome that restriction:

 Redirect /sitemap.rss [...]

Now use a syntax like:

http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=sitemap

Submit this link to Google Sitemaps using the ping link or the web form (see the Google pages for details).

Using XML-Sitemap

Google provides a special XML schema for this purpose.

The benefit of using the XML-Sitemap schema is its extra tags:

  • priority: how important this page is (relative to the other pages on the site)
  • changefreq: how often the page is updated

The changefreq could be derived from the values in the page history. I'm not sure yet how to determine the priority of a page, probably by using some pattern array.
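As a starting point for discussion, here is a minimal sketch of how changefreq might be derived from edit times and how one entry of the XML sitemap could be written. Both functions are hypothetical and not part of PmWiki or of the script attached to this page; the thresholds and the fallback value are assumptions, and the priority is simply passed in by the caller. The resulting entry belongs inside the urlset element of the sitemap file.

```php
<?php
// Hypothetical helper: map a list of edit timestamps (Unix seconds)
// to one of the changefreq values defined by the Sitemap protocol.
function guess_changefreq(array $editTimes) {
    if (count($editTimes) < 2) {
        return 'yearly';                      // assumption: too little history to judge
    }
    sort($editTimes);
    $span = end($editTimes) - reset($editTimes);
    $avg  = $span / (count($editTimes) - 1);  // mean seconds between edits
    if ($avg <= 3600)       return 'hourly';
    if ($avg <= 86400)      return 'daily';
    if ($avg <= 7 * 86400)  return 'weekly';
    if ($avg <= 31 * 86400) return 'monthly';
    return 'yearly';
}

// Hypothetical helper: build one <url> entry.
// $lastmod is a Unix timestamp; $priority is a number from 0.0 to 1.0.
function sitemap_url_entry($loc, $lastmod, $changefreq, $priority) {
    return "  <url>\n"
         . "    <loc>" . htmlspecialchars($loc, ENT_QUOTES) . "</loc>\n"   // escape &, <, >, quotes
         . "    <lastmod>" . gmdate('Y-m-d\TH:i:s\Z', $lastmod) . "</lastmod>\n"
         . "    <changefreq>" . $changefreq . "</changefreq>\n"
         . "    <priority>" . number_format($priority, 1, '.', '') . "</priority>\n"
         . "  </url>\n";
}

// Usage with made-up data: three edits, one day apart, gives "daily".
$edits = array(1136073600, 1136160000, 1136246400);
echo sitemap_url_entry('http://yoursite.com/pmwiki.php?n=Main.HomePage',
                       end($edits), guess_changefreq($edits), 0.5);
```

Using the mean interval between edits keeps the sketch simple; a page that was edited heavily long ago and then left alone would be overrated, so weighting recent edits more strongly may be worth considering.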

Any thoughts are welcome.

A Basic script

Note: This script expects the page file to have additional "name" and "time" entries, see PageFileFormat.

Changelog

1.7 support EnablePageListProtect
 Tested with pmwiki 2.1beta14-15
 Added Site to exclude pattern

Contributors

Comments

See discussion at GoogleSitemaps-Talk