GoogleSitemaps
How can we assist search engine crawlers in finding new and updated pages on our website?
Search engines, especially Google, are a major source of visitors for many if not most websites. Optimal indexing of a webpage by the search engine spider (for example googlebot) is a key factor in achieving good search engine results.
A spider visits a web page, indexes it, and then crawls on by following the links on the page. Although PmWiki ensures proper linkage between wiki pages and enables easy generation of a sitemap by means of the (:pagelist:) directive, a spider indexes a website step by step, and it can take a while before it discovers newly added or updated pages.
Google allows you to assist robots in discovering updated pages through a specific site index "file" or feed: see https://support.google.com/webmasters/answer/183668 for details.
Using RSS
One method to provide an index of desired pages to Google Sitemaps is to use the RSS feed provided by PmWiki, based for example on Main.AllRecentChanges:
http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss
NOTE: Recent versions of PmWiki keep this page in the Site group, so you should now use:
http://yoursite.com/pmwiki.php?n=Site.AllRecentChanges&action=rss
or, with .htaccess rewrites:
http://yoursite.com/?n=Site.AllRecentChanges&action=rss
- the rss module must be enabled:
include_once("scripts/rss.php");
Do not use a path-style URL like ..../pmwiki.php/Main/AllRecentChanges?action=rss for the feed. Why? From Google:
The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://yoursite.com/catalog/sitemap.xml can include any URLs starting with http://yoursite.com/catalog/ but cannot include URLs starting with http://yoursite.com/images/.
Thus a feed located at the path-style URL above could not add pages under .../pmwiki.php/Cookbook/... to the index.
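The location rule quoted above can be checked mechanically. The following is a minimal sketch in PHP, not part of PmWiki or of any Google tool. The function name sitemap_covers is made up for this illustration, and it assumes the rule is a plain prefix match on scheme, host, and the directory part of the sitemap's path.

```php
<?php
// Hypothetical helper (not a PmWiki function): returns true when $url lies
// inside the directory scope of the sitemap located at $sitemapUrl.
function sitemap_covers(string $sitemapUrl, string $url): bool
{
    $s = parse_url($sitemapUrl);
    $u = parse_url($url);
    if (!is_array($s) || !is_array($u)) {
        return false; // one of the URLs could not be parsed
    }
    // Scheme and host must be identical (compared case-insensitively).
    if (strtolower($s['scheme'] ?? '') !== strtolower($u['scheme'] ?? '')) {
        return false;
    }
    if (strtolower($s['host'] ?? '') !== strtolower($u['host'] ?? '')) {
        return false;
    }
    // The scope is the sitemap path up to and including its last slash.
    $path  = $s['path'] ?? '/';
    $slash = strrpos($path, '/');
    $dir   = ($slash === false) ? '/' : substr($path, 0, $slash + 1);
    // The candidate URL is covered when its path starts with that directory.
    $target = $u['path'] ?? '/';
    return strncmp($target, $dir, strlen($dir)) === 0;
}

// Path-style feed URL: the scope is /pmwiki.php/Main/, so Cookbook pages fall outside it.
var_dump(sitemap_covers(
    'http://yoursite.com/pmwiki.php/Main/AllRecentChanges?action=rss',
    'http://yoursite.com/pmwiki.php/Cookbook/GoogleSitemaps'
)); // bool(false)

// Query-style feed URL: the scope is /, so every page served by pmwiki.php is covered.
var_dump(sitemap_covers(
    'http://yoursite.com/pmwiki.php?n=Site.AllRecentChanges&action=rss',
    'http://yoursite.com/pmwiki.php?n=Cookbook.GoogleSitemaps'
)); // bool(true)
```

Under this assumption, the query-style URL works because everything after the question mark is ignored when the directory scope is determined.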
Set parameters for a more complete list
It might be useful to tweak the RSS feed a little, because by default it only displays the last 20 changes:
if ( $action=="sitemap" ) {
  $RssMaxItems=50000;   # maximum items to display
  $RssSourceSize=0;     # max size to build desc from
  $RssDescSize=0;       # max desc size
  $action="rss";
}
include_once("scripts/rss.php");
The code above did not work for me. I used this line instead:
if ( $action=="sitemap" ) { SDVA($_REQUEST, array('count' => 50000)); $action="rss"; }
Set .htaccess to overcome directory layout restrictions
Google is quite strict about the directory layout: the sitemap URL must be in the top directory of your website. However, redirects are accepted, so a little tweak in the .htaccess file can overcome that restriction:
Redirect /sitemap.rss http://yoursite.com/index.php/Site/AllRecentChanges?action=sitemap
Now use a syntax like:
http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=sitemap
Submit this link to Google Sitemaps using https://www.google.com/webmasters/tools/sitemap-list or the web form (see the Google pages for details).
Using XML-Sitemap
Google provides a special XML scheme for this purpose.
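For orientation, a file in this scheme is a urlset element containing one url entry per page, each with a loc and optional lastmod, changefreq, and priority elements. The sketch below builds such a file from a plain PHP array. It is only an illustration: build_sitemap and the layout of the $pages array are assumptions made for this example and are not PmWiki APIs.

```php
<?php
// Hypothetical helper: build a minimal XML sitemap from a list of pages.
// Each entry needs 'loc'; 'lastmod' (Unix timestamp), 'changefreq', and
// 'priority' are optional. This array layout is an assumption of the sketch.
function build_sitemap(array $pages): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($pages as $p) {
        $xml .= "  <url>\n";
        // URLs must be entity-escaped, e.g. & becomes &amp;
        $xml .= '    <loc>'
              . htmlspecialchars($p['loc'], ENT_QUOTES | ENT_XML1, 'UTF-8')
              . "</loc>\n";
        if (isset($p['lastmod'])) {
            // W3C datetime format in UTC
            $xml .= '    <lastmod>' . gmdate('Y-m-d\TH:i:s\Z', $p['lastmod']) . "</lastmod>\n";
        }
        if (isset($p['changefreq'])) {
            $xml .= '    <changefreq>'
                  . htmlspecialchars($p['changefreq'], ENT_QUOTES | ENT_XML1, 'UTF-8')
                  . "</changefreq>\n";
        }
        if (isset($p['priority'])) {
            // Priority ranges from 0.0 to 1.0; one decimal place is enough
            $xml .= '    <priority>' . number_format((float)$p['priority'], 1, '.', '') . "</priority>\n";
        }
        $xml .= "  </url>\n";
    }
    $xml .= "</urlset>\n";
    return $xml;
}

// Example call with made-up values.
echo build_sitemap([
    [
        'loc'        => 'http://yoursite.com/pmwiki.php?n=Main.HomePage',
        'lastmod'    => 0,
        'changefreq' => 'weekly',
        'priority'   => 0.5,
    ],
]);
```

Escaping the loc value matters for PmWiki URLs in particular, because query-style links contain ampersands as soon as a second parameter is present.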
The benefit of using the XML-Sitemap scheme is its additional per-page tags, such as changefreq and priority.
The changefreq could be derived from the values of the page history. I'm not sure yet how to determine the priority of a page, probably by using some pattern array.
Any thoughts are welcome.
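One possible way to derive changefreq from the page history is to average the time between edits and map that interval onto the values the sitemap protocol allows. The sketch below assumes the edit times are already available as Unix timestamps; how they are read from the PmWiki page history is left out. The function name estimate_changefreq, the thresholds, and the fallback to yearly are all assumptions of this example.

```php
<?php
// Hypothetical helper: estimate a sitemap <changefreq> value from the
// Unix timestamps of a page's past edits.
function estimate_changefreq(array $editTimes): string
{
    if (count($editTimes) < 2) {
        return 'yearly'; // assumption: too little history, so guess conservatively
    }
    sort($editTimes);
    // Average number of seconds between consecutive edits.
    $span = end($editTimes) - reset($editTimes);
    $avg  = $span / (count($editTimes) - 1);

    $hour = 3600;
    $day  = 24 * $hour;
    // Thresholds are arbitrary choices for this sketch.
    if ($avg <= $hour) {
        return 'hourly';
    }
    if ($avg <= $day) {
        return 'daily';
    }
    if ($avg <= 7 * $day) {
        return 'weekly';
    }
    if ($avg <= 31 * $day) {
        return 'monthly';
    }
    return 'yearly';
}

// Three edits, two days apart on average.
echo estimate_changefreq([0, 2 * 86400, 4 * 86400]), "\n"; // weekly
```

A plain average is easily skewed by one burst of edits. Weighting recent edits more heavily would be an alternative, but that is beyond this sketch.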
A Basic script
Changelog
1.7 | support EnablePageListProtect | Tested with pmwiki 2.1beta14-15 |
Added Site to exclude pattern |
Contributors
Comments
See discussion at GoogleSitemaps-Talk