How can I control web robots that try to scan (or index) my wiki? The robots.txt file is too complicated to maintain. In particular, I don't want robots following Edit or History links.
To some extent, PmWiki already controls robots, but you can add custom markup to refine your control.
As distributed, PmWiki automatically adds <meta name='robots' ... /> tags to every page. For normal browsing of pages outside the PmWiki group, the value is "index,follow"; for all other actions (edit, upload, diff, etc.) the value is "noindex,nofollow".
The pages in the PmWiki group are not indexed, except for the PmWiki.PmWiki page itself.
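For instance, viewing an ordinary page and editing it produce, respectively, head tags like the following:

```html
<meta name='robots' content='index,follow' />
<meta name='robots' content='noindex,nofollow' />
```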
An admin can explicitly control the value of the robots meta tag by setting the $MetaRobots variable. For per-page control, an admin can add this custom markup to a config.php file:
Markup('robots', 'directives', '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/e', "PZZ(\$GLOBALS['MetaRobots'] = '$1')");
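Note that the /e pattern modifier used above was deprecated in PHP 5.5 and removed in PHP 7. On PmWiki 2.2.58 and later, an equivalent rule can be written with Markup_e(), where the replacement code receives the regular-expression matches in the array $m. A sketch, assuming Markup_e() is available:

```php
# Assumes PmWiki 2.2.58+ (provides Markup_e); $m[1] holds the
# directive value captured from (:robots ...:).
Markup_e('robots', 'directives',
  '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/',
  "PZZ(\$GLOBALS['MetaRobots'] = \$m[1])");
```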
Then one can do any of
(:robots index,follow:)
(:robots index,nofollow:)
(:robots noindex,follow:)
(:robots noindex,nofollow:)
to change the <meta name='robots' ... /> tag.
If you want to make sure that robots go ahead and index a page and follow links on all pages (including the PmWiki docs), then you can set (in local/config.php):
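Using the $MetaRobots variable described above, a single line in local/config.php is enough:

```php
# In local/config.php: emit "index,follow" in the robots meta tag
# on every page, including the PmWiki documentation group.
$MetaRobots = 'index,follow';
```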
Newer versions of PmWiki (since 2.1.beta8) automatically return "403 Forbidden" errors to robots for any action other than ?action=browse, ?action=rss, or ?action=dc. You can extend this protection to cookie-setting configuration actions such as ?setskin=..., or to other query strings in links, by adding a dummy action to the link:
In addition, if $EnableRobotCloakActions is set, then any ?action= parameters are removed from page links when viewed by a robot, so that those robots won't blindly follow links to unimportant pages. At the moment $EnableRobotCloakActions is disabled by default, because some admins may feel that presenting robots with such modified views of a page might cause their sites to be negatively rated by search engines. (I've seen opinions on both sides of the issue here.) - Pm on Pmwiki-users list
In your skin's template file, you can add a rel='nofollow' attribute to action links. Well-behaved search robots would follow this link:

<a href='$PageUrl?action=diff' title='$[History of this page]'>$[Page History]</a>

but would not follow this one:

<a href='$PageUrl?action=diff' title='$[History of this page]' rel='nofollow'>$[Page History]</a>
Actually, if one reads Google's description of rel='nofollow' carefully, it doesn't say that robots won't follow such links, only that the links won't be credited when ranking pages.
For PmWiki 2.2, here's something you can use if you want to allow robots to follow links to external sites and avoid wasting bandwidth by having robots blindly follow links to unimportant wiki pages.
# Remove the default "rel='nofollow'" attribute for external links.
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
# Eliminate forbidden ?action= values from page links returned to robots.
$EnableRobotCloakActions = 1;
See discussion at ControllingWebRobots-Talk