How can I control web robots that try to scan (or index) my wiki? The robots.txt file is too complicated to maintain. In particular, I don't want robots following Edit or History links.
To some extent, PmWiki already controls robots, but you can add custom markup to refine your control.
As distributed, PmWiki automatically adds a <meta name='robots' ... /> tag to every page. For normal browsing of pages outside the PmWiki group, the value is "index,follow"; for all other actions (edit, upload, diff, etc.) the value is "noindex,nofollow".
The pages in the PmWiki group are not indexed, except for the PmWiki.PmWiki page itself.
An admin can explicitly control the value of the robots meta-tag by setting $MetaRobots in a configuration file. The robots meta-tag can be disabled entirely by setting $MetaRobots to "". Note that either setting applies to the entire site.
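For example, a site-wide setting in local/config.php might look like this (the values shown are illustrative):

```php
# local/config.php
# Tell robots to index pages but not follow links, site-wide:
$MetaRobots = 'index,nofollow';

# Or disable the robots meta-tag entirely:
# $MetaRobots = '';
```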
To control the meta-tag on a per-page basis instead, an admin can add this custom markup to a local/config.php file:
Markup('robots', 'directives', '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/e', "PZZ(\$GLOBALS['MetaRobots'] = '$1')");
For PmWiki 2.2.58 and newer running on PHP 5.5 or later, use the following code instead:
Markup_e('robots', 'directives', '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/', "PZZ(\$GLOBALS['MetaRobots'] = \$m[1])");
Then one can use any of the following directives in a page:

(:robots index,follow:)
(:robots index,nofollow:)
(:robots noindex,follow:)
(:robots noindex,nofollow:)
to change the <meta name='robots' ... /> tag.
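For instance, a page containing (:robots noindex,follow:) would then be served with a tag along these lines in its HTML head (a minimal sketch of the emitted output):

```html
<meta name='robots' content='noindex,follow' />
```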
By default, PmWiki adds a rel='nofollow' attribute to external links. If you want robots to follow the links on all pages (including the PmWiki documentation), then you can set (in local/config.php):
$UrlLinkFmt= "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
Newer versions of PmWiki (since 2.1.beta8) automatically return a "403 Forbidden" error to robots for any action other than ?action=browse, ?action=rss, or ?action=dc. You can extend this protection to cookie-setting configuration actions such as ?setskin=..., or to other query strings in links, by adding a dummy action to the link.
PmWiki interprets any action it does not recognize as action=browse, but since the dummy action is not in the list of actions allowed for robots, robots receive a "403 Forbidden" error page instead.
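For example, assuming a skin that honors a ?setskin= parameter (the parameter value and the dummy action name here are illustrative), a link like the following still works for human visitors, while the unrecognized action=nop causes robots to receive a 403 instead of crawling the page:

```
[[Main/HomePage?setskin=night&action=nop | switch to the night skin]]
```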
In addition, if $EnableRobotCloakActions is set, then any ?action= parameters are removed from page links when viewed by a robot, so that those robots won't blindly follow links to unimportant pages. At the moment $EnableRobotCloakActions is disabled by default, because some admins may feel that presenting robots with such modified views of a page might cause their sites to be negatively rated by search engines. (I've seen opinions on both sides of the issue here.) - Pm on Pmwiki-users list
In your skin's template file, add a rel='nofollow' attribute to the Edit Page and Page History links. Light Skin already has this done for you, and the next release of Lean Skin will have it also. See Google's blog entry on rel='nofollow' for more information.
Well-behaved search robots that would follow this link:
<a href='$PageUrl?action=diff' title='$[History of this page]'>$[Page History]</a>
would not follow this one:
<a href='$PageUrl?action=diff' title='$[History of this page]' rel='nofollow'>$[Page History]</a>
Actually, if one reads the Google link carefully it doesn't say that rel='nofollow' causes a search robot not to follow a link. What it says is that the robot will not give the link any weight in its ranking algorithm. In keeping with Google's philosophy, I suspect Googlebot still follows the link. --Pm
For PmWiki 2.2, here's something you can use if you want to allow robots to follow links to external sites and avoid wasting bandwidth by having robots blindly follow links to unimportant wiki pages.
# Remove the default "rel='nofollow'" attribute for external links.
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
# Eliminate forbidden ?action= values from page links returned to robots.
$EnableRobotCloakActions = 1;
See discussion at ControllingWebRobots-Talk