ControllingWebRobots

Summary: How to control web robots (bots) that try to scan or index your wiki
Version: 1.0
Prerequisites:
Status: stable
Maintainer: Petko
Categories: Security PHP72

Question

How can I control web robots that try to scan (or index) my wiki? The robots.txt file is too complicated to maintain. In particular, I don't want robots following Edit or History links.

Answer

To some extent, PmWiki already controls robots, but you can add custom markup to refine your control.

As distributed, PmWiki automatically adds <meta name='robots' ... /> tags to every page. For normal browsing of pages not in the PmWiki, Site, or SiteAdmin groups, the value is "index,follow"; for all other actions (edit, upload, diff, etc.) the value is "noindex,nofollow".
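For example, browsing a page in a normal group emits

 <meta name='robots' content='index,follow' />

in the HTML head, while the same page requested with ?action=edit emits

 <meta name='robots' content='noindex,nofollow' />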

The pages in the PmWiki group are not indexed, except for the PmWiki.PmWiki page itself.

An admin can explicitly control the value of the robots meta-tag by setting $MetaRobots in a configuration file. The robots meta-tag can be disabled entirely by setting $MetaRobots to "". However, this applies to the entire site.
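For example, a minimal sketch for local/config.php that keeps robots out of the whole site:

# Tell all robots not to index any page and not to follow any links:
$MetaRobots = 'noindex,nofollow';
# Or disable the robots meta-tag entirely:
# $MetaRobots = '';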

An admin can add this custom markup rule to local/config.php:

# Define a (:robots ...:) directive that overrides the robots
# meta-tag for the page being displayed:
Markup('robots', 'directives',
     '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/',
     "MarkupRobots");
function MarkupRobots($m) {
  # $m[1] is the directive's argument, e.g. "noindex,nofollow"
  $GLOBALS['MetaRobots'] = $m[1];
  return '';
}

The above works with PmWiki 2.2.56 or newer, on PHP versions 4 through 7.2.

For PmWiki 2.2.55 or older, the following will work only with PHP 5.4 or older (it relies on the preg_replace() /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7):

   Markup('robots', 'directives',
     '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/e',
     "PZZ(\$GLOBALS['MetaRobots'] = '$1')");

Then one can use any of

(:robots index,follow:)
(:robots index,nofollow:)
(:robots noindex,follow:)
(:robots noindex,nofollow:)

to change the <meta name='robots' ... /> tag.
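For example, a page containing (:robots noindex,follow:) will be served with

 <meta name='robots' content='noindex,follow' />

in its head, asking robots not to index the page while still following its links.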

If you want robots to index and follow links on all pages (including the PmWiki documentation), set the following in local/config.php:

# Remove the default "rel='nofollow'" attribute from external links.
$UrlLinkFmt =
    "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
# Index and follow links on every page, regardless of group or action.
$MetaRobots = 'index,follow';

Newer versions of PmWiki (since 2.1.beta8) automatically return "403 Forbidden" errors to robots for any action other than ?action=browse, ?action=rss, or ?action=dc. You can extend this protection to cookie-setting configuration links such as ?setskin=... (or other query parameters in links) by adding a dummy action to the link: ?action=set&setskin=... PmWiki interprets any action it does not recognize as action=browse, but since the dummy action is not in the list of actions allowed for robots, robots get a "403 Forbidden" error page instead.
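The list of actions permitted to robots is kept in PmWiki's $RobotActions array, so an admin can adjust it from local/config.php. A minimal sketch, assuming you also want robots to be able to fetch the wikitext of pages:

# Additionally allow recognized robots to use ?action=source;
# any action not listed in $RobotActions returns "403 Forbidden" to robots.
$RobotActions['source'] = 1;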

In addition, if $EnableRobotCloakActions is set, then any ?action= parameters are removed from page links when viewed by a robot, so that those robots won't blindly follow links to unimportant pages. At the moment $EnableRobotCloakActions is disabled by default, because some admins may feel that presenting robots with such modified views of a page might cause their sites to be negatively rated by search engines. (I've seen opinions on both sides of the issue here.) - Pm on Pmwiki-users list

Another Answer

In your skin's template file, add a "rel='nofollow'" attribute to the Edit Page and Page History links. Light Skin already has this done for you, and the next release of Lean Skin will have it also. See this Google Blog entry for information about "rel='nofollow'".

Well-behaved search robots that would follow this link

 <a href='$PageUrl?action=diff'
   title='$[History of this page]'>$[Page History]</a>

would not follow this one:

 <a href='$PageUrl?action=diff'
   title='$[History of this page]' rel='nofollow'>$[Page History]</a>

--HaganFox

Actually, if one reads the Google link carefully it doesn't say that rel='nofollow' causes a search robot to not follow a link. What it says is that the robot will not give the link any weight in its ranking algorithm. In keeping with Google's philosophy, I suspect Googlebot still follows the link. --Pm

Updated Answer

For PmWiki 2.2, here's something you can use if you want to allow robots to follow links to external sites and avoid wasting bandwidth by having robots blindly follow links to unimportant wiki pages.

# Remove the default "rel='nofollow'" attribute for external links.
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
# Eliminate forbidden ?action= values from page links returned to robots.
$EnableRobotCloakActions = 1;

--HaganFox

See also the recipe External links for how to configure just some links to drop the nofollow attribute. --Petko November 09, 2017, at 03:58 AM

Contributors

Comments

See discussion at ControllingWebRobots-Talk


Category: Robots Security
