ControllingWebRobots

Summary: How to control web robots or bots trying to scan files
Version: 1.0
Prerequisites:
Status: stable
Maintainer:
Categories: Security

Question

How can I control web robots that try to scan (or index) my wiki? The robots.txt file is too complicated to maintain. In particular, I don't want robots following Edit or History links.

Answer

To some extent, PmWiki already controls robots, but you can add custom markup to refine your control.

As distributed, PmWiki adds a <meta name='robots' ... /> tag automatically to every page. For normal browsing of pages not in the PmWiki group, the value is "index,follow"; for all other actions (edit, upload, diff, etc.) the value is "noindex,nofollow".

The pages in the PmWiki group are not indexed, except for the PmWiki.PmWiki page itself.

An admin can explicitly control the value of the robots meta-tag by setting $MetaRobots in a configuration file. The robots meta-tag can be disabled entirely by setting $MetaRobots to "". However, this applies to the entire site.
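As a minimal sketch of the site-wide setting described above (the chosen values are only examples, and the file is assumed to be local/config.php):

```php
# local/config.php
# Ask robots not to index or follow anything on this site.
$MetaRobots = 'noindex,nofollow';

# Alternatively, suppress the robots meta-tag entirely:
# $MetaRobots = '';
```

Only one of the two assignments should be active at a time; the second is left commented out here.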

To control the value page by page instead, an admin can add this custom markup to a config.php file:

 
   Markup('robots', 'directives',
     '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/e',
     "PZZ(\$GLOBALS['MetaRobots'] = '$1')");

Then one can do any of

   (:robots index,follow:)
   (:robots index,nofollow:)
   (:robots noindex,follow:)
   (:robots noindex,nofollow:)

to change the <meta name='robots' ... /> tag.
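The markup rule above relies on the /e pattern modifier, which PHP deprecated in 5.5 and removed in 7.0. The following is a sketch of the same rule written with a callback, assuming a PmWiki version whose Markup() accepts a function name as its fourth argument (believed to be 2.2.56 and later; check your version). The function name RobotsDirectiveMarkup is hypothetical.

```php
# local/config.php
# Hypothetical callback; Markup() passes it the array of regex matches.
function RobotsDirectiveMarkup($m) {
  # $m[1] holds the directive list, e.g. "noindex,nofollow".
  $GLOBALS['MetaRobots'] = $m[1];
  # Return an empty string so the directive leaves no output on the page.
  return '';
}

# Same pattern as the original rule, without the /e modifier.
Markup('robots', 'directives',
  '/\\(:robots\\s+(\\w[\\w\\s,]*):\\)/',
  'RobotsDirectiveMarkup');
```

The (:robots ...:) directives are then used in page text exactly as listed above.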

If you want robots to index every page and follow the links on all pages (including the PmWiki docs), then you can set (in local/config.php):

 $UrlLinkFmt =
   "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
 $MetaRobots = 'index,follow';

Newer versions of PmWiki (since 2.1.beta8) automatically return "403 Forbidden" errors to robots for any action other than ?action=browse, ?action=rss, or ?action=dc. You can extend this to links that set cookies or preferences, such as ?setskin=..., or to other query links, by adding a dummy action to the link: ?action=set&setskin=... etc. PmWiki interprets any action it does not know as action=browse, but since the dummy action is not in the list of actions allowed for robots, robots get a "403 Forbidden" error page instead.
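The list of actions allowed for robots is, as far as I know, held in the $RobotActions array set up by scripts/robots.php; treat that name as an assumption and check it against your PmWiki version. A sketch of allowing one more action, using 'atom' purely as an example:

```php
# local/config.php
# Let robots request ?action=atom without receiving "403 Forbidden".
# The defaults (browse, rss, dc) are assumed to be filled in later by PmWiki.
$RobotActions['atom'] = 1;
```

Any action left out of the list, including a dummy action added to a link, continues to be refused to robots.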

In addition, if $EnableRobotCloakActions is set, then any ?action= parameters are removed from page links when viewed by a robot, so that those robots won't blindly follow links to unimportant pages. At the moment $EnableRobotCloakActions is disabled by default, because some admins may feel that presenting robots with such modified views of a page might cause their sites to be negatively rated by search engines. (I've seen opinions on both sides of the issue here.) - Pm on Pmwiki-users list

Another Answer

In your skin's template file, add a "rel='nofollow'" attribute to the Edit Page and Page History links. Light Skin already has this done for you, and the next release of Lean Skin will have it also. See this Google Blog entry for information about "rel='nofollow'".

Well-behaved search robots that would follow this link

 <a href='$PageUrl?action=diff'
  title='$[History of this page]'>$[Page History]</a>

would not follow this one

 <a href='$PageUrl?action=diff'
  title='$[History of this page]' rel='nofollow'>$[Page History]</a>

--HaganFox

Actually, if one reads the Google link carefully it doesn't say that rel='nofollow' causes a search robot to not follow a link. What it says is that the robot will not give the link any weight in its ranking algorithm. In keeping with Google's philosophy, I suspect Googlebot still follows the link. --Pm

Updated Answer

For PmWiki 2.2, here's something you can use if you want to allow robots to follow links to external sites and avoid wasting bandwidth by having robots blindly follow links to unimportant wiki pages.

# Remove the default "rel='nofollow'" attribute for external links.
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
# Eliminate forbidden ?action= values from page links returned to robots.
$EnableRobotCloakActions = 1;

--HaganFox

Discussion

I cannot see any reason to add rel='nofollow' to the Edit Page and Page History links, since PmWiki automatically adds 'noindex,nofollow' to the meta tag on the edit and history pages, so they are not indexed by default. So it seems to me that PmWiki is already controlling search bots and preventing well-behaved bots (ones which look at the meta tag) from indexing the history and edit pages.

Badly behaved search bots may be better excluded from searching through the wiki by means of an exclusion in a robots.txt file. It would be good to have advice about this here too.

Robots.txt is simply a text file that well-behaved robots read to determine which subdirectories they should index. A badly behaved robot just ignores it. It is a bit like a notice on an unlocked door that says "Do not open this door, please!". Controlling malicious robots requires more aggressive methods such as IP banning at the web server or router. NeilHerber January 31, 2005, at 06:18 AM

A way to tie up spam harvest bots. This site claims that the use of wpoison has already caused spam bots to be more respectful of the meta tags, and hence that using (:robots noindex,nofollow:) on pages with email addresses should reduce the risk of spam harvest bots picking them up. Francis September 17, 2006, at 04:48 PM

Are there "better behaved" robots which will read robots.txt but will ignore robots meta tags? ~Hans

There are probably poorly programmed robots that behave that way, but the "good" ones obey both. For more info see http://robotstxt.org/wc/exclusion.html

As of PmWiki 2.0 beta 20, the attribute rel='nofollow' is added automatically to all links pointing to external sites, i.e. all url links. This extends PmWiki's attempts to control search bots even further and will help to reduce link-spamming. HansB


From a post by Pm about "comment spamming":

All of these options are presently available in PmWiki v2:

1. rel="nofollow" for all external links (new default for beta20)

      $UrlLinkFmt = "<a class='urllink' href='\$LinkUrl' rel='nofollow'>\$LinkText</a>";

2. rel="nofollow" for unapproved external links only

      $UnapprovedLinkFmt = "<a class='apprlink' href='\$LinkUrl' rel='nofollow'>\$LinkText</a>";

3. no rel="nofollow" at all (default for beta19 and earlier)

      $UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";

4. not linking unapproved external links at all

      $UnapprovedLinkFmt = "\$LinkText";

Personally, on pmwiki.org I'm going to do #2 -- i.e., rel="nofollow" for unapproved external links only, because I want approved links to gain the page rank benefit of having been listed on my site. I'll probably also add an icon or marker after the unapproved links that lets them be quickly approved (via the appropriate password).

There's also a fifth category of links -- those that are generated via the InterMap. It's my feeling that InterMap links should not receive the rel="nofollow", as those sites have already been approved by the site maintainer. But that will come in another release.

In PmWiki v1, one can add rel="nofollow" to external links via:

      $UrlLinkFmt = "<a class='urllink' href='\$Url' rel='nofollow'>\$LinkText</a>";

The custom markup doesn't have any effect, since $HTMLHeaderFmt['robots'] is defined earlier in stdconfig.php, with a fixed value based on $MetaRobots at this stage (global default).

--DidierLebrun


How do you prevent robots from indexing or following links like ...Group/PageName/?setprefs=.... or any other cookie setting action link? (HansB)

Note that you cannot prevent robots from following links; all you can do is advise them not to do so (and then refuse to serve content when they do).

In July 2005, Google said that it would honor rel="nofollow" on individual links and not follow them, so that will help with Google. Other robots haven't declared if they follow such links. (Note that choosing to follow the link and choosing to count the link in page weighting results are two separate issues. Originally Google followed links with rel="nofollow" but simply didn't give the link any weight.)

So, there are two approaches -- you can hide such links from the robot ("cloaking"), or you can forbid content to a robot that follows the link. Many people feel that cloaking is an unwise practice, so that pretty much leaves forbidding content. So, the basic approach would have to be to send a "403 Forbidden" response if a robot sends a url that contains any query parameters other than ?action= and ?n=.

--Pm
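The "forbid content" approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not PmWiki's own implementation: the variable names and the simplified robot pattern are hypothetical, and real user-agent detection needs a much fuller pattern.

```php
# local/config.php
# Hypothetical, simplified pattern for recognizing robots by user agent.
$MyRobotPattern = '/bot|crawler|spider/i';
$MyAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

# Query parameters a robot may send: only ?action= and ?n=.
$MyAllowedParams = array('action' => 1, 'n' => 1);

# array_diff_key() returns the request parameters that are not in the
# allowed list; a non-empty result means the robot sent something else.
if (preg_match($MyRobotPattern, $MyAgent)
    && array_diff_key($_GET, $MyAllowedParams)) {
  header('HTTP/1.1 403 Forbidden');
  print('<h1>Forbidden</h1>');
  exit();
}
```

Ordinary visitors are unaffected, since the check applies only when the user agent matches the robot pattern.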


Category: Robots Security

Page last modified on September 10, 2011, at 11:42 AM