
00563: Don't show ?action= links to web spiders/robots

Summary: Don't show ?action= links to web spiders/robots

Created: 2005-10-20 00:31

Status: Closed - added to 2.1.beta8

Category: CoreCandidate

From: Pm

Assigned:

Priority: 55433

Version: 2.0.12

OS:

Description: Currently, when a web spider such as Googlebot or Yahoo! Slurp visits a PmWiki site, it tends to follow all of the ?action= links on a page (including ?action=edit, ?action=diff, etc.).

PmWiki's default configuration provides a <meta> tag to tell robots not to index pages when ?action= is specified in the link. However, by the time the robot sees that tag, the server has already incurred the expense of generating the page and sending it to the robot.

Pm proposes a module that detects when a robot is retrieving a page, and strips all "?action=" parameters from page links within the page. This prevents robots from seeing the ?action= links in the first place, reducing server overhead and bandwidth.
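
As a rough illustration of the proposal (not PmWiki's actual implementation), the sketch below detects a robot from its User-Agent header and strips ?action= queries from href attributes. The names ROBOT_PATTERN, ACTION_QUERY, is_robot, strip_action_links and render_for are hypothetical, the robot pattern is deliberately abbreviated, and the sketch assumes ?action= is the only query on the link.

```python
import re

# Hypothetical, abbreviated pattern; a real site would need a fuller robot list.
ROBOT_PATTERN = re.compile(r"googlebot|slurp|msnbot", re.IGNORECASE)

# Captures the href up to (not including) "?action=...", so the query can be dropped.
# Assumption: ?action= is the first and only query parameter on the link.
ACTION_QUERY = re.compile(r"""(href=["'][^"'?]*)\?action=[^"'#]*""", re.IGNORECASE)


def is_robot(user_agent):
    """Return True if the User-Agent string looks like a known robot."""
    return bool(ROBOT_PATTERN.search(user_agent or ""))


def strip_action_links(html):
    """Remove ?action=... queries from href attributes in the page output."""
    return ACTION_QUERY.sub(r"\1", html)


def render_for(user_agent, html):
    """Serve the stripped page to robots and the unchanged page to everyone else."""
    return strip_action_links(html) if is_robot(user_agent) else html
```

With this approach a robot still receives the page, but the edit and diff links it sees point back at the plain page, so it has nothing extra to crawl.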

Comments?

Stripping the action when a robot arrives is a good idea. I add rel=nofollow to the edit link, which works for Google, my best friend.
Some actions should be exempt from stripping:
1. browse ( :) )
2. rss
3. dc
Some aftercare is needed as well: what should happen when a robot arrives using a ?action= link? Most indexes already have a lot of these links in their databases, and a robot may also arrive via an external page (like site stats). On my site I return a 401 (Unauthorized) whenever a robot arrives.
List of robots I catch at the moment:
- slurp
- googlebot
- mediapartners
- xenu
- grub
- ingrid
- baiduspider
- metaweb
- nutch
- aipbot
- societyrobot
- teoma
- zoekybot
- gigabot
- yahoo
- vagabondo
- msnbot
- mirago
- omni
- zyborg
- (and a bunch in robots.txt and .htaccess)
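
A minimal sketch of how the aftercare described above might look, assuming robots are matched by case-insensitive substring against the User-Agent header, and assuming the 401 applies only to non-exempt ?action= requests. ROBOT_SUBSTRINGS, EXEMPT_ACTIONS, is_robot and status_for are hypothetical names. The substrings come from the list above and the exempt actions from the comment (browse, rss, dc).

```python
# Substrings taken from the robot list above; matching is case-insensitive.
ROBOT_SUBSTRINGS = (
    "slurp", "googlebot", "mediapartners", "xenu", "grub", "ingrid",
    "baiduspider", "metaweb", "nutch", "aipbot", "societyrobot", "teoma",
    "zoekybot", "gigabot", "yahoo", "vagabondo", "msnbot", "mirago",
    "omni", "zyborg",
)

# Actions that robots may still request, per the comment above.
EXEMPT_ACTIONS = {"browse", "rss", "dc"}


def is_robot(user_agent):
    """Return True if any known robot substring occurs in the User-Agent."""
    ua = (user_agent or "").lower()
    return any(name in ua for name in ROBOT_SUBSTRINGS)


def status_for(user_agent, action):
    """Return 401 when a robot requests a non-exempt action, otherwise 200."""
    if action and action not in EXEMPT_ACTIONS and is_robot(user_agent):
        return 401
    return 200
```

Plain substring matching is simple but coarse: a short entry such as omni may also match legitimate browsers, so a real list would probably need tighter patterns.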

good luck BrBrBr