ControllingWebRobots-Talk
This space is for User-contributed commentary and notes. Please include your name and a date along with your comment.
Comments
I cannot see any reason to add rel='nofollow' to the Edit Page and Page History links, since PmWiki automatically adds 'noindex,nofollow' to the robots meta tag on the edit and history pages, so they are not indexed by default. It seems to me that PmWiki is already controlling search bots and preventing well-behaved bots (ones which honor the meta tag) from indexing the history and the edit pages.
Badly behaved search bots may be better excluded from crawling the wiki by means of an exclusion rule in a robots.txt file. It would be good to have advice about this here too.
- Robots.txt is simply a text file that well-behaved robots read to determine which subdirectories they should index. A badly behaved robot just ignores it. It is a bit like a notice on an unlocked door that says "Do not open this door, please!". Controlling malicious robots requires more aggressive methods such as IP banning at the web server or router. NeilHerber January 31, 2005, at 06:18 AM
A way to tie up spam harvest bots: this site claims that the use of wpoison has already caused spam bots to be more respectful of the meta tags, and hence that using (:robots noindex,nofollow:) on pages with email addresses should reduce the risk of spam harvest bots picking them up. Francis September 17, 2006, at 04:48 PM
Are there "better behaved" robots which will read robots.txt but will ignore robots meta tags? ~Hans
There are probably poorly programmed robots that behave that way, but the "good" ones obey both. For more info see http://robotstxt.org/wc/exclusion.html
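For well-behaved crawlers, a robots.txt file in the web root can keep action URLs out of the index. A minimal sketch, assuming the wiki is reachable under /pmwiki/pmwiki.php and that the crawler understands wildcard rules (Google and most major crawlers do; the original standard does not):
User-agent: *
# Block edit, history and other action URLs while leaving normal page views crawlable
Disallow: /*?action=
# Or keep crawlers out of the wiki script entirely
# Disallow: /pmwiki/pmwiki.php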
With PmWiki 2.0 beta 20 the attribute rel='nofollow' is added automatically to all links pointing to external sites, i.e. all URL links. This extends PmWiki's control of search bots even further and will help to reduce link spamming. HansB
From a post by Pm about "comment spamming":
All of these options are presently available in PmWiki v2:
1. rel="nofollow" for all external links (new default for beta20)
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl' rel='nofollow'>\$LinkText</a>";
2. rel="nofollow" for unapproved external links only
$UnapprovedLinkFmt = "<a class='apprlink' href='\$LinkUrl' rel='nofollow'>\$LinkText</a>";
3. no rel="nofollow" at all (default for beta19 and earlier)
$UrlLinkFmt = "<a class='urllink' href='\$LinkUrl'>\$LinkText</a>";
4. not linking unapproved external links at all
$UnapprovedLinkFmt = "\$LinkText";
Personally, on pmwiki.org I'm going to do #2 -- i.e., rel="nofollow" for unapproved external links only, because I want approved links to gain the page rank benefit of having been listed on my site. I'll probably also add an icon or marker after the unapproved links that lets them be quickly approved (via the appropriate password).
There's also a fifth category of links -- those that are generated via the InterMap. It's my feeling that InterMap links should not receive the rel="nofollow", as those sites have already been approved by the site maintainer. But that will come in another release.
In PmWiki v1, one can add rel="nofollow" to external links via:
$UrlLinkFmt = "<a class='urllink' href='\$Url' rel='nofollow'>\$LinkText</a>";
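As a usage sketch (assuming PmWiki 2.x and option 2 above), the setting simply goes into local/config.php:
<?php if (!defined('PmWiki')) exit();
# Only external links that have not yet been approved get rel='nofollow';
# approved links keep the normal format and pass page rank to their targets.
$UnapprovedLinkFmt = "<a class='apprlink' href='\$LinkUrl' rel='nofollow'>\$LinkText</a>";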
The custom markup doesn't have any effect, because $HTMLHeaderFmt['robots'] is defined earlier, in stdconfig.php, with a fixed value based on $MetaRobots (the global default) at that stage.
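In other words, the site-wide robots meta tag is controlled from local/config.php by setting $MetaRobots before stdconfig.php runs; a sketch (the value shown is just an example):
$MetaRobots = 'index,follow';   # copied into $HTMLHeaderFmt['robots'] by stdconfig.php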
How do you prevent robots from indexing or following links like ...Group/PageName/?setprefs=.... or any other cookie setting action link? (HansB)
Note that you cannot prevent robots from following links, all you can do is advise them not to do so (and then refuse to serve content when they do).
In July 2005, Google said that it would honor rel="nofollow" on individual links and not follow them, so that will help with Google. Other robots haven't declared if they follow such links. (Note that choosing to follow the link and choosing to count the link in page weighting results are two separate issues. Originally Google followed links with rel="nofollow" but simply didn't give the link any weight.)
So, there are two approaches -- you can hide such links from the robot ("cloaking"), or you can forbid content to a robot that follows the link. Many people feel that cloaking is an unwise practice, so that pretty much leaves forbidding content. So, the basic approach would have to be to send a "403 Forbidden" response if a robot sends a url that contains any query parameters other than ?action= and ?n=.
--Pm
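A rough illustration of that "forbid content" approach in local/config.php; the user-agent pattern and the allowed-parameter list below are assumptions for the sketch, not PmWiki defaults:
# Refuse content to known robots whose request carries query parameters
# other than ?n= and ?action= (hypothetical robot pattern).
$bots = '/googlebot|slurp|msnbot|bingbot|yandex/i';
if (preg_match($bots, strval(@$_SERVER['HTTP_USER_AGENT']))) {
  foreach (array_keys($_GET) as $param) {
    if ($param != 'n' && $param != 'action') {
      header('HTTP/1.1 403 Forbidden');
      print('403 Forbidden');
      exit();
    }
  }
}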
The (:robots:) markup is great for setting noindex for an individual page.
Is it possible from within the wiki (rather than in config) to apply a setting to a group of pages, e.g. via the GroupAttributes page?
Just place the markup in a GroupHeader or a GroupFooter. --Petko
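For example, adding the following line to SomeGroup.GroupHeader (SomeGroup being a placeholder for the actual group name) applies it to every page in that group:
(:robots noindex,nofollow:)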
Talk page for the ControllingWebRobots recipe (users).