[pmwiki-users] Google local site search

Thu Dec 29 18:38:34 CST 2005

On Thu, Dec 29, 2005 at 04:13:02PM -0700, H. Fox wrote:
> On 12/29/05, Patrick R. Michaud <pmichaud at pobox.com> wrote:
> > Ummm... search indexes normally index the contents of the page,
> > not the contents of links to the page.  Or am I mistaken here?
> 
> [I seem to write about SEO here frequently, despite my lack of
> professional expertise on the subject, but here goes...]
> 
> I think the link text is taken into account when the search engine
> indexes a page's content.  

After some more research: you're correct, the link text does have
some bearing on the search results.  (I'm not entirely sure it matters 
for this discussion, but I just wanted to confirm your statement.)

> > > In that case, why cloak for googlebot?  Why not keep the ?action=
> > > parameters intact and use the rel='nofollow' attribute for bots that
> > > understand it?
> >
> > Because there may be links where an author forgets the nofollow.
> 
> If that were the case it would be a small price to pay.  (Philosophy
> #3. Avoid gratuitous features)
> 
> I'm trying, perhaps a bit clumsily, to suggest that authors shouldn't
> need to remember that because the nofollow attribute would be slipped
> in automatically.

This is *exactly* the point I've been making all along -- authors 
shouldn't have to worry about robots when writing.  However, the point
that *I* seem to be having trouble getting across is that it's fairly
difficult for PmWiki to automatically add rel='nofollow' to links.
People keep saying "just add rel='nofollow'", and my point is that
this isn't easy, and on top of that it won't catch every link.

> > > > And even if the Skins include rel='nofollow' in the templates,
> > > > what about markup...?
> > > >
> > > >     [[OtherPage]]
> > > >     [[OtherPage?action=edit]]
> > > >     [[OtherPage?action=dc]]
> > >
> > Er, wrong.  ?action=dc (like ?action=rss) should be followable
> > by robots.
> 
> Then only the middle one.  (You tricked me there!)
> My point remains the same: Slip a rel='nofollow' attribute into links
> that robots shouldn't follow.
> ...
> Besides, I'm suggesting that such links get the rel='nofollow'
> attribute without doing anything.  Sorry if that wasn't clear.

I wasn't trying to trick you -- I was making the point that 
automatically adding rel='nofollow' isn't entirely trivial!
The big issue is that rel='nofollow' requires modifying the <a ...>
tags, which come from a variety of sources and can happen in
many places.  Suppressing an ?action= parameter after $ScriptUrl
is relatively easy, since it's always centralized in the FmtPageName()
function.

> > Even if there are three links to ?action=edit
> > with rel='nofollow' and one that omits it, then the link is likely to
> > get followed.
> 
> Then the robot will hit the <meta name='robots' ...> tag.

...which means that all of this is for naught, since the point
of rel='nofollow' and/or cloaking in this context is to prevent the 
robot from hitting the webserver for things it won't index anyway.

> > Can you give me an example of this -- i.e.,  how the links are
> > misleading, and how it won't "work" for robots that understand
> > rel='nofollow'?
> 
> 2) Omit the action and nofollow and you're saying either
> 
>    "We morphed this "Edit" link for your bot.  You'll need to
>    figure out how to deal with that.  We're not doing any
>    monkey business here.  Really!"

I don't see that Google has to "deal with that" beyond what they're
doing now.  Google will simply see a page that has several self-referencing
links -- surely we know by now that this doesn't present a huge
problem for Google.  (Yes, there is the possibility that morphed
links dilute the distribution of Page Rank(TM) to other pages.)

> Put another way:  Why morph the page if you don't need to?  The extra
> information you are omitting is potentially significant to the search
> engine indexing algorithm, even if the significance may be minor. 

1. Convince me that I don't need to morph the page for robots
   other than Googlebot.  

> * Links that robots shouldn't follow should automagically get a
> rel='nofollow' attribute.

2. Tell me an easy way to automatically locate and automatically add
   rel='nofollow' to all of the appropriate links in a page (which
   could be coming from any of templates, markup, and recipes).  
   While it's easy for FmtPageName to do the equivalent of

      preg_replace('/(\$ScriptUrl.*?)\?action=/', '$1', ...)

   to remove an unwanted ?action= link for a robot, it's *much* harder 
   to find only the appropriate <a ...> tags and add rel='nofollow' to 
   them.
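
   To make the contrast concrete, here is a rough sketch.  Both
   function names are hypothetical and neither is PmWiki's actual
   code.  The first strips "?action=name" from a template string that
   still contains the literal text $ScriptUrl (unlike the one-liner
   above, it also drops the action name).  The second is a naive
   attempt at adding rel='nofollow', and shows two of the ways it
   goes wrong.

```php
<?php
// Hypothetical helper: remove "?action=name" from any URL that starts
// with the literal text $ScriptUrl in a template string.
function cloakActions(string $fmt): string {
    return preg_replace('/(\$ScriptUrl[^\s"\'<>]*?)\?action=\w+/', '$1', $fmt);
}

// Hypothetical helper: naively add rel='nofollow' to every <a ...> tag
// whose attributes contain "?action=".
function addNofollowNaive(string $html): string {
    return preg_replace('/<a\b([^>]*\?action=[^>]*)>/i', "<a rel='nofollow'$1>", $html);
}

// Stripping the action is a single substitution on the template.
echo cloakActions("<a href='\$ScriptUrl/Main/HomePage?action=edit'>Edit</a>"), "\n";

// Problem 1: a tag that already has rel='nofollow' gets a second one.
echo addNofollowNaive("<a rel='nofollow' href='/x?action=edit'>Edit</a>"), "\n";

// Problem 2: ?action=rss should stay followable, but is tagged anyway.
echo addNofollowNaive("<a href='/x?action=rss'>RSS</a>"), "\n";
```

   Fixing those cases means parsing attributes and knowing which
   actions are robot-safe, and it would have to run on output from
   templates, markup, and recipes alike.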

> * Robots that disregard the nofollow attribute -- and only those
> robots -- should get the stealth treatment.

3.  Tell me which robots (besides Googlebot) understand rel='nofollow'
    to mean "don't spider this link".

But ultimately, why do we need to suppress the stealth treatment
for robots that understand rel='nofollow'?  I think the two
are actually orthogonal to each other.  Let's assume that
a skin and/or Site.PageActions properly add rel='nofollow' to links
(as they do now, as of beta17).  If that's the case, then cloaking
'?action=' parameters from the link shouldn't make any difference at 
all, since Google has already said that rel='nofollow' means that 
the link doesn't count when computing search results.

This would then mean that the only time Google encounters cloaked
links that have weight is when a skin omits rel='nofollow' attributes 
from its links or when authors create ?action= links directly in
page markup.  I don't think PmWiki should be second-guessing
skin template creators by automatically adding rel='nofollow' to 
their <a ...> tags, and authors create ?action= links directly only
rarely (so cloaking these links doesn't distort the results much at
all, and allowing the robot to follow them isn't a huge penalty 
either).

So, I think the current implementation has everything you're asking
for already (with the exception of automatically adding rel='nofollow'
to links in the markup, which I've already explained is a little
challenging).

* For a site that wants to control robots using rel='nofollow' and
  never cloaking any actions, use a skin that has rel='nofollow' 
  attributes in the correct places (including Site.PageActions if 
  appropriate).  Any ?action= links in the markup will still be
  given without rel='nofollow', but these ought to be relatively
  rare and/or easily adjusted with %rel=nofollow%.  (And ultimately
  we're just trying to optimize robot access -- it doesn't really hurt
  if the robot follows the link.)

* For a site that wants to control robots by cloaking actions, 
  set $EnableRobotCloakActions=1; .

* Doing both of the above should work out well for Google, since
  nearly all links with cloaked actions will also have rel='nofollow'
  attached to them (meaning Google ignores them anyway).

* If someone wants to use action cloaking for robots other than Google,
  simply use:

    $EnableRobotCloakActions = 
      !preg_match('/Googlebot/', @$_SERVER['HTTP_USER_AGENT']);

  This enables action cloaking for robots other than Googlebot.
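
  As a quick check of how that expression evaluates, here is a small
  sketch that can be run outside of PmWiki.  The wrapper function name
  is hypothetical, and the two User-Agent strings are just samples.

```php
<?php
// Hypothetical wrapper around the expression above, so it can be tried
// against sample User-Agent strings without a web request.
function cloakActionsFor(string $userAgent): bool {
    // preg_match() returns 1 on a match and 0 otherwise; "!" turns
    // that into false for Googlebot and true for everything else.
    return !preg_match('/Googlebot/', $userAgent);
}

$google = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$msn    = 'msnbot/1.0 (+http://search.msn.com/msnbot.htm)';

var_dump(cloakActionsFor($google));  // bool(false)
var_dump(cloakActionsFor($msn));     // bool(true)
```

  Note that the expression is also true for ordinary browsers; this
  assumes, as described above, that the setting only has an effect on
  visitors already identified as robots.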

Pm