[pmwiki-users] Google local site search

H. Fox haganfox at users.sourceforge.net
Thu Dec 29 21:17:56 CST 2005


[Hoo boy.  It's hard to keep up with you!  :-)]

On 12/29/05, Patrick R. Michaud <pmichaud at pobox.com> wrote:
> > I'm trying, perhaps a bit clumsily, to suggest that authors shouldn't
> > need to remember that because the nofollow attribute would be slipped
> > in automatically.
>
> This is *exactly* the point I've been making all along -- authors
> shouldn't have to worry about robots when writing.  However, the point
> that *I* seem to be having trouble getting across is that it's fairly
> difficult for PmWiki to be automatically adding rel='nofollow' to links.
> People keep saying "just add rel='nofollow'", and my point is that
> this isn't easy, and beyond that it doesn't catch them all.

Oh. Got it.
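Just to check my understanding: the "easy" version people imagine is a post-processing pass over the finished HTML, something like this sketch (Python purely for illustration -- PmWiki is PHP, and the function and regex here are my own invention, not anything from PmWiki). It catches ?action= links in the output, but it also shows the problem: anything a template or recipe emits in a form the pattern doesn't anticipate slips through.

```python
import re

def add_nofollow(html):
    # Add rel='nofollow' to any opening <a> tag whose href contains
    # '?action=', unless the tag already carries a rel attribute.
    # A rough sketch only -- it misses links emitted in other forms.
    def fix(match):
        tag = match.group(0)
        if 'rel=' in tag:
            return tag  # leave existing rel attributes alone
        return tag[:-1] + " rel='nofollow'>"
    return re.sub(r"<a\b[^>]*\?action=[^>]*>", fix, html)

print(add_nofollow("<a href='Main/HomePage?action=edit'>Edit</a>"))
```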

> > > Even if there are three links to ?action=edit
> > > with rel='nofollow' and one that omits it, then the link is likely to
> > > get followed.
> >
> > Then the robot will hit the <meta name='robots' ...> tag.
>
> ...which means that all of this is for naught, since the point
> of rel='nofollow' and/or cloaking in this context is to prevent the
> robot from hitting the webserver for things it won't index anyway.

It's all for naught *only if that happens* though.  A concerned
administrator should keep that from happening.

> > > Can you give me an example of this -- i.e.,  how the links are
> > > misleading, and how it won't "work" for robots that understand
> > > rel='nofollow'?
> >
> > 2) Omit the action and nofollow and you're saying either
> >
> >    "We morphed this "Edit" link for your bot.  You'll need to
> >    figure out how to deal with that.  We're not doing any
> >    monkey business here.  Really!"
>
> I don't see that Google has to "deal with that" beyond what they're
> doing now.

Here's what I mean:

The robot does *see* -- and probably index -- the link, even though it
doesn't follow it.  IOW it takes note of the link's URL and link text,
even if it ignores the link's target page.

I think a nofollow-aware robot should see the same link I'd see in
Firefox if that's possible.  Two reasons not to morph unless
necessary:

1) I want the search engine's set of links to my pages to be "as
pertinent as possible".  That is, "dilution" is averted, since the
?action=foo URLs would otherwise count as separate pages.

2) Perhaps more importantly, it wouldn't be out of the question to
assume that search engines might punish sites that do morphing when
they see a robot, all else being equal.

>  Google will simply see a page that has several self-referencing
> links -- surely we know by now that this doesn't present a huge
> problem for Google.

Now we're trying to decide whether it's a small problem or a
non-existent one.

>  (Yes, there is the possibility that morphed
> links dilute the distribution of Page Rank(TM) to other pages.)

That's what we're trying to avoid.

> > Put another way:  Why morph the page if you don't need to?  The extra
> > information you are omitting is potentially significant to the search
> > engine indexing algorithm, even if the significance may be minor.
>
> 1. Convince me that I don't need to morph the page for robots
>    other than Googlebot.

No, I do agree you should morph pages for robots that don't honor
rel='nofollow'.

> > * Links that robots shouldn't follow should automagically get a
> > rel='nofollow' attribute.
>
> 2. Tell me an easy way to automatically locate and automatically add
>    rel='nofollow' to all of the appropriate links in a page (which
>    could be coming from any of templates, markup, and recipes).

So that will wind up being the responsibility of the template author,
wiki author, and recipe author respectively.  That's not unreasonable
to ask.

> > * Robots that disregard the nofollow attribute -- and only those
> > robots -- should get the stealth treatment.
>
> 3.  Tell me which robots (besides Googlebot) understand rel='nofollow'
>     to mean "don't spider this link".

That one's over my head, although there must be some resource out
there for that.

I'll guess that by far the robot most people are concerned about is Google's.

> But ultimately, why do we need to suppress the stealth treatment
> for robots that understand rel='nofollow'?  I think the two
> are actually orthogonal to each other.  Let's assume that
> a skin and/or Site.PageActions properly add rel='nofollow' to links
> (as they do now, as of beta17).  If that's the case, then cloaking
> '?action=' parameters from the link shouldn't make any difference at
> all, since Google has already said that rel='nofollow' means that
> the link doesn't count when computing search results.

Again, I think it makes a difference because
  (1) the link and its text are seen, even if it's not followed, and
  (2) it's better not to morph the page, lest there be
      a "Guilty Until Proven Innocent" policy about morphing.

> So, I think the current implementation has everything you're asking
> for already (with the exception of automatically adding rel='nofollow'
> to links in the markup, which I've already explained is a little
> challenging).

Almost.

> * For a site that wants to control robots using rel='nofollow' and
>   never cloaking any actions, use a skin that has rel='nofollow'
>   attributes in the correct places (including Site.PageActions if
>   appropriate).  Any ?action= links in the markup will still be
>   given without rel='nofollow', but these ought to be relatively
>   rare and/or easily adjusted with %rel=nofollow%.  (And ultimately
>   we're just trying to optimize robot access -- it doesn't really hurt
>   if the robot follows the link.)

Poifect.
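In skin-template terms that just means every action link comes out with the attribute already attached, e.g. (an illustrative fragment I made up, not copied from any real skin):

```html
<!-- illustrative skin fragment: each ?action= link carries rel='nofollow' -->
<a href='$PageUrl?action=edit' rel='nofollow'>Edit</a>
<a href='$PageUrl?action=diff' rel='nofollow'>History</a>
```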

> * For a site that wants to control robots by cloaking actions,
>   set $EnableRobotCloakActions=1; .

Right-O.

> * Doing both of the above should work out well for Google, since
>   nearly all links with cloaked actions will also have rel='nofollow'
>   attached to them (meaning Google ignores them anyway).

Here we disagree because I think they're not completely ignored
("they" are the URLs themselves, not the target pages) and the
cloaking may not go unnoticed, and could wind up being detrimental.

> * If someone wants to use action cloaking for robots other than Google,
>   simply use:
>
>     $EnableRobotCloakActions =
>       !preg_match('/Googlebot/', @$_SERVER['HTTP_USER_AGENT']);
>
>   This enables action cloaking for robots other than Googlebot.

I'll do that if necessary, but I think it should be the default.
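In other words, I'd make Pm's snippet the shipped default, something like (a sketch of the idea only, not a tested patch):

```
## Hypothetical default: cloak ?action= links for every robot
## except Googlebot, which already honors rel='nofollow'.
$EnableRobotCloakActions =
  !preg_match('/Googlebot/', @$_SERVER['HTTP_USER_AGENT']);
```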

Thanks for the detailed explanations.

Hagan

> Pm
>



