00563: Don't show ?action= links to web spiders/robots

Summary: Don't show ?action= links to web spiders/robots
Created: 2005-10-20 00:31
Status: Closed - added to 2.1.beta8
Category: CoreCandidate
From: Pm
Assigned:
Priority: 55433
Version: 2.0.12
OS:

Description: Currently, when a web spider such as Googlebot or Yahoo! Slurp visits a PmWiki site, it tends to follow all of the ?action= links on a page (including ?action=edit, ?action=diff, etc.).

PmWiki's default configuration provides a <meta> tag that tells robots not to index pages when ?action= is specified in the link. However, by the time the robot sees this tag, the server has already incurred the expense of generating the page and sending it to the robot.

Pm proposes a module that detects when a robot is retrieving a page, and strips all "?action=" parameters from page links within the page. This prevents robots from seeing the ?action= links in the first place, reducing server overhead and bandwidth.
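
As a rough illustration of the proposal, here is a minimal sketch in Python rather than PmWiki's PHP. It is not the module Pm describes: the names ROBOT_PATTERN, strip_action and cloak_links are hypothetical, the robot pattern is a short placeholder, and the link rewriting assumes double-quoted href attributes.

```python
import html
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical, deliberately short robot pattern (an assumption for this sketch).
ROBOT_PATTERN = re.compile(r"googlebot|slurp|msnbot", re.IGNORECASE)

# Assumes links are written as href="..." with double quotes.
HREF = re.compile(r'href="([^"]*)"')


def is_robot(user_agent):
    """Return True if the User-Agent string matches the robot pattern."""
    return bool(ROBOT_PATTERN.search(user_agent or ""))


def strip_action(url):
    """Return url with any action= query parameter removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "action"]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def cloak_links(page_html, user_agent):
    """Strip action= parameters from links, but only when a robot is asking."""
    if not is_robot(user_agent):
        return page_html

    def rewrite(match):
        # href values are HTML-escaped (&amp;), so unescape before parsing.
        url = html.unescape(match.group(1))
        return 'href="%s"' % html.escape(strip_action(url), quote=True)

    return HREF.sub(rewrite, page_html)
```

With this sketch, a robot requesting a page that contains a link to /wiki/Main/HomePage?action=edit would see a plain link to /wiki/Main/HomePage, while an ordinary browser receives the page unchanged. Note that this only hides the links; it does not stop a robot that already knows an ?action= URL from requesting it.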

Comments?

  1. Stripping the action when a robot arrives is a good idea. I add rel=nofollow to the edit link, which works for Google, my best friend.
  2. Some actions should be excluded from stripping:
    1. browse ( :) )
    2. rss
    3. dc
  3. Some aftercare is needed as well: what to do when a robot arrives using a ?action= link? Most indexes already have a lot of these links in their databases, or a robot may arrive via an external page (such as site stats). On my site I return a 401 (Unauthorized) whenever a robot arrives with such a link.
  4. List of robots I catch at the moment:
    • slurp
    • googlebot
    • mediapartners
    • xenu
    • grub
    • ingrid
    • baiduspider
    • metaweb
    • nutch
    • aipbot
    • societyrobot
    • teoma
    • zoekybot
    • gigabot
    • yahoo
    • vagabondo
    • msnbot
    • mirago
    • omni
    • zyborg
    • (and a bunch in robots.txt and .htaccess)

good luck BrBrBr
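
The aftercare described in the comments above could be sketched as follows, again in Python rather than PmWiki's PHP. The names ROBOT_NAMES, ALLOWED_ROBOT_ACTIONS and robot_status are hypothetical; the robot names, the allowed actions (browse, rss, dc) and the 401 status are taken from the comment, and matching them as case-insensitive substrings of the User-Agent header is an assumption.

```python
import re

# Robot names from the list above, matched as case-insensitive substrings
# of the User-Agent header (substring matching is an assumption).
ROBOT_NAMES = [
    "slurp", "googlebot", "mediapartners", "xenu", "grub", "ingrid",
    "baiduspider", "metaweb", "nutch", "aipbot", "societyrobot", "teoma",
    "zoekybot", "gigabot", "yahoo", "vagabondo", "msnbot", "mirago",
    "omni", "zyborg",
]
ROBOT_PATTERN = re.compile("|".join(map(re.escape, ROBOT_NAMES)), re.IGNORECASE)

# Actions that robots may still request, per comment 2.
ALLOWED_ROBOT_ACTIONS = {"browse", "rss", "dc"}


def robot_status(user_agent, action="browse", denied_status=401):
    """Return the HTTP status to send, decided before any page is generated.

    denied_status defaults to 401 as in the comment; a site could pass
    another status (for example 403) if it prefers.
    """
    is_robot = bool(ROBOT_PATTERN.search(user_agent or ""))
    if is_robot and action not in ALLOWED_ROBOT_ACTIONS:
        return denied_status
    return 200
```

Because the check needs only the User-Agent header and the requested action, it can run before the page is rendered, which is where the savings in server load would come from. Short substrings such as "omni" may also match non-robot agents, so the list would need care in practice.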
