Advantages of a smart robots.txt file

A loyal reader of my pamphlets asked me:

I foresee many new capabilities with robots.txt in the future due to this [Google’s robots.txt experiments]. However, how the hell can a webmaster hide their robots.txt from the public while serving it up to bots without doing anything shady?

That’s a great question. On this blog I have a static robots.txt, so I’ve set up a dynamic example using code snippets from other sites: this robots.txt is what a user sees, and here is what various crawlers get when they request my example robots.txt. Of course crawlers don’t request a robots.txt file with a query string identifying themselves (/robots.txt?crawlerName=*) as in the preview links above, so it seems you’ll need a pretty smart robots.txt file.

Before I tell you how to smarten up a robots.txt file, let’s define the anatomy of a somewhat intelligent robots.txt script:

  • It exists. It’s not empty. I’m not kidding.
  • A smart robots.txt detects and verifies crawlers in order to serve customized REP statements to each spider. Customized means a section for the requesting search engine’s crawler, plus the general crawler directives. Example:
    User-agent: Googlebot-Image
    Disallow: /
    Allow: /cuties/*.jpg$
    Allow: /hunks/*.gif$
    Allow: /sitemap*.xml$
    Sitemap: http://example.com/sitemap-images.xml
     
    User-agent: *
    Disallow: /cgi-bin/

    This avoids confusion, because complex static robots.txt files with a section for every crawler out there (plus a general section for other Web robots) are fault-prone, and might exceed the maximum file size some bots can handle. If you fuck up a single statement in a huge set of instructions, that may kill the process parsing your robots.txt, which results in no crawling at all, or possibly in crawling of forbidden areas. Checking the syntax per engine with a lean robots.txt is way easier (supported robots.txt syntax: Google, Yahoo, Ask and MSN/LiveSearch - don’t use wildcards with MSN because they don’t really support them; at MSN, wildcards are valid for matching file types only).
  • A smart robots.txt reports all crawler requests. This helps you track what happens when you change something. Please note that there’s a lag between the most recent request of robots.txt and the moment a search engine starts to obey it, because all engines cache your robots.txt.
  • A smart robots.txt helps identify unknown Web robots, at least those that bother to request it (ask Bill how to fondle rogue bots). From a log of suspect requests of your robots.txt you can decide whether particular crawlers need special instructions or not.
  • A smart robots.txt helps maintain your crawler IP list.

Here is my step-by-step “how to create a smart robots.txt” guide. As always: if you suffer from IIS/ASP go search for reliable hosting (*ix/Apache).

In order to make robots.txt a script, tell your server to parse .txt files as PHP. (If you serve .txt files other than robots.txt, please note that you must add <?php ?> as the first line to all .txt files on your server!) Add this line to your root’s .htaccess file:
AddType application/x-httpd-php .txt
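If you’d rather not run every .txt file on the server through the PHP parser, you can scope this to robots.txt alone. A minimal sketch, assuming Apache with mod_php (adjust the handler type if your PHP setup differs):
<Files "robots.txt">
SetHandler application/x-httpd-php
</Files>

With that in place only robots.txt gets parsed as PHP, and all other .txt files are served as before.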

Next grab the PHP code for crawler detection from this post. In addition to the functions checkCrawlerUA() and checkCrawlerIP() you need a function that delivers the right user agent name, so please welcome getCrawlerName() in your PHP portfolio:

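Here is a minimal sketch of what it boils down to; the user agent list below is illustrative, not exhaustive, so extend it with the crawlers you care about:
<?php
// Maps the requesting user agent to the crawler name used for its
// robots.txt section. List the more specific names first, so that
// for example "Googlebot-Image" wins over "Googlebot".
function getCrawlerName() {
    $userAgent = $_SERVER["HTTP_USER_AGENT"];
    $crawlerNames = array("Googlebot-Image", "Googlebot-Mobile",
                          "Mediapartners-Google", "Googlebot",
                          "Slurp", "msnbot", "Teoma");
    foreach ($crawlerNames as $crawlerName) {
        if (stristr($userAgent, $crawlerName)) {
            return $crawlerName;
        }
    }
    return "*";
}
?>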

(If your instructions for Googlebot, Googlebot-Mobile and Googlebot-Image are identical, you can put them in one single “Googlebot” section.)

And here is the PHP script “/robots.txt”. Include the general stuff like functions, shared (global) variables and whatnot.
<?php
@require($_SERVER["DOCUMENT_ROOT"] ."/code/generalstuff.php");

Once robots.txt is parsed as PHP, your Web server sends it as text/html by default, which isn’t suitable for a plain text file, hence instruct it properly.
@header("Content-Type: text/plain");
@header("Pragma: no-cache");
@header("Expires: 0");

If a search engine runs wild requesting your robots.txt too often, comment out the “no-cache” and “expires” headers.

Next check whether the requestor is a verifiable search engine crawler: do a reverse DNS lookup of the requesting IP to get the host name, then a forward lookup to make sure that host name resolves back to the same IP.
$isSpider = checkCrawlerIP($requestUri);
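checkCrawlerIP() comes from the crawler detection post mentioned above; in case you just want the gist, here is a minimal sketch of the reverse/forward DNS verification such a function performs. The host name suffixes are illustrative, and I’m assuming the requesting IP is taken from $_SERVER while the $requestUri parameter is merely passed along:
<?php
// Minimal sketch: verify the requestor via reverse DNS, check the host name
// against the engines' crawler domains, then confirm with a forward lookup.
function checkCrawlerIP($requestUri = "") {
    $ip = $_SERVER["REMOTE_ADDR"];
    $host = gethostbyaddr($ip);
    if (!$host || $host == $ip) {
        return false; // no usable PTR record
    }
    // Illustrative list of crawler host name suffixes
    $crawlerHosts = array(".googlebot.com", ".google.com",
                          ".crawl.yahoo.net", ".search.msn.com", ".ask.com");
    $hostOk = false;
    foreach ($crawlerHosts as $suffix) {
        if (substr($host, -strlen($suffix)) == $suffix) {
            $hostOk = true;
            break;
        }
    }
    if (!$hostOk) {
        return false;
    }
    // The forward lookup must resolve back to the requesting IP
    $forwardIps = gethostbynamel($host);
    return is_array($forwardIps) && in_array($ip, $forwardIps);
}
?>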

Depending on $isSpider, log the request either in a crawler log or in an access log gathering suspect requests of robots.txt. You can store both in a database table, or in a flat file if you operate a tiny site. (Write the logging function yourself; a minimal sketch follows further down.)
$standardStatement = "User-agent: *\nDisallow: /cgi-bin/\n\n";
print $standardStatement;
if ($isSpider) {
    $lOk = writeRequestLog("crawler");
    $crawlerName = getCrawlerName();
}
else {
    $lOk = writeRequestLog("suspect");
    exit;
}

If the requestor is not a search engine crawler you can verify, it gets nothing but the standard statement and the script quits after logging the request. Otherwise call getCrawlerName() to name the section for the requesting crawler.
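Since the logging function is yours to write, here is a minimal flat file sketch to stick into generalstuff.php; the log path below is an assumption, so point it at a directory that isn’t publicly accessible (or write to a database table instead):
<?php
// Minimal sketch: append one tab delimited line per robots.txt request.
// $logType is "crawler" or "suspect".
function writeRequestLog($logType) {
    $logFile = $_SERVER["DOCUMENT_ROOT"] . "/../logs/robots-" . $logType . ".log";
    $logLine = date("Y-m-d H:i:s") . "\t"
             . $_SERVER["REMOTE_ADDR"] . "\t"
             . gethostbyaddr($_SERVER["REMOTE_ADDR"]) . "\t"
             . $_SERVER["HTTP_USER_AGENT"] . "\n";
    return (bool) @file_put_contents($logFile, $logLine, FILE_APPEND);
}
?>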

Now you can output individual crawler directives for each search engine, or rather for its specialized crawlers.
$prnUserAgent = "User-agent: ";
$prnContent = "";
if ($crawlerName == "Googlebot-Image") {
    $prnContent .= $prnUserAgent . $crawlerName . "\n";
    $prnContent .= "Disallow: /\n";
    $prnContent .= "Allow: /cuties/*.jpg$\n";
    $prnContent .= "Allow: /hunks/*.gif$\n";
    $prnContent .= "Allow: /sitemap*.xml$\n";
    $prnContent .= "Sitemap: http://example.com/sitemap-images.xml\n\n";
}
if ($crawlerName == "Mediapartners-Google") {
    $prnContent .= $prnUserAgent . $crawlerName . "\nDisallow:\n\n";
}

print $prnContent;
?>

Say the requesting user agent is Googlebot-Image; the code above will output this robots.txt:
User-agent: *
Disallow: /cgi-bin/
 
User-agent: Googlebot-Image
Disallow: /
Allow: /cuties/*.jpg$
Allow: /hunks/*.gif$
Allow: /sitemap*.xml$
Sitemap: http://example.com/sitemap-images.xml

(Please note that crawler sections must be delimited by an empty line, and that if there’s a section for a particular crawler, this spider will ignore the general directives. Please consider reading more pamphlets discussing robots.txt and dull stuff like that.)

That’s it. Adapt. Enjoy.




21 Comments to "Advantages of a smart robots.txt file"

  1. Marios Alexandrou on 26 November, 2007  #link

    Thanks for this clever idea. I don’t have anything in my robots.txt file that I want to hide, but knowing that other people might be hiding theirs means I’ll have to be a little more thorough when looking at the competition!

  2. g1smd on 26 November, 2007  #link

    Some general caveats with robots.txt usage (these caught me out once):

    There MUST be a blank line after the last line of the instructions in each section of the robots.txt file (that’s just before the next User-Agent declaration).

    If you have a section for User-Agent: Googlebot (for example), then Google will read ONLY that section of the robots.txt file. That is, Googlebot will completely IGNORE the User-Agent: * section of the file from then on.

    Hope that helps. :-)

  3. Sebastian on 26 November, 2007  #link

    Thanks Marios. You can link to a competitor’s robots.txt and then, when it’s crawled and indexed, just check the SE cache with a [cache:http://example.com/robots.txt] query for the real thing. Of course hardcore cloakers take care of that, meaning they’ll serve the crawlers a noarchive header, but in some cases you’ll be lucky.

  4. Sebastian on 26 November, 2007  #link

    Thanks for the heads-up g1smd. I’ve updated (and repeated) the Googlebot-Image example to make that clear (I was lazy because of WP formatting issues, but then remembered that &nbsp; in empty lines saves the code formatting with WordPress, allowing blank lines in code blocks). I wrote about the ignored general directives in a recent post, so I had skipped that. You’re right, it makes sense to mention it again and again.

  5. SearchCap: The Day In Search, November 26, 2007…

    Below is what happened in search today, as reported on Search Engine Land and from other places across the web…….

  6. Jeremy Luebke on 26 November, 2007  #link

    I know you’re not ignorant so please don’t say ignorant things.

    “Here is my step by step “how to create a smart robots.txt” guide. As always: if you suffer from IIS/ASP go search for reliable hosting (*ix/Apache).”

    Some people are smart enough to program using the IIS/.NET platform ;)

  7. Tony Adam on 26 November, 2007  #link

    Great post explaining how to put together a detailed robots.txt. I’ve del.icio.us’ed this in case I need it in the future!

  8. Sebastian on 27 November, 2007  #link

    Jeremy, I believe that a developer forced to work with IIS/ASP/.NET must be extremely smart, because there are so many pitfalls (case issues, lack of request_uri, canonicalization nightmares … just to name a few). It’s easier with Apache under Windows, though. However, from years of Webmaster/Web developer support I’ve learned that it’s better not to run Web sites under Windows. I know, sometimes it’s unavoidable, but IMHO it’s not recommendable in general.

  9. Jeremy Luebke on 27 November, 2007  #link

    Actually it’s getting easier. The only real issues with IIS have been SE friendly URLs and redirects, along with documentation for SEO using IIS & .NET. That’s changing, as there are some great rewrite packages out now for free or a fee. There is also an amazing CMS framework in Umbraco that’s SE friendly out of the box.

    That said, Windows hosting/programming is suited more for enterprise-level situations. Virtual hosting just doesn’t work; you need at least a VPS or a dedicated server to get started, so the barrier to entry is much higher, and that’s the reason most small webmasters start out with *nix/Apache.

  10. Sebastian on 27 November, 2007  #link

    I’ve clients with Windows boxes too; you don’t try to convince them to switch to *ix when some app servers, databases and stuff like that can’t run on Unix boxes, or when their staff isn’t capable of operating non-Windows stuff. ;)

    I don’t agree that *ix/Apache is for entry level only. I know a few pretty decent large scale systems running under Apache on Linux or BSD servers.

  11. Andy on 28 November, 2007  #link

    Interesting post. There’s plenty to think about there. I’ve bookmarked it on del.icio.us for future reference.

  12. Utah SEO Pro on 2 December, 2007  #link

    Sebastian, I appreciate you addressing my question. Thank you for this post. I’m going to try to implement this strategy tonight.

  13. Zak Nicola on 11 December, 2007  #link

    Thanks for all the great info, in the comments as well (g1smd).

    I’ve made some mistakes with my own robots.txt in the past that, had I read this before… well, no use in crying over spilled milk now.

    [Author link removed coz it points to an empty page]

  14. […] can test your Web server’s conditional GET support with your robots.txt, or, if even your robots.txt is a script, create a tiny HTML page with a text editor and upload it via FTP. Another neat tool to check HTTP […]

  15. […] robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files on request of a loyal […]

  16. Kelly on 25 August, 2008  #link

    Can’t you just get around this using the User Agent Switcher plug-in in Firefox?

  17. montoz on 21 October, 2008  #link

    I still didn’t get some of the advantages of robots :D

  18. Sebastian on 21 October, 2008  #link

    Lovebird, robots are sexy. Especially well coded Web robots.

  19. jonathan on 4 August, 2009  #link

    can someone help me with this please?

  20. Get yourself a smart robots.txt on 25 February, 2010  #link

    […] pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines […]

  21. Ben on 15 November, 2010  #link

    Just thought I would make the comment that if you’re on a shared hosting infrastructure, or even if you just want to simplify things a little, you can use Apache’s mod_rewrite to make robots.txt dynamic. In the following example /robots.txt loads the file /robots.txt.php, which dynamically generates it, and the PHP gets parsed without having to add a handler for .txt.

    Put this in .htaccess:

    RewriteEngine on
    RewriteRule ^robots\.txt$ /robots.txt.php [L]
