Get yourself a smart robots.txt
Crawlers and other Web robots are the plague of today’s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don’t. Behaving or not, most bots just steal your content. You don’t appreciate that, so block them.
This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines -that bring you nice traffic- while keeping the nasty (or useless, traffic-wise) bots out of the game.
The basic idea is that blocking all bots -with very few exceptions- makes more sense than maintaining a who’s who of Web robots in your robots.txt file. You decide whether a bot (or rather the service it crawls for) does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold Disallow: /.
Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.
Even bots that respect the REP can harm your search engine rankings, or reveal sensitive information to your competitors. Every minute a new bot turns up. You can’t manage all of them, and you can’t trust any (behaving) bot. Or, as the master of bot control explains: “That’s the only thing I’m concerned with: what do I get in return. If it’s nothing, it’s blocked.”
Also, large robots.txt files handling tons of bots are fault-prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If, on the other hand, you verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don’t enlighten your competitors.
If you’re a smart webmaster agreeing with this approach, here’s your ToDo-List:
• Grab the code
• Install
• Customize
• Test
• Implement.
On error read further.
The anatomy of a smart robots.txt
Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.
Install
First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:
<FilesMatch ^robots\.txt$>
SetHandler application/x-httpd-php
</FilesMatch>
As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.
Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!
For testing purposes you can use the logRequest() function. Probably it’s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don’t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.
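Here’s what a minimal flat-file logRequest() might look like. Mind you, only the function name comes from the script; the signature and the log format are my assumptions, so adapt them to whatever you want to analyze:

function logRequest($ip, $userAgent, $responseCode) {
    // Hypothetical flat-file logger for testing only; on a production
    // system write to a database table instead.
    $line = date('Y-m-d H:i:s') . "\t" . $ip . "\t" . $userAgent
          . "\t" . $responseCode . "\n";
    // Requires /smart_robots_log.txt to be writable (hence the CHMOD 0777).
    file_put_contents($_SERVER['DOCUMENT_ROOT'] . '/smart_robots_log.txt',
        $line, FILE_APPEND | LOCK_EX);
}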
Customize
Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write an isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt (or /robots.txt once you’ve launched your smart robots.txt).
Let’s look at /smart_robots.txt. First of all, it sets the canonical server name, change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!) it includes /smart_robots_inc.php.
Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.
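In PHP that boils down to a few header() calls. Here’s a sketch: the X-Robots-Tag and Content-Type values match the sample output further down, the cache-busting header is my assumption:

header('X-Robots-Tag: noindex, noarchive, nosnippet'); // no indexed/cached copy on SERPs
header('Cache-Control: no-cache, no-store');           // assumption: discourage caching
header('Content-Type: text/plain; charset=iso-8859-1'); // robots.txt is plain text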
As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can log in to GWC (Google Webmaster Central) and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.
When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.
Next up are the search engine specific crawler directives. You output them like so:
if (isGooglebot()) {
$content .= "
User-agent: Googlebot
Disallow:
…
\n\n";
}
If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)
Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output
$content .= "User-agent: *
Disallow: /
\n\n";
Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.
Before $content is sent to the user agent, rogue bots receive their well-deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte -for example leading whitespaces, a debug message, or a #comment- before the print $content; statement.
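As a sketch, the flow looks like this (isRogueBot() is a stand-in name for whatever verification you’ve wired up, not necessarily a function in the script):

// Spoofed crawler user agent that doesn't verify? 403 and bail out
// before a single byte of $content leaves the server.
if (isRogueBot()) {
    header('HTTP/1.1 403 Forbidden');
    blockIp($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT']); // never again
    logRequest($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'], 403);
    exit;
}
print $content;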
Blocking rogue bots is important. If you discover a rogue bot -for example a scraper that pretends to be Googlebot- during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!
Bear in mind that each and every piece of content served from your site should implement rogue bot detection; that’s doable even with non-HTML resources like images or PDFs (one possible approach is sketched below).
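One possible approach (my assumption, not part of the downloadable script): route static files through a small PHP wrapper via mod_rewrite and reuse the bot checks there. In .htaccess:

RewriteEngine On
RewriteRule ^downloads/(.+\.pdf)$ /serve_static.php?file=$1 [L]

And a bare-bones /serve_static.php (a hypothetical file, hardened for nothing but the happy path):

<?php
require $_SERVER['DOCUMENT_ROOT'] . '/smart_robots_inc.php';
if (isRogueBot()) {               // same stand-in check as above
    header('HTTP/1.1 403 Forbidden');
    exit;
}
$file = basename($_GET['file']);  // crude sanitization, no path tricks
header('Content-Type: application/pdf');
readfile($_SERVER['DOCUMENT_ROOT'] . '/downloads/' . $file);
?>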
Finally we deliver the user agent specific robots.txt and terminate the connection.
Now let’s look at /smart_robots_inc.php. Don’t fuck up the variable definitions, nor the routines that populate them or deal with the requestor’s IP addy.
Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).
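A blockIp() that’s more than a dummy might look like this (a sketch assuming a MySQL table named blocked_bots and a PDO connection in $pdo; schema and names are mine):

function blockIp($ip, $userAgent = '') {
    // Remember this IP/UA pair so every script on the site can refuse
    // it content. $pdo is a PDO connection set up elsewhere (assumption).
    global $pdo;
    $stmt = $pdo->prepare('INSERT INTO blocked_bots (ip, user_agent, blocked_at)
                           VALUES (?, ?, NOW())');
    $stmt->execute(array($ip, $userAgent));
}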
/smart_robots_inc.php comes with functions that detect Googlebot, MSNbot, and Slurp.
Most search engines tell you how to verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example, to add Yandex, test the host name for a leading “spider” and trailing “.yandex.ru” string with an integer in between, like in the isSlurp() function.
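Such an isYandex() could look like this (a sketch following the pattern just described; double-check the host name scheme against Yandex’s own documentation):

function isYandex() {
    // Reverse DNS of the requestor, then forward-confirm the result so
    // that a faked PTR record can't fool us.
    $ip = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);
    // leading "spider", an integer, trailing ".yandex.ru"
    if (!preg_match('/^spider\d+\.yandex\.ru$/i', $host)) {
        return FALSE;
    }
    // the forward lookup must point back to the requesting IP
    return (gethostbyname($host) == $ip);
}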
Test
Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits; you can use the “Fetch as Googlebot” thingy in your webmaster console.
Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “.search.live.com” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “inktomisearch.com” host name.
Don’t rely on my crawler detection routines. They’re dumped from memory in a hurry; I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.
Launch
Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.
The output of a smart robots.txt
When you download a smart robots.txt with your browser, wget, or any other tool that comes with user agent spoofing, you’ll see a 403 or something like:
HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: *
Disallow: /
(the contact form URI above doesn’t exist)
whilst a real search engine crawler like Googlebot gets slightly different contents:
HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: Googlebot
Allow: /
Disallow:

Sitemap: http://sebastians-pamphlets.com/sitemap.xml
User-agent: *
Disallow: /
Search engines hide important information from webmasters
Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked, Google didn’t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. Webmasters do need such information for proper content delivery based on reliable user agent detection.
I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even announce properly that they’d changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.
Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.
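For a quick overview, a throwaway script like this one (assuming Apache’s combined log format and a log path you’d have to adjust) lists the distinct Googlebot-* user agent strings in your access log:

<?php
$agents = array();
foreach (file('/var/log/apache2/access.log') as $line) {
    // in the combined log format the user agent is the last quoted field
    if (preg_match('/"([^"]*Googlebot-[^"]*)"$/', trim($line), $m)) {
        $agents[$m[1]] = TRUE;
    }
}
print implode("\n", array_keys($agents)) . "\n";
?>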
Enjoy.
Changelog:
2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.