Archived posts from the 'Web development' Category

Save bandwidth costs: Dynamic pages can support If-Modified-Since too

Conditional HTTP GET requests make Webmasters and Crawlers happyWhen search engine crawlers burn way too much of your bandwidth, this post is for you. Crawlers sent out by major search engines (Google, Yahoo and MSN/Live Search) support conditional GETs, that means they don’t fetch your pages if those didn’t change since the last crawl.

Of course they must fetch your stuff over and over again for this comparision, if your Web server doesn’t play nice with Web robots, as well as with other user agents that can  cache your pages and other Web objects like images. The protocol your Web server and the requestors use to handle caching is quite simple, but its implementation can become tricky. Here is how it works:

1st request Feb/10/2008 12:00:00

Googlebot requests /some-page.php from your server. Since Google has just discovered your page, there are no unusual request headers, just a plain GET.

You create the page from a database record which was modified on Feb/09/2008 10:00:00. Your server sends Googlebot the full page (5k) with an HTTP header
Date: Sun, 10 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT

(lets assume your server is located in Greenwich, UK), the HTTP response code is 200 (OK).

Bandwidth used: 5 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

2nd request Feb/17/2008 12:00:00

Googlebot found interesting links pointing to your page, so it requests /some-page.php again to check for updates. Since Google already knows the resource, Googlebot requests it with an additional HTTP header
If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT

where the date and time is taken from the Last-Modified header you’ve sent in your response to the previous request.

You didn’t change the page’s record in the database, hence there’s no need to send the full page again. Your Web server sends Googlebot just an HTTP header
Date: Sun, 17 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT

The HTTP response code is 304 (Not Modified). (Your Web server can suppress the Last-Modified header, because the requestor has this timestamp already.)

Bandwidth used: Less than 500 bytes for the HTTP header.

3rd request Feb/24/2008 12:00:00

Googlebot can’t resist to recrawl /some-page.php, again using the
If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT

header.

You’ve updated the database on Feb/23/2008 09:00:00 adding a few paragraphs to the article, thus you send Googlebot the full page (now 7k) with this HTTP header
Date: Sun, 10 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 23 Feb 2008 09:00:00 GMT

and an HTTP response code 200 (OK).

Bandwidth used: 7 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

Further requests

Provided you don’t change the contents again, all further chats between Googlebot and your Web server regarding /some-page.php will burn less than 500 bytes of your bandwidth each. Say Googlebot requests this page weekly, that’s 370k saved bandwidth annually. You do the math. Even with a medium-sized Web site you most likely want to implement proper caching, right?

Not only Webmasters love conditional GET requests that save bandwidth costs and processing time, search engines aren’t keen on useless data transfers too. So lets see how you could respond efficiently to conditional GET requests from search engines. Apache handles caching of static files (e.g. .txt or .html files you upload with FTP) differently from dynamic contents (script outputs with or without a query string in the URI).

Static files

Fortunately, Apache comes with native support of the Last-Modified / If-Modified-Since / Not-Modified functionality. That means that crawlers and your Web server don’t produce too much network traffic when a requested static file  didn’t change since the last crawl.

You can test your Web server’s conditional GET support with your robots.txt, or, if even your robots.txt is a script, create a tiny HTML page with a text editor and upload it via FTP. Another neat tool to check HTTP headers is the Live Headers Extension for FireFox (bear in mind that testing crawler behavior with Web browsers is fault-prone by design).

If your second request of an unchanged static file results in a 200 HTTP response code, instead of a 304, call your hosting service. If it works and you’ve only static pages, then bookmark this article and move on.

Dynamic contents

Everything you output with server sided scripts is dynamic content by definition, regardless whether the URI has a query string or not. Even if you just read and print out a static file –that never changes– with PHP, Apache doesn’t add the Last-Modified header which forces crawlers to perform further requests with an If-Modified-Since header.

With dynamic content you can’t rely on Apache’s caching support, you must do it yourself.

The first step is figuring out where your CMS or eCommerce software hides the timestamps telling you the date and time of a page’s last modification. Usually a script pulls its stuff from different database tables, hence a page contains more than one area, or block, of dynamic contents. Every block might have a different last-modified timestamp, but not every block is important enough to serve as the page’s determinant last-modified date. The same goes for templates. Most template tweaks shouldn’t trigger a full blown recrawl, but some do, for example a new address or phone number if such information is present on every page.

For example a blog has posts, pages, comments, categories and other data sources that can change the sidebar’s contents quite frequently. On a page that outputs a single post or page, the last-modified date is determined by the post, respectively its last comment. The main page’s last-modified date is the modified-timestamp of the most recent post, and the same goes for its paginated continuations. A category page’s last-modified date is determined by the category’s most recent post, and so on.

New posts can change outgoing links of older posts when you use plugins that list related posts and stuff like that. There are many more reasons why search engines should crawl older posts at least monthly or so. You might need a routine that changes a blog page’s last-modified timestamp for example when it is a date more than 30 days or so in the past. Also, in some cases it could make sense to have a routine that can reset all timestamps reported as last-modified date for particular site areas, or even the whole site.

If your software doesn’t populate last-modified attributes on changes of all entities, then snap at the chance to consider database triggers, stored procedures, respectively changes of your data access layer. Bear in mind that not all changes of a record must trigger a crawler cache reset. For example a table storing textual contents like articles or product descriptions usually has a number of attributes that don’t affect crawling, thus it should have an attribute last updated  that’s changeable in the UI and serves as last-modified date in your crawler cache control (instead of the timestamp that’s changed automatically even on minor updates of attributes which are meaningless for HTML outputs).

Handling Last-Modified, If-Modified-Since, and Not-Modified HTTP headers with PHP/Apache

Below I provide example PHP code I’ve thrown together after midnight in a sleepless night, doped with painkillers. It doesn’t run on a production system, but it should get you started. Adapt it to your needs and make sure you test your stuff intensively. As always, my stuff comes as is  without any guarantees. ;)

First grab a couple helpers and put them in an include file you’ve available in all scripts. Since we deal with HTTP headers, you must not output anything before the logic that deals with conditional search engine requests, not even a single white space character, HTML DOCTYPE declaration …
View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)

In general, all user agents should support conditional GET requests, not only search engine crawlers. If you allow long lasting caching, which is fine with search engines that don’t need to crawl your latest Twitter message from your blog’s sidebar, you could leave your visitors with somewhat outdated pages if you serve them 304-Not-Modified responses too.

It might be a good idea to limit 304 responses to conditional GET requests from crawlers, when you don’t implement way shorter caching cycles for other user agents. The latter includes folks that spoof their user agent name as well as scrapers trying to steal your stuff masked as a legit spider. To verify legit search engine crawlers that (should) support conditional GET requests (from Google, Yahoo, MSN and Ask) you can grab my crawler detection routines here. Include them as well, then you can code stuff like that:

$isSpiderUA = checkCrawlerUA ();
$isLegitSpider = checkCrawlerIP (__FILE__);
if ($isSpiderUA && !$isLegitSpider) {
@header("Thou shalt not spoof", TRUE, 403);
exit;
// make sure your 403-Forbidden ErrorDocument directive in
// .htaccess points to a page that explains the issue!
}
if ($isLegitSpider) {
// insert your code dealing with conditional GET requests
}

Now that you’re sure that the requestor is a legit crawler from a major search engine, look at the HTTP request header it has submitted to your Web server.

// lookup the HTTP request header for a possible conditional GET
$ifModifiedSinceTimestamp = getIfModifiedSince();
// if the request is not conditional, don’t send a 304
$canSend304 = FALSE;
if ($ifModifiedSinceTimestamp !== FALSE) {
$canSend304 = TRUE;

// Tells the requestor that you’ve recognized the conditional GET
$echoRequestHeader = "X-Requested-If-modified-since: "
.unixTimestamp2HttpDate($ifModifiedSinceTimestamp);
@header($echoRequestHeader, TRUE);
}

You don’t need to echo the If-Modified-Since HTTP-date in the response header, but this custom header makes testing easier.

Next get the page’s actual last-modified date/time. Here is an (incomplete) code sample for a WordPress single post page.

// select the requested post's comment_count, post_modified and
 // post_date values, then:
if ($wp_post_modified) {
$lastModified = date2UnixTimestamp($wp_post_modified);
}
else {
$lastModified = date2UnixTimestamp($wp_post_date);
}
if (intval($wp_comment_count) > 0) {
// select last comment from the WordPress database, then:
$lastCommentTimestamp = date2UnixTimestamp($wp_comment_date);
if ($lastCommentTimestamp > $lastModified) {
$lastModified = $lastCommentTimestamp;
}
}

The date2UnixTimestamp() function accepts MySQL datetime values as valid input. If you need to (re)write last-modified dates to a MySQL database, convert the Unix timestamps to MySQL datetime values with unixTimestamp2MySqlDatetime().

Your server’s clock isn’t necessarily synchronized with all search engines out there. To cover possible gaps you can use a last-modified timestamp that’s a little bit fresher than the actual last-modified date. In this example the timestamp reported to the crawler is last-modified + 13 hours, you can change the deviant in makeLastModifiedTimestamp().
$lastModifiedTimestamp = makeLastModifiedTimestamp($lastModified);

If you compare the timestamps later on, and the request isn’t conditional, don’t run into the 304 routine.
if ($ifModifiedSinceTimestamp === FALSE) {
// make things equal if the request isn't conditional
$ifModifiedSinceTimestamp = $lastModifiedTimestamp;
}

You may want to allow a full fetch if the requestor’s timestamp is ancient, in this example older than one month.
$tooOld = @strtotime("now") - (31 * 24 * 60 * 60);
if ($ifModifiedSinceTimestamp < $tooOld) {
$lastModifiedTimestamp = @strtotime("now");
$ifModifiedSinceTimestamp = @strtotime("now") - (1 * 24 * 60 * 60);
}

Setting the last-modified attribute to yesterday schedules the next full crawl after this fetch in 30 days (or later, depending on the actual crawl frequency).

Finally respond with 304-Not-Modified if the page wasn’t remarkably changed since the date/time given in the crawler’s If-Modified-Since header. Otherwise send a Last-Modified header with a 200 HTTP response code, allowing the crawler to fetch the page contents.
$lastModifiedHeader = "Last-Modified: " .unixTimestamp2HttpDate($lastModifiedTimestamp);
if ($lastModifiedTimestamp < $ifModifiedSinceTimestamp &&
$canSend304) {
@header($lastModifiedHeader, TRUE, 304);
exit;
}
else {
@header($lastModifiedHeader, TRUE);
}

When you’re testing your version of this script with a browser, it will send a standard HTTP request, and your server will return a 200-OK. From your server’s response your browser should recognize the “Last-Modified” header, so when you reload the page the browser should send an “If-Modified-Since” header and you should get the 304 response code if Last-Modified > If-Modified-Since. However, judging from my experience such browser based tests of crawler behavior, respectively responses to crawler requests, aren’t reliable.

Test it with this MS tool instead. I’ve played with it for a while and it works great. With the PHP code above I’ve created a 200/304 test page
http://sebastians-pamphlets.com/tools/last-modified-yesterday.php
that sends a “Last-Modified: Yesterday” response header, and should return a 304-Not Modified HTTP response code when you request it with an “If-Modified-Since: Today+” header, otherwise it should respond with 200-OK (this version returns 200-OK only but tells when it would  respond with a 304). You can use this URI with the MS-tool linked above to test HTTP requests with different If-Modified-Since headers.

Have fun and paypal me 50% of your savings. ;)



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Why storing URLs with truncated trailing slashes is an utterly idiocy

Yahoo steals my trailing slashesWith some Web services URL canonicalization has a downside. What works great for major search engines like Google can fire back when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog’s reputation at Technorati.

In fact the problem is not your URL canonicalization, e.g. 301 redirects from http://example.com to http://example.com/ respectively http://example.com/directory to http://example.com/directory/, but crappy software that removes trailing forward slashes from your URLs.

Dear Web developers, if you really think that home page locations respectively directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “http://example.com” as anchor text makes any sense when the URL in HREF points to “http://example.com/”). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes. Many usages of invalid URLs are capable to damage the search engine rankings of the link destinations. You can’t control that, hence don’t screw our URLs. Never. Period.

Folks who don’t agree with the above said read on.

    TOC:

  • What is a trailing slash? About URLs, directory URIs, default documents, directory indexes, …
  • How to rescue stolen trailing slashes About Apache’s handling of directory requests, and rewriting respectively redirecting invalid directory URIs in .htaccess as well as in PHP scripts.
  • Why stealing trailing slashes is not cool Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.
  • How URL canonicalization irritates Technorati 301 redirects that “add” a trailing slash to directory URLs, respectively virtual URIs that mimic directories, seem to irritate Technorati so much that it can’t compute reputation, recent post lists, and so on.

What is a trailing slash?

The Web’s standards say (links and full quotes): The trailing path segment delimiter “/” represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite “should” as “must”.)

To understand that, lets look at the most common URL components:
scheme:// server-name.tld /path ?query-string #fragment
The (red) path part begins with a forward slash “/” and must consist of at least one byte (the trailing slash itself in case of the home page URL http://example.com/).

If an URL ends with a slash, it points to a directory’s default document, or, if there’s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because “/” after the TLD (.com|net|org|…) stands for the root directory.

Automated directory indexes (a list of links to all files) should be forbidden, use Options -Indexes in .htaccess to send such requests to your 403-Forbidden page.

In order to set default file names and their search sequence for your directories use DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php. In this example: on request of http://example.com/directory/ Apache will first look for /directory/index.html, then if that doesn’t exist for /directory/index.htm, then /directory/index.php, and if all that fails, it will serve an error page (that should log such requests so that the Webmaster can upload the missing default document to /directory/).

The URL http://example.com (without the trailing slash) is invalid, and there’s no specification telling a reason why a Web server should respond to it with meaningful contents. Actually, the location http://example.com points to Null  (nil, zilch, nada, zip, nothing), hence the correct response is “404 - we haven’t got ‘nothing to serve’ yet”.

The same goes for sub-directories. If there’s no file named “/dir”, the URL http://example.com/dir points to Null too. If you’ve a directory named “/dir”, the canonical URL http://example.com/dir/ either points to a directory index page (an autogenerated list of all files) or the directory’s default document “index.(html|htm|shtml|php|…)”. A request of http://example.com/dir –without the trailing slash that tells the Web server that the request is for a directory’s index– resolves to “not found”.

You must not reference a default document by its name! If you’ve links like http://example.com/index.html you can’t change the underlying technology without serious hassles. Say you’ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you’ll realize that static stuff sucks, hence you’ll develop a dynamic site with PHP. You’ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you’ve coded your internal links as http://example.com/contact/ etc. they’ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from “… index.html … index.php …” to “… index.php … index.html …”. (Of course you can configure Apache to parse .html files for PHP code, but that’s another story.)

It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of “index” files (intentionally or not). I’d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes.

How to rescue stolen trailing slashes

Since Web servers know that users are faulty by design, they jump through a couple of resource burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values …) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar.

With Apache, DirectorySlash On enables this behavior (check whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules:
RewriteEngine on
RewriteBase /content/
RewriteRule ^dir1$ http://example.com/content/dir1/ [R=301,L]
RewriteRule ^dir2$ http://example.com/content/dir2/ [R=301,L]

With content management systems (CMS) that generate virtual URLs on the fly, often there’s no other chance than hacking the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, you’ll perform permanent redirects (301).

Here is a WordPress (header.php) example:
$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$doRedirect = FALSE;
$fileExtensions = array(".html", ".htm", ".php");
$serverName = $_SERVER["SERVER_NAME"];
$canonicalServerName = $serverName;
 
// if you prefer http://example.com/* URLs remove the "www.":
$srvArr = explode(".", $serverName);
$canonicalServerName = $srvArr[count($srvArr) - 2] ."." .$srvArr[count($srvArr) - 1];
 
$url = parse_url ("http://" .$canonicalServerName .$requestUri);
$requestUriPath = $url["path"];
if (substr($requestUriPath, -1, 1) != "/") {
$isFile = FALSE;
foreach($fileExtensions as $fileExtension) {
if ( strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension) ) {
$isFile = TRUE;
}
}
if (!$isFile) {
$requestUriPath .= "/";
$doRedirect = TRUE;
}
}
$canonicalUrl = "http://" .$canonicalServerName .$requestUriPath;
if ($queryString) {
$canonicalUrl .= "?" . $queryString;
}
if ($url["fragment"]) {
$canonicalUrl .= "#" . $url["fragment"];
}
if ($doRedirect) {
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: $canonicalUrl");
exit;
}

Check your permalink settings and edit the values of $fileExtensions and $canonicalServerName accordingly. For other CMSs adapt the code, perhaps you need to change the handling of query strings and fragments. The code above will not run under IIS, because it has no REQUEST_URI variable.

Why stealing trailing slashes is not cool

This section expressed in one sentence: Cool URLs don’t change, hence changing other people’s URLs is not cool.

Folks should understand the “U” in URL as unique. Each URL addresses one and only one particular resource. Technically spoken, if you change one single character of an URL, the altered URL points to a different resource, or nowhere.

Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.

Yahoo'ish fools steal our trailing slashesOnly a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren’t much different. If somebody hands out an URL (with trailing slash), then use it as is, or don’t use it at all. Don’t “prettify” it, because any change destroys its serviceability.

If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (brower, screen reader, bot) with a redirect header telling that one must use a trailing slash, then the user agent has to re-issue the request in the formally correct way. From a Webmaster’s perspective, burning resources that thoughtlessly is plain theft. From a user’s perspective, things will often work without the slash, but they’ll be quicker with it. “Often” doesn’t equal “always”:

  • Some Web servers will serve the 404 page.
  • Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.
  • Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.
  • Some Web servers will serve identical contents for /dir and /dir/, that leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to all known URL variants, instead of accumulated rankings to both URLs (which would be the right thing to do, but is technically, well, challenging).
  • Some user agents can’t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect’s “error message”, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang up respectively process garbage.

Does it really make sense to maliciously manipulate URLs just because some clueless developers say “dude, without the slash it looks way cooler”? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach.

KISS (keep it simple, stupid) is a great principle. “Cosmetic corrections” like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don’t change their inputs. As long as the input is processible, they accept it, otherwise they skip it.

Exceptions

URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - “domain.co.uk” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than a visitor who doesn’t type in the canonical URL including scheme, www-prefix, and trailing slash.

How URL canonicalization seems to irritate Technorati

Due to the not exactly responsively (respectively swamped) Technorati user support parts of this section should be interpreted as educated speculation. Also, I didn’t research enough cases to come to a working theory. So here is just the story “how Technorati fails to deal with my blog”.

When I moved my blog from blogspot to this domain, I’ve enhanced the faulty WordPress URL canonicalization. If any user agent requests http://sebastians-pamphlets.com it gets redirected to http://sebastians-pamphlets.com/. Invalid post/page URLs like http://sebastians-pamphlets.com/about redirect to http://sebastians-pamphlets.com/about/. All redirects are permanent, returning the HTTP response code “301″.

I’ve claimed my blog as http://sebastians-pamphlets.com/, but Technorati shows its URL without the trailing slash.
…<div class="url"><a href="http://sebastians-pamphlets.com">http://sebastians-pamphlets.com</a> </div> <a class="image-link" href="/blogs/sebastians-pamphlets.com"><img …

By the way, they forgot dozens of fans (folks who “fave’d” either my old blogspot outlet or this site) too.
Blogs claimed at Technorati

I’ve added a description and tons of tags, that both don’t show up on public pages. It seems my tags were deleted, at least they aren’t visible in edit mode any more.
Edit blog settings at Technorati

Shortly after the submission, Technorati stopped to adjust the reputation score from newly discovered inbound links. Furthermore, the list of my recent posts became stale, although I’ve pinged Technorati with every update, and technorati received my update notifications via ping services too. And yes, I’ve tried manual pings to no avail.

I’ve gained lots of fresh inbound links, but the authority score didn’t change. So I’ve asked Technorati’s support for help. A few weeks later, in December/2007, I’ve got an answer:

I’ve taken a look at the issue regarding picking up your pings for “sebastians-pamphlets.com”. After making a small adjustment, I’ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.

Please let us know if you experience any problems in the future. Do not hesitate to contact us if you have any other questions.

Indeed, Technorati updated the reputation score from “56″ to “191″, and refreshed the list of posts including the most recent one.

Of course the “small adjustment” didn’t persist (I assume that a batch process stole the trailing slash that the friendly support person has added). I’ve sent a follow-up email asking whether that’s a slash issue or not, but didn’t receive a reply yet. I’m quite sure that Technorati doesn’t follow 301-redirects, so that’s a plausible cause for this bug at least.

Since December 2007 Technorati didn’t update my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the reactions page - by the way these numbers are often unreal and change in the range of hundreds from day to day).
Blog reactions and authority scoring at Technorati

It seems Technorati didn’t index my posts since then (December/18/2007), so probably my outgoing links don’t count for their destinations.
Stale list of recent posts at Technorati

(All screenshots were taken on February/05/2008. When you click the Technorati links today, it could hopefully will look differently.)

I’m not amused. I’m curious what would happen when I add
if (!preg_match("/Technorati/i", "$userAgent")) {/* redirect code */}

to my canonicalization routine, but I can resist to handle particular Web robots. My URL canonicalization should be identical both for visitors and crawlers. Technorati should be able to fix this bug without code changes at my end or weeky support requests. Wishful thinking? Maybe.

Update 2008-03-06: Technorati crawls my blog again. The 301 redirects weren’t the issue. I’ll explain that in a follow-up post soon.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Get a grip on the Robots Exclusion Protocol (REP)

REP command hierarchyThanks to the very nice folks over at SEOmoz I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round up:

Robots Exclusion Protocol 101

My REP 101  links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of each REP component. It explains the difference between crawler directives and indexer directives, and which command hierarchy search engines follow when REP directives put in different levels conflict.

Educate yourself on the REPWhy do I think that solid REP knowledge is important right now? Not only because of the confusion that exists thanks to the volume of crappy advice provided at every Webmaster hangout. Of course understanding the REP makes webmastering easier, thus I’m glad when my REP related pamphlets are considered somewhat helpful.

I’ve a hidden agenda, though. I predict that the REP is going to change shortly. As usual, its evolvement is driven by a major search engine, since the W3C and such organizations don’t bother with the conglomerate of quasi standards and RFCs known as the Robots Exclusion Protocol. In general that’s not a bad thing. Search engines deal with the REP every day, so they have a legitimate interest.

Unfortunately not every REP extension that search engines have invented so far is useful for Webmasters, some of them are plain crap. Learning from fiascos and riots of the past, the engines are well advised to ask Webmasters for feedback before they announce further REP directives.

I’ve a feeling that shortly a well known search engine will launch a survey regarding particular REP related ideas. I want that Webmasters are well aware of the REP’s complexity and functionality when they contribute their take on REP extensions. So please educate yourself. :)

My pamphlet discussing a possible standardization of REP tags as robots.txt directives could be a useful reference, also please watch the great video here. ;)



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Getting URLs outta Google - the good, the popular, and the definitive way

Keep out GoogleThere’s more and more robots.txt talk in the SEOsphere lately. That’s a good thing in my opinion, because the good old robots.txt’s power is underestimated. Unfortunately it’s quite often misused or even abused too, usually because folks don’t fully understand the REP (by following “advice” from forums instead of reading the real thing, or at least my stuff ).

I’d like to discuss the REP’s capabilities assumed to make sure that Google doesn’t index particular contents from three angles:

The good way
If the major search engines would support new robots.txt directives that Webmasters really need, removing even huge chunks of content from Google’s SERPs –without collateral damage– via robots.txt would be a breeze.
The popular way
Shamelessly stealing Matt’s official advice [Source: Remove your content from Google by Matt Cutts]. To obscure the blatant plagiarism, I’ll add a few thoughts.
The definitive way
Of course that’s not the ultimate way, but that’s the way Google’s cookies crumble, currently. In other words: Google is working on a leaner approach, but that’s not yet announced, thus you can’t use it; you still have to jump through many hoops.

The good way

Caution: Don’t implement code from this section, the robots.txt directives discussed here are not (yet/fully) supported by search engines!

Currently all robots.txt statements are crawler directives. That means that they can tell behaving search engines how to crawl a site (fetching contents), but they’ve no impact on indexing (listing contents on SERPs). I’ve recently published a draft discussing possible REP tags for robots.txt. REP tags are indexer directives known from robots meta tags and X-Robots-Tags, which –as on-page respectively per-URL directives– require crawling.

The crux is that REP tags must be assigned to URLs. Say you’ve a gazillion of printer friendly pages in various directories that you want to deindex at Google, putting the “noindex,follow,noarchive” tags comes with a shitload of work.

How cool would be this robots.txt code instead:
Noindex: /*printable
Noarchive: /*printable

Search engines would continue to crawl, but deindex previously indexed URLs respectively not index new URLs from
/articles/printable/*.htm
/manuals/printable/*.pdf
/products/descriptions/*.php?format=printable&product=*
...

provided those URLs aren’t disallow’ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn’t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can’t accumulate PageRank from internal or external links, you’d add
Norank: /*printable

to the robots.txt code block above.

If you don’t like that search engines index stuff you’ve disallow’ed in your robots.txt from 3rd party signals like inbound links, and that Google accumulates even PageRank for disallow’ed URLs, you’d put:
Disallow: /unsearchable/
Noindex: /unsearchable/
Norank: /unsearchable/

To fix URL canonicalization issues with PHP session IDs and other tracking variables you’d write for example
Truncate-variable sessionID: /

and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs.

Unfortunately, robots.txt is not yet that powerful, so please link to the REP tags for robotx.txt “RFC” to make it popular, and proceed with what you have at the moment.

Matt Cutts was kind enough to discuss Google’s take on contents excluded from search engine indexing in 10 minutes or less here:

You really should listen, the video isn’t that long.

In the following I’ve highlighted a few methods Matt has talked about:

Don’t link (very weak)
Although Google usually doesn’t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link from the cold.
.htaccess / .htpasswd (Matt’s first recommendation)
Since Google cannot crawl password protected contents, Matt declares this method to prevent content from indexing safe. I’m not sure what will happen when I spread a few strong links to somebody’s favorite smut collection, perhaps I’ll test some day whether Google and other search engines list such a reference on their SERPs.
robots.txt (weak)
Matt rightly points out that Google’s cool robots.txt validator in the Webmaster Console is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is, that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google’s smart capabilities to figure out suiteble references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn’t prevent content from showing up in Google and other engines as well. Matt didn’t mention Google’s experiments with Noindex: support in robots.txt, which need improvement but could resolve this dilemma.
Robots meta tags (Google only, weak with MSN/Yahoo)
The REP tag “noindex” in a robots meta element prevents from indexing, and, once spotted, deindexes previously listed stuff - at least at Google. According to Matt Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys “noindex” totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too. Also, search engines must crawl a page to discover this indexer directive. Matt adds that robots meta tags are problematic, because they’re buried on the pages and sometimes tend to get forgotten when no longer needed (Webmasters might do forget to take the tag down, respectively add it later on when search engines policies change, or work in progress gets released respectively outdated contents are taken down). Matt forgot to mention the neat X-Robots-Tags that can be used to apply REP tags in the HTTP header of non-HTML resources like images or PDF documents. Google’s X-Robots-Tag is supported by Yahoo too.
Rel-nofollow (kind of weak)
Although condoms totally remove links from Google’s link graphs, Matt says that rel-nofollow should not be used as crawler or indexer directive. Rel-nofollow is for condomizing links only, also other search engines do follow nofollow’ed links and even Google can discover the link destination from other links they gather on the Web, or grab from internal links inadvertently lacking a link condom. Finally, rel-nofollow requires crawling too.
URL removal tool in GWC (Matt’s second recommendation)
Taking Matt’s enthusiasm while talking about Google’s neat URL terminator into account, this one should be considered his first recommendation. Google provides tools to remove URLs from their search index since five years at least (way longer IIRC). Recently the Webmaster Central team has integrated those, as well as new functionality, into the Webmaster Console, donating it a very nice UI. The URL removal tools come with great granularity, and because the user’s site ownership is verified, it’s pretty powerful, safe, and shows even the progress for each request (the removal process lasts a few days). Its UI is very flexible and allows even revoking of previous removal requests. The wonderful little tool’s sole weak point is that it can’t remove URLs from the search index forever. After 90 days or possibly six months the erased stuff can pop up again.

Summary: If your site isn’t password protected, and you can’t live with indexing of disallow’ed contents, you must remove unwanted URLs from Google’s search index periodically. However, there are additional procedures that can support –but not guarantee!– deindexing. With other search engines it’s even worse, because those don’t respect the REP like Google, and don’t provide such handy URL removal tools.

The definitive way

Actually, I think Matt’s advice is very good. As long as you don’t need a permanent solution, and if you lack the programming skills to develop such a beast that works with all (major) search engines. I mean everybody can insert a robots meta tag or robots.txt statement, and everybody can semiyearly repeat URL removal requests with the neat URL terminator, but most folks are scared when it comes to conditional manipulation of HTTP headers to prevent stuff from indexing. However, I’ll try to explain quite safe methods that actually work (with Apache, not IIS) in the following examples.

First of all, if you really want that search engines don’t index your stuff, you must allow them to crawl it. And no, that’s not an oxymoron. At the moment there’s no such thing as an indexer directive on site-level. You can’t forbid indexing in robots.txt. All indexer directives require crawling of the URLs that you want to keep out of the SERPs. Of course that doesn’t mean you should serve search engine crawlers a book from each forbidden URL.

Lets start with robots.txt. You put
User-agent: *
Disallow: /images/
Disallow: /movies/
Disallow: /unsearchable/
 
User-agent: Googlebot
Disallow:
Allow: /
 
User-agent: Slurp
Disallow:
Allow: /

The first section is just a fallback.

(Here comes a rather brutal method that you can use to keep search engines out of particular directories. It’s not suitable to deal with duplicate content, session IDs, or other URL canonicalization. More on that later.)

Next edit your .htaccess file.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/unsearchable/
RewriteCond %{REQUEST_URI} !\.php
RewriteRule . /unsearchable/output-content.php [L]
</IfModule>

If you’ve .php pages in /unsearchable/ then remove the second rewrite condition, put output-content.php into another directory, and edit my PHP code below so that it includes the PHP scripts (don’t forget to pass the query string).

Now grab the PHP code to check for search engine crawlers here and include it below. Your script /unsearchable/output-content.php looks like:
<?php
@include("crawler-stuff.php"); // defines variables and functions
$isSpider = checkCrawlerIP ($requestUri);
if ($isSpider) {
@header("HTTP/1.1 403 Thou shalt not index this", TRUE, 403);
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
exit;
}
 
$arr = explode("#", $requestUri);
$outputFileName = $arr[0];
$arr = explode("?", $outputFileName);
$outputFileName = $_SERVER["DOCUMENT_ROOT"] .$arr[0];
if (substr($outputFileName, -1, 1) == "/") {
$outputFileName .= "index.html";
}
if (file_exists($outputFileName)) {
// send the content type header
$contentType = "text/plain";
if (stristr($outputFileName, ".html")) $contentType ="text/html";
if (stristr($outputFileName, ".css")) $contentType ="text/css";
if (stristr($outputFileName, ".js")) $contentType ="text/javascript";
if (stristr($outputFileName, ".png")) $contentType ="image/png";
if (stristr($outputFileName, ".jpg")) $contentType ="image/jpeg";
if (stristr($outputFileName, ".gif")) $contentType ="image/gif";
if (stristr($outputFileName, ".xml")) $contentType ="application/xml";
if (stristr($outputFileName, ".pdf")) $contentType ="application/pdf";
@header("Content-type: $contentType");
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
readfile($outputFileName);
exit;
}
 
// That’s not the canonical way to call the 404 error page. Don’t copy, adapt:
@header("HTTP/1.1 307 Oups, I displaced $outputFileName", TRUE, 307);
@header("Location: http://sebastians-pamphlets.com/404/");
exit;
?>

What does the gibberish above do? In .htaccess we rewrite all requests for resources stored in /unsearchable/ to a PHP script, which checks whether the request is from a search engine crawler or not.

If the requestor is a verified crawler (known IP or IP and host name belong to a major search engine’s crawling engine), we return an unfriendly X-Robots-Tag and an HTTP response code 403 telling the search engine that access to our content is forbidden. The search engines should assume that a human visitor receives the same response, hence they aren’t keen on indexing these URLs. Even if a search engine lists an URL on the SERPs by accident, it can’t tell the searcher anything about the uncrawled contents. That’s unlikely to happen actually, because the X-Robots-Tag forbids indexing (Ask and MSN might ignore these directives).

If the requestor is a human visitor, or an unknown Web robot, we serve the requested contents. If the file doesn’t exist, we call the 404 handler.

With dynamic content you must handle the query string and (expected) cookies yourself. PHP’s readfile() is binary safe, so the script above works with images or PDF documents too.

If you’ve an original search engine crawler coming from a verifiable server feel free to test it with this page (user agent spoofing doesn’t qualify as crawler, come back in a week or so to check whether the engines have indexed the unsearchable stuff linked above).

The method above is not only brutal, it wastes all the juice from links pointing to the unsearchable site areas. To rescue the PageRank, change the script as follows:

$urlThatDesperatelyNeedsPageRank = "http://sebastians-pamphlets.com/about/";
if ($isSpider) {
@header("HTTP/1.1 301 Moved permanently", TRUE, 301);
@header("Location: $urlThatDesperatelyNeedsPageRank");
exit;
}

This redirects crawlers to the URL that has won your internal PageRank lottery. Search engines will/shall transfer the reputation gained from inbound links to this page. Of course page by page redirects would be your first choice, but when you block entire directories you can’t accomplish this kind of granularity.

By the way, when you remove the offensive 403-forbidden stuff in the script above and change it a little more, you can use it to apply various X-Robots-Tags to your HTML pages, images, videos and whatnot. When a search engine finds an X-Robots-Tag in the HTTP header, it should ignore conflicting indexer directives in robots meta tags. That’s a smart way to steer indexing of bazillions of resources without editing them.

Ok, this was the cruel method; now lets discuss cases where telling crawlers how to behave is a royal PITA, thanks to the lack of indexer directives in robots.txt that provide the required granularity (Truncate-variable, Truncate-value, Order-arguments, …).

Say you’ve session IDs in your URLs. That’s one (not exactly elegant) way to track users or affiliate IDs, but strictly forbidden when the requestor is a search engine’s Web robot.

In fact, a site with unprotected tracking variables is a spider trap that would produce infinite loops in crawling, because spiders following internal links with those variables discover new redundant URLs with each and every fetch of a page. Of course the engines found suitable procedures to dramatically reduce their crawling of such sites, what results in less indexed pages. Besides joyless index penetration there’s another disadvantage - the indexed URLs are powerless duplicates that usually rank beyond the sonic barrier at 1,000 results per search query.

Smart search engines perform high sophisticated URL canonicalization to get a grip on such crap, but Webmasters can’t rely on Google & Co to fix their site’s maladies.

Ok, we agree that you don’t want search engines to index your ugly URLs, duplicates, and whatnot. To properly steer indexing, you can’t just block the crawlers’ access to URLs/contents that shouldn’t appear on SERPs. Search engines discover most of those URLs when following links, and that means that they’re ready to assign PageRank or other scoring of link popularity to your URLs. PageRank / linkpop is a ranking factor you shouldn’t waste. Every URL known to search engines is an asset, hence handle it with care. Always bother to figure out the canonical URL, then do a page by page permanent redirect (301).

For your URL canonicalization you should have an include file that’s available at the very top of all your scripts, executed before PHP sends anything to the user agent (don’t hack each script, maintaining so many places handling the same stuff is a nightmare, and fault-prone). In this include file put the crawler detection code and your individual routines that handle canonicalization and other search engine friendly cloaking routines.

View a Code example (stripping useless query string variables).

How you implement the actual canonicalization routines depends on your individual site. I mean, if you’ve not the coding skills necessary to accomplish that you wouldn’t read this entire section, wouldn’t you?

    Here are a few examples of pretty common canonicalization issues:

  • Session IDs and other stuff used for user tracking
  • Affiliate IDs and IDs used to track the referring traffic source
  • Empty values of query string variables
  • Query string arguments put in different order / not checking the canonical sequence of query string arguments (ordering them alphabetically is always a good idea)
  • Redundant query string arguments
  • URLs longer than 255 bytes
  • Server name confusion, e.g. subdomains like “www”, “ww”, “random-string” all serving identical contents from example.com
  • Case issues (IIS/clueless code monkeys handling GET-variables/values case-insensitive)
  • Spaces, punctuation, or other special characters in URLs
  • Different scripts outputting identical contents
  • Flawed navigation, e.g. passing the menu item to the linked URL
  • Inconsistent default values for variables expected from cookies
  • Accepting undefined query string variables from GET requests
  • Contentless pages, e.g. outputted templates when the content pulled from the database equals whitespace or is not available

Summary

Hiding contents from all search engines requires programming skills that many sites can’t afford. Even leading search engines like Google don’t provide simple and suitable ways to deindex content –respectively to prevent content from indexing– without collateral damage (lost/wasted PageRank). We desperately need better tools. Maybe my robots.txt extensions are worth an inspection.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

My plea to Google - Please sanitize your REP revamps

Standardization of REP tags as robots.txt directives

Google is confules on REP standards and robots.txtThis draft is kinda request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains their own REP standard). It’s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported by all major search engines. Furthermore it considers work in progress leaked out from Google.

In the following I’ll try to define a few robots.txt directives that Webmasters really need.

Show Table of Contents

Currently Google experiments with new robots.txt directives, that is REP tags like “noindex” adapted for robots.txt. That’s a welcomed and brilliant move.

Unfortunately, they got it totally wrong, again. (Skip the longish explanation of the rel-nofollow fiasco and my rant on Google’s current robots.txt experiments.)

Google’s last try to enhance the REP by adapting a REP tag’s value in another level was a miserable failure. Not because crawler directives on link-level are a bad thing, the opposite is true, but because the implementation of rel-nofollow confused the hell out of Webmasters, and still does.

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Don’t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster’s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.

Recap: When Google launched rel-nofollow in 2005, they promoted it as a tool to fight comment spam.

From now on, when Google sees the attribute (rel=”nofollow”) on hyperlinks, those links won’t get any credit when we rank websites in our search results. This isn’t a negative vote for the site where the comment was posted; it’s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.

Technically spoken, this translates to “search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, nor other link juice originating from rel-nofollow links”. Rel=”nofollow” meant rel=”pass-no-reputation”.

All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn’t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that didn’t update to a software version that supported rel-nofollow.

Experiment failed, move on to better solutions like Akismet, captchas or ajax’ed comment forms? Nope, it’s not that easy. Google had a hidden agenda. Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular band wagon. In 2005 Google had mastered the guestbook spam problem already. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more.

Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren’t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. Google actually invented rel-nofollow to get a grip on paid links. Recently they announced that Googlebot no longer follows condomized links (pre-Bigdaddy Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold war on paid links became hot.

Of course the sneaky morphing of rel-nofollow from “pass no reputation” to a full blown “nofollow” is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links.

To stay fair, Danny Sullivan said twice that rel-nofollow is Dave Winer’s fault, and Google as the victim is not to blame.

Rel-nofollow is settled now. However, I don’t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:

Google’s “Noindex: in robots.txt” experiment

Google supports a Noindex: directive in robots.txt. It seems Google’s Noindex: blocks crawling like Disallow:, but additionally prevents URLs blocked with Noindex: both from accumulating PageRank as well as from indexing based on 3rd party signals like inbound links.

This functionality would be nice to have, but accomplishing it with “Noindex” is badly wrong. The REP’s “Noindex” value without an explicit “Nofollow” means “crawl it, follow its links, but don’t list it on SERPs”. With pagel-level directives (robots meta tags and X-Robots-Tags) Google handles “Noindex” exactly as defined, that means with an implicit “Follow”. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the “Follow” out of the game, because a search engine can’t follow links from uncrawled documents.

Webmasters will not understand that “Nofollow” means totally different things in robots.txt and meta tags. Also, this approach steals granularity that we need, for example for use with technically structured sitemap pages and other hubs.

According to Google their current interpretation of Noindex: in robots.txt is not yet set in stone. That means there’s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters.

Dear Googlers, don’t take the above said as Google bashing. I know, and often wrote, that Google is the search engine that puts the most efforts in boring tasks like REP evolvement. I just think that a dog company like Google needs to take real-world Webmasters into the boat when playing with standards like the REP, for the sake of the cats. ;)

Recap: Existing robots.txt directives

The /path example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server’s root. Patterns can be useful to set crawler directives for a bunch of URIs:

  • *: any string in path or query string, including the query string delimiter “?”, multiple wildcards should be allowed.
  • $: end of URI
  • Trailing /: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectorie’s files etc., for example
    • Disallow: /path/
      matches /path/index.html but not /path.html
    • /path
      matches both /path/index.html and /path.html, as well as /path_1.html. It’s a pretty common mistake to “forget” the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.

Please note that patterns aren’t supported by all search engines, for example MSN supports only file extensions (yet?).

User-agent: [crawler name]
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the User-agent: * section that addresses all Web robots. Each User-agent: section must be terminated with at least one empty line.

Disallow: /path
Prevents from crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow’ed URLs can gather PageRank.

Allow: /path
Refines previous Disallow: statements. For example
Disallow: /scripts/
Allow: /scripts/page.php

tells crawlers that they may fetch http://example.com/scripts/page.php or http://example.com/scripts/page.php?article=1, but not any other URL in http://example.com/scripts/.

Sitemap: [absolute URL]
Announces XML sitemaps to search engines. Example:
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/video-sitemap.xml

points all search engines that support Google’s Sitemaps Protocol to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn’t replace sitemap submissions. Google, Yahoo and MSN provide Webmaster Consoles where you not only can submit your sitemaps, but follow the indexing process (wishful thinking WRT particular SEs). In some cases it might be a bright idea to avoid the default file name “sitemap.xml” and keep the sitemap URLs out of robots.txt, sitemap autodiscovery is not for everyone.

Recap: Existing REP tags

REP tags are values that you can use in a page’s robots meta tag and X-Robots-Tag. Robots meta tags go to the HTML document’s HEAD section
<meta name="robots" content="noindex, follow, noarchive" />

whereas X-Robots-Tags supply the same information in the HTTP header
X-Robots-Tag: noindex, follow, noarchive

and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.

    Widely supported REP tags are:

  • INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not
  • FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not
  • ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
  • NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.
  • NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.
  • NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
  • NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
  • UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes an URL from Google’s search index a day after the given date/time

Problems with REP tags in robots.txt

REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but also comes with more complexity and a few pitfalls as well.

  • Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can’t obey REP tags without crawling the resource that supplies them. That means that not a single REP tag put as robots.txt statement shall be misunderstood as crawler directive.

    For example Noindex: /path must not block crawling, not even in combination with Nofollow: /path, because there’s still the implicit “archive” (= absence of Noarchive: /path). Providing a cached copy even of a not indexed page makes sense for toolbar users.

    Whether or not a search engine actually crawls a resource that’s tagged with “noindex, nofollow, noarchive, nosnippet” or so is up to the particular SE, but none of those values implies a Disallow: /path.

  • Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says “follow” for all links on a page, the crawler will not follow a link that is condomized with rel=”nofollow”.

    Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Of course not in any case. Robots.txt is the gatekeeper, and so to say the “highest REP instance”. Actually, to this question there’s no absolute answer that satisfies everybody.

    A Webmaster sitting on a huge conglomerate of legacy code may want to totally switch to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page’s “index,follow” from 1998 most probably doesn’t fly with 2008’s duplicate content filters and high sophisticated link pattern analytics.

    The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.

  • REP tags used in robots.txt might refine crawler directives. For example a disallow’ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank caluculation and indexing for uncrawlable resources (see below under Noodp:/Noydir:, Noindex: and Norank:).

    Google’s current approach to handle this with the Noindex: directive alone is not acceptable, we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.

  • In theory it makes no sense to maintain a directive that describes a default behavior. But why has the REP “follow” although the absence of “nofollow” perfectly expresses “follow”? Because of the way non-geeks think (try to explain why the value nil/null doesn’t equal empty/zero/blank to a non-geek. Not!).

    Implicit directives that aren’t explicitely named and described in the rules don’t exist for the masses. Even in the 10 commandments someone had to write “thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack…” instead of a no-brainer like “publish unique and compelling content for people and make your stuff crawlable”. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow: respectively the absence of Disallow:) folks will whine, rant and argue when search engines ignore their stuff.

    Obviously we need at least Index:, Follow: and Archive to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like
    Disallow: /cgi-bin/
    Disallow: /content/
    ...
    Follow: /cgi-bin/redirect.php
    Follow: /content/links/
    ...
    Index: /content/articles/

    without Allow: /content/links/, Allow: /content/articles/ and Allow: /cgi-bin/redirect.

    Whether or not indexer directives that require crawling can overrule the crawler directive Disallow: is open for discussion. I vote for “not”.

  • Applying REP tags on site-level would be great, but it doesn’t solve other problems like the need of directives on block and element level. Both Google’s section targeting as well as Yahoo’s robots-nocontent class name aren’t acceptable tools capable to instruct search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).

    Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that’s doable, Webmasters will make use of such tools to improve their rankings, and not alone to comply to the ever changing search engine policies that cost the Webmaster community billions of man hours each year.

    I consider these robots.txt statements sexy:
    Nofollow a.advertising, div#adblock, span.cross-links: /path
    Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path

    but that’s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.

Remember the rel-nofollow fiasco where a REP tag was used on HTML element level producing so much confusion and conflicts. Lets learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it’s clear and unambiguous.

Priority settings

The REP’s command hierarchy must be well defined:

  1. robots.txt
  2. Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.
  3. [Future block level directives]
  4. Element level directives like rel-nofollow

That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt. As long as the Webmaster doesn’t revert the latter:

Priority-page-level: /path
Default behavior, directives in robots meta tags overrule robots.txt statements. Necessary to reset previous Priority-site-level: statements.

Priority-site-level: /path
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.

Priority-site-level All: /path
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The “All” parameter would even dofollow nofollow’ed links when the robots.txt lacks corresponding Nofollow: statements.

Noindex: /path

Follow outgoing links, archive the page, but don’t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.

[Currently Google doesn’t crawl Noindex’ed URLs and most probably those can’t accumulate PageRank, hence URLs in /path can’t distribute PageRank. That’s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there’s no explicit Nofollow:, nor a “nofollow” meta tag respectively X-Robots-Tag.]

Norank: /path

Prevents URLs from accumulating PageRank, anchor text, and whatever link juice.

Makes sense to refine Disallow: statements in company with Noindex: and Noodp:/Noydir:, or to prevent TOS/contact/privacy/… pages and alike from sucking PageRank (nofollow’ing TOS links and stuff like that to control PageRank flow is fault-prone).

Nofollow: /path

The uber-link-condom. Don’t use outgoing links, not even internal links, for discovery crawling. Don’t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).

Noarchive: /path

Don’t make a cached copy of the resource available to searchers.

Nosnippet: /path

List the resource with linked page title on SERPs, but don’t create a text snippet, and don’t reprint the description meta tag.

[Why don’t we have a REP tag saying “use my description meta tag or nothing”?]

Nopreview: /path

Don’t create/link an HTML preview of this resource. That’s interesting for subscriptions sites and applies mostly to PDFs, Word documents, spread sheets, presentations, and other non-HTML resources. More information here.

Noodp: /path

Don’t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn’t supply its own title/meta description.

Noydir: /path

I’m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: “Don’t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.”

Unavailable_after [date]: /path

Deindex the resource the day after [date]. The parameter [date] is put in any date or date/time format, if it lacks a timezone then GMT is assumed.

[Google’s RFC 850 obsession is somewhat weird. There are many ways to put a timestamp other than “25-Aug-2007 15:00:00 EST”.]

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

In the search index remove the unwanted variable/value pair(s) from the URL’s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this “bare URL” redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.

Regardless whether the variable name or the variable’s value matches the pattern, “Truncate_*” statements remove a complete argument from the query string, that is &variable=value. If after the (last) truncate operation the query string is empty, the querystring delimiter “?” (questionmark) must be removed too.

Order-arguments [charset]: /path

Sort the query strings of all dynamic URLs by variable name, then within