Getting URLs outta Google - the good, the popular, and the definitive way

Keep out GoogleThere’s more and more robots.txt talk in the SEOsphere lately. That’s a good thing in my opinion, because the good old robots.txt’s power is underestimated. Unfortunately it’s quite often misused or even abused too, usually because folks don’t fully understand the REP (by following “advice” from forums instead of reading the real thing, or at least my stuff ).

I’d like to discuss the REP’s capabilities assumed to make sure that Google doesn’t index particular contents from three angles:

The good way
If the major search engines would support new robots.txt directives that Webmasters really need, removing even huge chunks of content from Google’s SERPs –without collateral damage– via robots.txt would be a breeze.
The popular way
Shamelessly stealing Matt’s official advice [Source: Remove your content from Google by Matt Cutts]. To obscure the blatant plagiarism, I’ll add a few thoughts.
The definitive way
Of course that’s not the ultimate way, but that’s the way Google’s cookies crumble, currently. In other words: Google is working on a leaner approach, but that’s not yet announced, thus you can’t use it; you still have to jump through many hoops.

The good way

Caution: Don’t implement code from this section, the robots.txt directives discussed here are not (yet/fully) supported by search engines!

Currently all robots.txt statements are crawler directives. That means that they can tell behaving search engines how to crawl a site (fetching contents), but they’ve no impact on indexing (listing contents on SERPs). I’ve recently published a draft discussing possible REP tags for robots.txt. REP tags are indexer directives known from robots meta tags and X-Robots-Tags, which –as on-page respectively per-URL directives– require crawling.

The crux is that REP tags must be assigned to URLs. Say you’ve a gazillion of printer friendly pages in various directories that you want to deindex at Google, putting the “noindex,follow,noarchive” tags comes with a shitload of work.

How cool would be this robots.txt code instead:
Noindex: /*printable
Noarchive: /*printable

Search engines would continue to crawl, but deindex previously indexed URLs respectively not index new URLs from
/articles/printable/*.htm
/manuals/printable/*.pdf
/products/descriptions/*.php?format=printable&product=*
...

provided those URLs aren’t disallow’ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn’t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can’t accumulate PageRank from internal or external links, you’d add
Norank: /*printable

to the robots.txt code block above.

If you don’t like that search engines index stuff you’ve disallow’ed in your robots.txt from 3rd party signals like inbound links, and that Google accumulates even PageRank for disallow’ed URLs, you’d put:
Disallow: /unsearchable/
Noindex: /unsearchable/
Norank: /unsearchable/

To fix URL canonicalization issues with PHP session IDs and other tracking variables you’d write for example
Truncate-variable sessionID: /

and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs.

Unfortunately, robots.txt is not yet that powerful, so please link to the REP tags for robotx.txt “RFC” to make it popular, and proceed with what you have at the moment.

Matt Cutts was kind enough to discuss Google’s take on contents excluded from search engine indexing in 10 minutes or less here:

You really should listen, the video isn’t that long.

In the following I’ve highlighted a few methods Matt has talked about:

Don’t link (very weak)
Although Google usually doesn’t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link from the cold.
.htaccess / .htpasswd (Matt’s first recommendation)
Since Google cannot crawl password protected contents, Matt declares this method to prevent content from indexing safe. I’m not sure what will happen when I spread a few strong links to somebody’s favorite smut collection, perhaps I’ll test some day whether Google and other search engines list such a reference on their SERPs.
robots.txt (weak)
Matt rightly points out that Google’s cool robots.txt validator in the Webmaster Console is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is, that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google’s smart capabilities to figure out suiteble references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn’t prevent content from showing up in Google and other engines as well. Matt didn’t mention Google’s experiments with Noindex: support in robots.txt, which need improvement but could resolve this dilemma.
Robots meta tags (Google only, weak with MSN/Yahoo)
The REP tag “noindex” in a robots meta element prevents from indexing, and, once spotted, deindexes previously listed stuff - at least at Google. According to Matt Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys “noindex” totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too. Also, search engines must crawl a page to discover this indexer directive. Matt adds that robots meta tags are problematic, because they’re buried on the pages and sometimes tend to get forgotten when no longer needed (Webmasters might do forget to take the tag down, respectively add it later on when search engines policies change, or work in progress gets released respectively outdated contents are taken down). Matt forgot to mention the neat X-Robots-Tags that can be used to apply REP tags in the HTTP header of non-HTML resources like images or PDF documents. Google’s X-Robots-Tag is supported by Yahoo too.
Rel-nofollow (kind of weak)
Although condoms totally remove links from Google’s link graphs, Matt says that rel-nofollow should not be used as crawler or indexer directive. Rel-nofollow is for condomizing links only, also other search engines do follow nofollow’ed links and even Google can discover the link destination from other links they gather on the Web, or grab from internal links inadvertently lacking a link condom. Finally, rel-nofollow requires crawling too.
URL removal tool in GWC (Matt’s second recommendation)
Taking Matt’s enthusiasm while talking about Google’s neat URL terminator into account, this one should be considered his first recommendation. Google provides tools to remove URLs from their search index since five years at least (way longer IIRC). Recently the Webmaster Central team has integrated those, as well as new functionality, into the Webmaster Console, donating it a very nice UI. The URL removal tools come with great granularity, and because the user’s site ownership is verified, it’s pretty powerful, safe, and shows even the progress for each request (the removal process lasts a few days). Its UI is very flexible and allows even revoking of previous removal requests. The wonderful little tool’s sole weak point is that it can’t remove URLs from the search index forever. After 90 days or possibly six months the erased stuff can pop up again.

Summary: If your site isn’t password protected, and you can’t live with indexing of disallow’ed contents, you must remove unwanted URLs from Google’s search index periodically. However, there are additional procedures that can support –but not guarantee!– deindexing. With other search engines it’s even worse, because those don’t respect the REP like Google, and don’t provide such handy URL removal tools.

The definitive way

Actually, I think Matt’s advice is very good. As long as you don’t need a permanent solution, and if you lack the programming skills to develop such a beast that works with all (major) search engines. I mean everybody can insert a robots meta tag or robots.txt statement, and everybody can semiyearly repeat URL removal requests with the neat URL terminator, but most folks are scared when it comes to conditional manipulation of HTTP headers to prevent stuff from indexing. However, I’ll try to explain quite safe methods that actually work (with Apache, not IIS) in the following examples.

First of all, if you really want that search engines don’t index your stuff, you must allow them to crawl it. And no, that’s not an oxymoron. At the moment there’s no such thing as an indexer directive on site-level. You can’t forbid indexing in robots.txt. All indexer directives require crawling of the URLs that you want to keep out of the SERPs. Of course that doesn’t mean you should serve search engine crawlers a book from each forbidden URL.

Lets start with robots.txt. You put
User-agent: *
Disallow: /images/
Disallow: /movies/
Disallow: /unsearchable/
 
User-agent: Googlebot
Disallow:
Allow: /
 
User-agent: Slurp
Disallow:
Allow: /

The first section is just a fallback.

(Here comes a rather brutal method that you can use to keep search engines out of particular directories. It’s not suitable to deal with duplicate content, session IDs, or other URL canonicalization. More on that later.)

Next edit your .htaccess file.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/unsearchable/
RewriteCond %{REQUEST_URI} !\.php
RewriteRule . /unsearchable/output-content.php [L]
</IfModule>

If you’ve .php pages in /unsearchable/ then remove the second rewrite condition, put output-content.php into another directory, and edit my PHP code below so that it includes the PHP scripts (don’t forget to pass the query string).

Now grab the PHP code to check for search engine crawlers here and include it below. Your script /unsearchable/output-content.php looks like:
<?php
@include("crawler-stuff.php"); // defines variables and functions
$isSpider = checkCrawlerIP ($requestUri);
if ($isSpider) {
@header("HTTP/1.1 403 Thou shalt not index this", TRUE, 403);
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
exit;
}
 
$arr = explode("#", $requestUri);
$outputFileName = $arr[0];
$arr = explode("?", $outputFileName);
$outputFileName = $_SERVER["DOCUMENT_ROOT"] .$arr[0];
if (substr($outputFileName, -1, 1) == "/") {
$outputFileName .= "index.html";
}
if (file_exists($outputFileName)) {
// send the content type header
$contentType = "text/plain";
if (stristr($outputFileName, ".html")) $contentType ="text/html";
if (stristr($outputFileName, ".css")) $contentType ="text/css";
if (stristr($outputFileName, ".js")) $contentType ="text/javascript";
if (stristr($outputFileName, ".png")) $contentType ="image/png";
if (stristr($outputFileName, ".jpg")) $contentType ="image/jpeg";
if (stristr($outputFileName, ".gif")) $contentType ="image/gif";
if (stristr($outputFileName, ".xml")) $contentType ="application/xml";
if (stristr($outputFileName, ".pdf")) $contentType ="application/pdf";
@header("Content-type: $contentType");
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
readfile($outputFileName);
exit;
}
 
// That’s not the canonical way to call the 404 error page. Don’t copy, adapt:
@header("HTTP/1.1 307 Oups, I displaced $outputFileName", TRUE, 307);
@header("Location: http://sebastians-pamphlets.com/404/");
exit;
?>

What does the gibberish above do? In .htaccess we rewrite all requests for resources stored in /unsearchable/ to a PHP script, which checks whether the request is from a search engine crawler or not.

If the requestor is a verified crawler (known IP or IP and host name belong to a major search engine’s crawling engine), we return an unfriendly X-Robots-Tag and an HTTP response code 403 telling the search engine that access to our content is forbidden. The search engines should assume that a human visitor receives the same response, hence they aren’t keen on indexing these URLs. Even if a search engine lists an URL on the SERPs by accident, it can’t tell the searcher anything about the uncrawled contents. That’s unlikely to happen actually, because the X-Robots-Tag forbids indexing (Ask and MSN might ignore these directives).

If the requestor is a human visitor, or an unknown Web robot, we serve the requested contents. If the file doesn’t exist, we call the 404 handler.

With dynamic content you must handle the query string and (expected) cookies yourself. PHP’s readfile() is binary safe, so the script above works with images or PDF documents too.

If you’ve an original search engine crawler coming from a verifiable server feel free to test it with this page (user agent spoofing doesn’t qualify as crawler, come back in a week or so to check whether the engines have indexed the unsearchable stuff linked above).

The method above is not only brutal, it wastes all the juice from links pointing to the unsearchable site areas. To rescue the PageRank, change the script as follows:

$urlThatDesperatelyNeedsPageRank = "http://sebastians-pamphlets.com/about/";
if ($isSpider) {
@header("HTTP/1.1 301 Moved permanently", TRUE, 301);
@header("Location: $urlThatDesperatelyNeedsPageRank");
exit;
}

This redirects crawlers to the URL that has won your internal PageRank lottery. Search engines will/shall transfer the reputation gained from inbound links to this page. Of course page by page redirects would be your first choice, but when you block entire directories you can’t accomplish this kind of granularity.

By the way, when you remove the offensive 403-forbidden stuff in the script above and change it a little more, you can use it to apply various X-Robots-Tags to your HTML pages, images, videos and whatnot. When a search engine finds an X-Robots-Tag in the HTTP header, it should ignore conflicting indexer directives in robots meta tags. That’s a smart way to steer indexing of bazillions of resources without editing them.

Ok, this was the cruel method; now lets discuss cases where telling crawlers how to behave is a royal PITA, thanks to the lack of indexer directives in robots.txt that provide the required granularity (Truncate-variable, Truncate-value, Order-arguments, …).

Say you’ve session IDs in your URLs. That’s one (not exactly elegant) way to track users or affiliate IDs, but strictly forbidden when the requestor is a search engine’s Web robot.

In fact, a site with unprotected tracking variables is a spider trap that would produce infinite loops in crawling, because spiders following internal links with those variables discover new redundant URLs with each and every fetch of a page. Of course the engines found suitable procedures to dramatically reduce their crawling of such sites, what results in less indexed pages. Besides joyless index penetration there’s another disadvantage - the indexed URLs are powerless duplicates that usually rank beyond the sonic barrier at 1,000 results per search query.

Smart search engines perform high sophisticated URL canonicalization to get a grip on such crap, but Webmasters can’t rely on Google & Co to fix their site’s maladies.

Ok, we agree that you don’t want search engines to index your ugly URLs, duplicates, and whatnot. To properly steer indexing, you can’t just block the crawlers’ access to URLs/contents that shouldn’t appear on SERPs. Search engines discover most of those URLs when following links, and that means that they’re ready to assign PageRank or other scoring of link popularity to your URLs. PageRank / linkpop is a ranking factor you shouldn’t waste. Every URL known to search engines is an asset, hence handle it with care. Always bother to figure out the canonical URL, then do a page by page permanent redirect (301).

For your URL canonicalization you should have an include file that’s available at the very top of all your scripts, executed before PHP sends anything to the user agent (don’t hack each script, maintaining so many places handling the same stuff is a nightmare, and fault-prone). In this include file put the crawler detection code and your individual routines that handle canonicalization and other search engine friendly cloaking routines.

View a Code example (stripping useless query string variables).

How you implement the actual canonicalization routines depends on your individual site. I mean, if you’ve not the coding skills necessary to accomplish that you wouldn’t read this entire section, wouldn’t you?

    Here are a few examples of pretty common canonicalization issues:

  • Session IDs and other stuff used for user tracking
  • Affiliate IDs and IDs used to track the referring traffic source
  • Empty values of query string variables
  • Query string arguments put in different order / not checking the canonical sequence of query string arguments (ordering them alphabetically is always a good idea)
  • Redundant query string arguments
  • URLs longer than 255 bytes
  • Server name confusion, e.g. subdomains like “www”, “ww”, “random-string” all serving identical contents from example.com
  • Case issues (IIS/clueless code monkeys handling GET-variables/values case-insensitive)
  • Spaces, punctuation, or other special characters in URLs
  • Different scripts outputting identical contents
  • Flawed navigation, e.g. passing the menu item to the linked URL
  • Inconsistent default values for variables expected from cookies
  • Accepting undefined query string variables from GET requests
  • Contentless pages, e.g. outputted templates when the content pulled from the database equals whitespace or is not available

Summary

Hiding contents from all search engines requires programming skills that many sites can’t afford. Even leading search engines like Google don’t provide simple and suitable ways to deindex content –respectively to prevent content from indexing– without collateral damage (lost/wasted PageRank). We desperately need better tools. Maybe my robots.txt extensions are worth an inspection.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

26 Comments to "Getting URLs outta Google - the good, the popular, and the definitive way"

  1. theGypsy on 14 January, 2008  #link

    Great stuff once again Seb… so when is the Robots.txt Pamplet (eBook) comming out? Should wrap the entire series up dood….hmmmmmmm

    Dave

  2. Sebastian on 14 January, 2008  #link

    Thanks Dave. :) I won’t write an ebook on robots.txt because things evolve way too fast, blog posts are the better medium.

  3. theGypsy on 14 January, 2008  #link

    Well sure…. that’s why you should only put out a pamphlet on it .. he he….

  4. Sebastian on 14 January, 2008  #link

    Yeah, pamphlets are the way to go. Unfortunately they sometimes piss off the recipient of the pamphlet’s message.

  5. JLH on 14 January, 2008  #link

    Brilliant. I have nothing to add. You’ve covered it quite well. Feel free to delete my comment.

  6. Sebastian on 15 January, 2008  #link

    Thanks John. :) Why should I delete that nice ego food? ;)

  7. Marty on 15 January, 2008  #link

    another great article, im sure like some of the comments read we could do with some of this in an E-book but like seb says the technology moves at such at pace it would need re-written each week..

    the video from Matt was a great addition. For me, when i read somthing i have to read it a few times, to get it drummed in there, same as auditory but visual works a treat.

    Thanks ;)

  8. Sebastian on 15 January, 2008  #link

    Thanks Marty. :) May I ask for a portion SM voodoo? I compete with SEOmoz on the Sphinn home page for “deindexing”. ;) j/k

  9. SearchCap: The Day In Search, January 15, 2008…

    Below is what happened in search today, as reported on Search Engine Land and from other places across the web…….

  10. Vingold on 15 January, 2008  #link

    This post specifically and this whole web site in general are quite the piece of work. Very well done. You sir are a skilled artisan. Masterful.

  11. Gab "SEO ROI" Goldenberg on 15 January, 2008  #link

    Seb, you’re the guy whose technical expertise I respect the most, and I think that goes for most people in search. So you’re doing yourself a disservice with lines like these:

    “According to Matt Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys “noindex” totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too.”

    You’re citing a Google engineer on Yahoo and MSN’s algos. Then citing him again (at least, in context, it sounds that way) saying why Google is better. Even if it’s true, it doesn’t look good. At the very least there should be some response from Yahoo and MSN’s reps, or else your own SERP analysis.

    My 2 cents. I still refer clients and leads to your stuff when I don’t know the answer lol, so don’t take that personally or anything ;).

  12. Sebastian on 16 January, 2008  #link

    Gab, you can trust Matt, consider him the co-author of this post (because I’ve stolen most of it from his video). Seriously, Matt’s right and I’ve seen such references myself. Yahoo staff seldom comments here (in fact that happened only once), and the MSN dudes should hate me for my occasional rants as well, hence I don’t expect that I’d get a statement from any search engine other than Google (Googlers are very responsive from my experience). You’ve a valid point, though. In the “popular way” section I’ve mixed quotes from Matt with my comments in a way that it’s not clear who said what. That’s another reason why I’ve embedded the video. Next time I scrape content I’ll try to do a better job.

  13. Sebastian on 16 January, 2008  #link

    Thanks Vingold. :) I’m glad you consider my stuff helpful.

  14. […] Matt Cutts said that Google doesn’t index password protected content. I wasn’t sure whether or not that […]

  15. karl on 17 January, 2008  #link

    My own solution for blocking bots and/or feedreaders. It’s in French but the .htaccess file is easy to understand. :)

    Blocking Web robots by user agent string

    Best

    [That works, but it’s a kinda weak/risky approach - you don’t cover spoofing and such. Sebastian]

  16. […] “If you want to remove stuff from search indexes permanently, not only Google’s, then you need to serve verified crawlers a forbidden response code (see how to deindex content).” […]

  17. Melanie Phung on 31 January, 2008  #link

    Thanks Sebastian - another good (and useful!) post. I worked on a site that committed 10 of the mistakes on your list (the first nine and then the last one). We did eventually fix most of it, but I’m wondering if that was a record or something. You ever seen a site that was guilty of more than 10 of those items all at once?

  18. Sebastian on 31 January, 2008  #link

    Thanks Melanie. :) Unfortunately, that’s not a record. ;( I’ve spotted more than 10 issues from the list above, plus a few that I didn’t mention, on e-commerce sites developed with ASP running on IIS.

  19. Mark on 26 September, 2008  #link

    Just found this website and what awesome technical detail.

    I have always avoided changing the robots.txt file but it is actually neccessary.

    Great Info.

  20. Sebastian on 29 September, 2008  #link

    Just a reminder: I do not link out to make_me_rich_in_a_second_ebooks and crap like that. Of course you can’t be bothered to read a comment policy …

  21. […] Getting URLs outta Google - the good, the popular, and the definitive way Sebastian X | 1/14/08 […]

  22. […] this post got nominated for a SEMMYS award. I’m honoured to be included in the company of Sebastian, Aaron and Vanessa. My own education in this industry has been made possible by the articles that […]

  23. […] This post made it on the short-list of the SEMMY 2009 Award: Please click here and vote for my pamphlet! […]

  24. […] Getting URLs outta Google - the good, the popular, and the definitive way Sebastian X | 1/14/08 […]

  25. Daniel on 5 February, 2009  #link

    A good, in-depth article with helpful video, but I disagree that the Google removal tool is a recommended method. First, there’s the problem of reappearing in six months but more importantly it’s Google only. A huge number of searches are non-Google, especially outside the USA, meaning this is only a partial solution. With such a large proportion of bots ignoring robots.txt, for me .htaccess is the only reliable answer.

    Edit: Just realised your title is “Getting URLs out Google”. OK, forget what I said! (although I still think .htaccess is the best way to go…)

  26. […] is controlling what bots can crawl and index on your site, so some pamphlets would be useful as wellGetting URLs outta Google – the good, the popular, and the definitive way Handling Google’s neat X-Robots-Tag – Sending REP header tags with PHPNasty Bots & […]

Leave a reply


[If you don't do the math, or the answer is wrong, you'd better have saved your comment before hitting submit. Here is why.]

Be nice and feel free to link out when a link adds value to your comment. More in my comment policy.