Q&A: An undocumented robots.txt crawler directive from Google

What's the fuss about noindex in Google's robots.txt?

Blogging should be fun every now and then. Today I won't tell you anything new about Google’s secret experiments with the robots exclusion protocol; instead I’ll ask you, because I’m sure you know your stuff. Unfortunately, this Q&A on undocumented robots.txt syntax from Google’s labs utilizes JavaScript, so perhaps it looks somewhat weird in your feed reader.

Q: Please look at this robots.txt file and figure out why it’s worth a Q&A with you, my dear reader:


User-Agent: *
Disallow: /
Noindex: /

Ok, click here to show the first hint.

I know, this one was a breeze, so here comes your challenge.
Q: Which crawler directive used in the robots.txt above was introduced in 1996 in the Robots Exclusion Protocol (REP), but was not defined in its very first version from 1994?

Ok, click here to show the second hint.

Congrats, you are smart. I’m sure you don’t need to look up the next answers.
Q: Which major search engine has a team permanently working on REP extensions and releases those quite frequently, and who is the engineer in charge?

Ok, click here to show the third hint.

Exactly. Now we’ve gathered all the pieces of this robots.txt puzzle.
Q: Could you please summarize your findings and conclusions?

Ok, click here to show the fourth hint.

Thank you, dear reader! Now let’s see what we can dig out. If the appearance of a “Noindex:” directive in robots.txt is an experiment, it would make sense that Ms. Googlebot understands and obeys it. Unfortunately, I sold all the source code I stole from Google and didn’t keep a copy for myself, so I need to speculate a little.

Last time I looked, Google’s cool robots.txt validator merely emulated crawler behavior, which means the crawlers sometimes understood syntax the validator didn’t handle correctly. Maybe this has changed in the meantime: perhaps the validator pulls its code from the “real thing” now, or at least the “Noindex:” experiment may have found its way into the validator’s portfolio. So I thought that testing the newish robots.txt statement “Noindex:” in the Webmaster Console was worth a try. And yes, it told me that Googlebot understands this command and interprets it as “Disallow:”:
Blocked by line 27: Noindex: /noindex/
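
For the record, the line the validator refers to is simply a statement like the one below, somewhere in the robots.txt I fed it; the /noindex/ path is just my test directory, so take this as an illustrative sketch rather than the exact file:

User-agent: *
Noindex: /noindex/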

Since validation is no proof of crawler behavior, I’ve set up a page “blocked” by a “Noindex:” directive in robots.txt and linked it in my sidebar. The noindex statement was in place long enough before I uploaded and linked the spider trap that the engines shouldn’t use a cached robots.txt when they follow my links. My test is public, so feel free to check out my robots.txt as well as the crawler log.

While I’m waiting for the expected growth of my noindex crawler log, I’m speculating. Why the heck would Google use a new robots.txt directive which behaves like the good old Disallow: statement? Makes no sense to me.

Let’s not forget that this mysterious noindex statement was discovered in the robots.txt of Google’s ad server, not in the better known and closely watched robots.txt of google.com. Google is not the only search engine trying to better understand client-side code. None of the major engines should be interested in crawling ads for ranking purposes. The MSN/LiveSearch referrer spam fiasco demonstrates that search engine bots can fetch and render Google ads served in iFrames on pagead2.googlesyndication.com.

Since, to date, nobody besides Google supports the X-Robots-Tag (sending “noindex” and other REP directives in the HTTP header), maybe the engines have a silent deal that content marked with “Noindex:” in robots.txt shouldn’t be indexed. Microsoft’s bogus spam bot, which doesn’t bother with robots.txt because it somewhat haplessly tries to emulate a human surfer, is not considered a crawler; its existence just proves that “software shop” is not a valid label for M$.
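
As an aside, the X-Robots-Tag is nothing more than a plain HTTP response header. A minimal sketch, assuming Apache with mod_headers enabled, that would keep all PDFs out of the index looks roughly like this:

<FilesMatch "\.pdf$">
  # Send the REP directive in the HTTP header of every PDF response
  Header set X-Robots-Tag "noindex"
</FilesMatch>

The response then carries “X-Robots-Tag: noindex”, which so far only Google’s crawlers act on.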

This theory has a few weak points, but it could point to something. If noindex in robots.txt really prevents indexing of content crawled by accident, or of non-HTML content that can’t supply robots meta tags, that would be a very useful addition to the robots exclusion protocol. Of course we’d then need Noarchive:, Nofollow: and Nopreview: too, probably more, but I’m not really in a greedy mood today.
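
Purely hypothetical, of course, but a robots.txt section using such directives might then read like this (the paths are made up for illustration):

User-agent: *
Noindex: /pdfs/
Noarchive: /drafts/
Nofollow: /partner-links/
Nopreview: /members/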

Back to my crawler trap. Refreshing the log reveals that 30 minutes after spreading links pointing to it, Googlebot has fetched the page. That seems to prove that the Noindex: statement doesn’t prevent crawling, regardless of the false (?) information handed out by Google’s robots.txt validator.

(Or didn’t I give Ms. Googlebot enough time to refetch my robots.txt? Dunno. The robots.txt copy in my Google Webmaster Console still doesn’t show the Noindex: statement, but I doubt that’s the version Googlebot uses because according to the last-downloaded timestamp in GWC the robots.txt has been changed at the time of the download. Never mind. If I was way too impatient, I still can test whether a newly discovered noindex directive in robots.txt actually deindexes stuff or not.)

On with the show. The next interesting question is: will the crawler trap page make it into Google’s search index? Without the possibly ineffective noindex directive, a few hundred links should be able to accomplish that. Alas, a quoted search query delivers zilch so far.

Of course I’ve asked Google for more information, but didn’t receive a conclusive answer so far. While waiting for an official statement, I take a break from live blogging this quick research in favor of terrorizing a few folks with respectless blog comments. Stay tuned. Be right back.


Well, meanwhile I had dinner, the kids fell asleep (hopefully until tomorrow morning), but nothing else happened. A very nice and friendly Googler is trying to find out what the noindex-in-robots.txt fuss is all about; thanks, and I can’t wait! However, I suspect the info is either forgotten or deeply buried in some well secured top secret code libraries, hence I’ll push the red button soon.


Thanks to Google’s great Webmaster Central team, especially Susan, I learned that I was flogging a dead horse. Here is Google’s take on Noindex in robots.txt:

As stated in my previous note, I wasn’t aware that we recognized any directives other than Allow/Disallow/Sitemap, so I did some asking around.

Unfortunately, I don’t have an answer that I can currently give you. […] I can’t contribute any clarifications right now.

Thank you Susan!

Update: John Müller from Google has just confirmed that their crawler understands the Noindex: syntax, but it’s not yet set in stone.




20 Comments to "Q&A: An undocumented robots.txt crawler directive from Google"

  1. JLH on 14 November, 2007  #link

    You crack me up sometimes. So do you get up every morning and check all the robots.txt’s of the search engines or do you have a script that checks to see if anything has changed?

    I guess it’s possible that someone would like to have a page crawled, it’s links followed, but not indexed, for perhaps an interstitial categories listing page on an ecommerce site. You’d like the subsequent product listings pages to be indexed, but not the page that links to all of them. I dunno, just trying to find a use for it.

  2. Sebastian on 14 November, 2007  #link

    John, I’m a sniffer but I don’t monitor SE robots.txt files and stuff like that systematically. Of course such a script would be helpful, maybe it’s a good idea, err, of course it’s a good idea to write it. Usually I rely on gut feelings, and that works quite well.

    Of course more evolution of the REP standards would be great. Maybe some day I’ll really ask Dan Crow for an internship to speed things up. I don’t see any other engine seriously working on the REP, at least not with Google’s passion. :(

  3. Igor The Troll on 14 November, 2007  #link

    Very funny, I was going in an infinite loop until you pushed exit()

    And the animals, where you get them? Do you have a zoo..:)

    Sebastian, you must be a core hacker - you may need to decompile that for the unaware ones.

  4. Sebastian on 14 November, 2007  #link

    Thanks to my cute monsters I’ve a zoo. Actually, I buy licenses for the images. I’m not a hacker.

  5. SearchCap: The Day In Search, November 14, 2007…

    Below is what happened in search today, as reported on Search Engine Land and from other places across the web…….

  6. Igor The Troll on 14 November, 2007  #link

    Core hacking is identifying and second-guessing implementation peculiarities.

    Awareness of the implementation differences in standard protocols and mechanisms and the ability to second-guess the closed-source systems’ implementors is a core hacker competence. Of course, it’s not a solely hacker-specific skill—many developers who work in closed-source environments develop this skill while debugging undocumented features of their target platforms.

  7. Sam Daams on 14 November, 2007  #link

    This kind of thing has happened before; not too long ago I remember a discussion or two on Google Groups regarding the entire code.google.com being blocked off. They said it was a mistake. Here are some threads:
    http://groups.google.com/group/codesite-discuss/browse_thread/thread/f5422cf7788c97ea/1d5a6211705ee597
    http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/cd740055a596d4e3/38e17d66a91c8f5a?lnk=st&q=code.google#38e17d66a91c8f5a

    You just have to hope these kinds of mistakes don’t happen in their search algo, where we can’t bring it to their attention :)
    Sam

  8. g1smd on 19 November, 2007  #link

    Is it just a part of the Robustness Principle in operation at Google?

    Maybe they are supporting syntax that they see a lot of people using in error?

  9. Sam Daams on 20 November, 2007  #link

    I don’t know about this specific case, but in the case of the code.google one they admitted it was just a massive mistake. Of course that might have been more related to disallowing the entire code.google.com subdomain than this particular non existing syntax being added in. Either way, it smelt extremely amateur.

  10. Sebastian on 20 November, 2007  #link

    Maybe this statement wasn’t put for testing, by mistake or lack of understanding the REP. Perhaps there’s a robot out there which understands “Noindex: /” in robots.txt and annoyed Google by downloading and rendering AdSense code? If that’s the case, they also should call MSN to ask how to stop the referrer spam bot lowering AdSense CTR … [evilgrin]

  11. JohnMu on 20 November, 2007  #link

    Good catch, Sebastian. How is your experiment going? At the moment we will usually accept the “noindex” directive in the robots.txt, but we are not yet at a point where we are willing to set it into stone and announce full support.

  12. Sebastian on 20 November, 2007  #link

    Thanks for your confirmation, John! Googlebot fetched the crawler trap on 2007-11-14 07:23:03 and it’s not indexed. Maybe the page makes it into the index if, at the time of crawling and indexing, you had a cached robots.txt without the Noindex: statement, but I doubt it. I’ll watch that for a while and then launch another experiment w/o telling the URI. ;)
    Do you support wildcards with “Noindex:” too, for example
    Noindex: /*.txt$
    Noarchive: /cloaked/*.html$
    Nofollow: /links/*.php$
    Nopreview: /scholar/*.pdf$
    or so? That would be exciting! Please LMK :)

  13. JohnMu on 20 November, 2007  #link

    Wildcards should be ok, as they are for the normal disallow: and allow: directives. I just want to remind everyone again that this is something that may still change over time. Be careful when playing with things like this :) .

  14. […] week I reported that Google experiments with new crawler directives for use in robots.txt. Today Google has confirmed that Googlebot understands experimental REP syntax like […]

  15. Igor The Troll on 21 November, 2007  #link

    Sebastian, why do you rank number 1 for Igor The Troll in Google?

    Is Sebastian Igor The Troll?
    http://www.google.com/search?hl=en&q=Igor+The+Troll

    I know you created me in GWHG, but now you are taking over my Nick…are you some Alien being that sucks the souls out of people and their Websites?

    NP, everyone knows that Sebastian is the father of Igor The Troll

    Or, are you gaming Google with Black Hat Seo?

  16. Sebastian on 21 November, 2007  #link

    Probably a trust issue; no SEO voodoo involved. It seems mine is one of the very few sites still allowing you to leave your nick. I don’t try to outrank your absurd home page; why should I?

    You’ve created your nick yourself, and it’s suitable. At Google I just told you that trolling is not appreciated.

    Igor, you were warned that insane and childish off-topic comments will get you banned eventually. I’ve deleted a few of them and if you don’t behave you’re history. That’s my last warning. Behave yourself or get the fuck outta here.

  17. Utah SEO Pro on 25 November, 2007  #link

    This discovery is quite the milestone. I foresee many new capabilities with robots.txt in the future due to this. However, how the hell can a webmaster hide their robots.txt from the public while serving it up to bots without doing anything shady? Do you have any example .htaccess code? Cuz I know you’re the pimp master at that kinda shit.

    Way to lay it thick on Igor. I love how you’re so forward. haha

  18. […] loyal reader of my pamphlets asked me: I foresee many new capabilities with robots.txt in the future due to this [Google’s […]

  19. Sebastian on 26 November, 2007  #link

    Jordan, thanks, and give this a try.
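
    In a nutshell, the idea is to rewrite requests for robots.txt to a generic copy unless the user agent is a known crawler. A rough mod_rewrite sketch of that idea (the bot names and the robots-public.txt file are placeholders, not the exact code from the post linked above):

    RewriteEngine On
    # Everyone except the named crawlers gets the generic file
    RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|msnbot) [NC]
    RewriteRule ^robots\.txt$ /robots-public.txt [L]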

  20. […] engines can’t index your sitemap files you might be able to block them with robots.txt (e.g., noindex directive), and/or send an x-robots-tag HTTP header telling Google and Yahoo not to index them. This entry […]
