Q: Please look at this robots.txt file and figure out why it’s worth a Q&A with you, my dear reader:
I know, this one was a breeze, so here comes your challenge.
Q: Which crawler directive used in the robots.txt above was introduced in 1996 in the Robots Exclusion Protocol (REP), but was not defined in its very first version from 1994?
Ok, click here to show the second hint. A: “Noindex” of course. By the way, someone should fix the robotstxt.org server because the 503-errors make it somewhat hard to research the answers.
Congrats, you are smart. I’m sure you don’t need to look up the next answers.
Q: Which major search engine has a team permanently working on REP extensions and releases those quite frequently, and who is the engineer in charge?
Ok, click here to show the third hint. A: The search engine is Google, and the REP team’s head is Dan Crow. Lurking on their servers and reading their HTTP headers sometimes reveals exciting stuff they’re working on.
Exactly. Now we’ve gathered all the pieces of this robots.txt puzzle.
Q: Could you please summarize your findings and conclusions?
Ok, click here to show the fourth hint. A: The “noindex” value is a crawler directive valid in robots meta tags as well as X-Robots-Tags, not (yet) in robots.txt. Maybe Google has something cooking, or the “Noindex: /” statement is just an experiment.
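To make the two established places for “noindex” concrete, here is a minimal Python sketch that checks a response for the directive in an X-Robots-Tag header or in a robots meta tag. The names `RobotsMetaParser` and `has_noindex` are made up for this illustration, and the sketch ignores finer points such as crawler-specific prefixes in the header value; it is not how any search engine actually implements this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (bare attributes), so normalize them.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the header or a robots meta tag carries noindex.

    `headers` is a plain dict of response headers (hypothetical input).
    """
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

Note that nothing in this check looks at robots.txt, which is exactly the point: a crawler has to fetch the page before it can see either signal.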
Thank you, dear reader! Now let’s see what we can dig out. If the appearance of a “Noindex:” directive in robots.txt is an experiment, it would make sense that Ms. Googlebot understands and obeys it. Unfortunately, I sold all the source code I stole from Google and didn’t keep a copy for myself, so I need to speculate a little.
Last time I looked, Google’s cool robots.txt validator only emulated crawler behavior, which means that the crawlers understood syntax the validator didn’t handle correctly. Maybe this has changed in the meantime; perhaps the validator pulls its code from the “real thing” now, or at least the “Noindex:” experiment may have found its way into the validator’s portfolio. So I thought that testing the newish robots.txt statement “Noindex:” in the Webmaster Console was worth a try. And yes, it told me that Googlebot understands this command, and interprets it as “Disallow:”.
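What “interprets it as Disallow:” would mean in practice can be sketched with a toy parser. This is only an illustration of the validator’s reported reading, not Google’s code: the function names and the sample paths are hypothetical, only the “User-agent: *” group is considered, a group is assumed to start with a single User-agent line, and matching is by plain path prefix.

```python
def blocked_prefixes(robots_txt, treat_noindex_as_disallow=True):
    """Collect the path prefixes blocked for 'User-agent: *'.

    With treat_noindex_as_disallow=True, a "Noindex:" line is handled
    exactly like a "Disallow:" line (the behavior the validator reported).
    """
    blocking = {"disallow"}
    if treat_noindex_as_disallow:
        blocking.add("noindex")

    prefixes = []
    applies = False  # True while we are inside the "User-agent: *" group
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = value == "*"
        elif applies and field in blocking and value:
            prefixes.append(value)
    return prefixes


def is_blocked(path, prefixes):
    """Plain prefix match, as in the original 1994 convention."""
    return any(path.startswith(prefix) for prefix in prefixes)


# Made-up sample, not the robots.txt discussed in this post.
sample = "User-agent: *\nDisallow: /private/\nNoindex: /trap/\n"
print(is_blocked("/trap/page.html", blocked_prefixes(sample)))         # True
print(is_blocked("/trap/page.html", blocked_prefixes(sample, False)))  # False
```

Under that reading, a crawler that obeys the validator’s interpretation would never request anything below a “Noindex:” path, which is what the test below sets out to check.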
Since validation is no proof of crawler behavior, I’ve set up a page “blocked” with a “Noindex:” directive in robots.txt and linked it in my sidebar. The noindex statement was in place long enough before I uploaded and linked the spider trap, so that the engines shouldn’t use a cached robots.txt when they follow my links. My test is public, feel free to check out my robots.txt as well as the crawler log.
While I’m waiting for the expected growth of my noindex crawler log, I’m speculating. Why the heck would Google use a new robots.txt directive which behaves like the good old Disallow: statement? Makes no sense to me.
Let’s not forget that this mysterious noindex statement was discovered in the robots.txt of Google’s ad server, not in the better known and closely watched robots.txt of google.com. Google is not the only search engine trying to better understand client-side code. None of the major engines should be interested in crawling ads for ranking purposes. The MSN/LiveSearch referrer spam fiasco demonstrates that search engine bots can fetch and render Google ads output in iFrames on pagead2.googlesyndication.com.
Since nobody supports Google’s X-Robots-Tag (sending “noindex” and other REP directives in the HTTP header) to date, maybe the engines have a silent deal that content marked with “Noindex:” in robots.txt shouldn’t be indexed. Microsoft’s bogus spam bot, which doesn’t bother with robots.txt because it somewhat haplessly tries to emulate a human surfer, is not considered a crawler; its existence just proves that “software shop” is not a valid label for M$.
This theory has a few weak points, but it could point to something. If noindex in robots.txt really prevents indexing of content crawled by accident, or of non-HTML content that can’t supply robots meta tags, that would be a very useful addition to the robots exclusion protocol. Of course we’d then need Noarchive:, Nofollow: and Nopreview: too, probably more, but I’m not really in a greedy mood today.
Back to my crawler trap. Refreshing the log reveals that 30 minutes after spreading links pointing to it, Googlebot has fetched the page. That seems to prove that the Noindex: statement doesn’t prevent crawling, regardless of the false (?) information handed out by Google’s robots.txt validator.
(Or didn’t I give Ms. Googlebot enough time to refetch my robots.txt? Dunno. The robots.txt copy in my Google Webmaster Console still doesn’t show the Noindex: statement, but I doubt that’s the version Googlebot uses, because according to the last-downloaded timestamp in GWC the robots.txt had already been changed at the time of the download. Never mind. If I was way too impatient, I can still test whether a newly discovered noindex directive in robots.txt actually deindexes stuff or not.)
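For readers who want to repeat the experiment with their own server logs, the check for Googlebot fetches of a trap page could look roughly like the sketch below. It assumes access log lines in the common “combined” format (request and user agent in double quotes); my own crawler log works differently, and the function name and trap path are placeholders. Keep in mind that a user agent string can be faked, so a hit only suggests Googlebot was there.

```python
def googlebot_hits(log_lines, trap_path):
    """Return the log lines where a Googlebot user agent requested trap_path.

    Assumes the combined log format:
    host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
    """
    hits = []
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not a combined-format line
        request, user_agent = parts[1], parts[5]
        fields = request.split()
        if len(fields) >= 2 and fields[1] == trap_path and "Googlebot" in user_agent:
            hits.append(line)
    return hits
```

Comparing the timestamp of the first hit with the time the robots.txt change went live shows whether the crawler could have been working from a cached copy.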
On with the show. The next interesting question is: Will the crawler trap page make it into Google’s search index? Without the possibly non-effective noindex directive, a few hundred links should be able to accomplish that. Alas, a quoted search query delivers zilch so far.
Of course I’ve asked Google for more information, but haven’t received a conclusive answer so far. While waiting for an official statement, I take a break from live blogging this quick research in favor of terrorizing a few folks with disrespectful blog comments. Stay tuned. Be right back.
Well, meanwhile I had dinner, the kids fell asleep –hopefully until tomorrow morning–, but nothing else happened. A very nice and friendly Googler tries to find out what the noindex in robots.txt fuss is all about, thanks and I can’t wait! However, I suspect the info is either forgotten or deeply buried in some well secured top secret code libraries, hence I’ll push the red button soon.
Thanks to Google’s great Webmaster Central team, especially Susan, I learned that I was flogging a dead horse. Here is Google’s take on Noindex in robots.txt:
As stated in my previous note, I wasn’t aware that we recognized any directives other than Allow/Disallow/Sitemap, so I did some asking around.
Unfortunately, I don’t have an answer that I can currently give you. […] I can’t contribute any clarifications right now.
Thank you Susan!