Google to change the Robots Exclusion Protocol again

Posted on 30 November, 2007

Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), have evolved since 1994. Nowadays we have to deal with a conglomerate of non-binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey “the standard”, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives that, as a rule, differs from what the other players support.

Google is the search engine putting the most effort into evolving the Robots Exclusion Protocol (REP). Their XML Sitemaps, which handle submissions instead of crawl restrictions, widened the REP’s scope; the X-Robots-Tag brought us robots meta tags for non-HTML resources like PDF documents, images or video clips; and with Unavailable_after Google made a few clueless news sites happy. With the rel-nofollow microformat on the other hand, or rather with its sneaky morphing from a spam fighting tool into its current shape, Google made nobody happy. Yahoo contributed the well-meant but half-assed “robots-nocontent” class name, and of course “noydir” (it’s unlikely that any other engine will support those).

Now Google is working on new robots.txt syntax, and I am, politely put, not amused. Here is why I fear that Google is going to totally mess up the REP:

Google supports a “Noindex:” directive in robots.txt, which is treated as “Disallow:”¹⁾. Of course that’s an experiment, but if this behavior doesn’t change we’ll get a beast that is, with regard to the confusion it will produce, way more evil than the rel-nofollow fiasco.

A noindex-alias for disallow makes no sense, even when such syntax errors are out there.
Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It’s bad enough that most Webmasters still believe that “Googlebot ranks their stuff”. (Actually, in some cases it can make sense. For example “nofollow” in robots meta tags (and, at least for Google, in REL attributes too) is both a crawler instruction and an indexer directive.)
Noindex and disallow are completely different commands. The REP’s noindex directive means “crawl it, follow its links, but don’t list it on the SERPs”. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.
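To see how far “Noindex:” is from the robots.txt everybody else implements, here is a minimal sketch using the robots.txt parser from Python’s standard library. The bot name, host and paths are made up for illustration. A parser that only knows the classic directives honors the “Disallow:” line and silently skips the “Noindex:” line, so the URL Google’s experiment would refuse to crawl stays fetchable for everybody else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Noindex: /archive/
"""


def build_parser(text):
    """Feed robots.txt content to the standard library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and example.com are placeholders, not real crawler names.
print_allowed = parser.can_fetch("ExampleBot", "http://example.com/print/page.html")
archive_allowed = parser.can_fetch("ExampleBot", "http://example.com/archive/2007-11.html")

print("may fetch /print/page.html:", print_allowed)
print("may fetch /archive/2007-11.html:", archive_allowed)
```

This says nothing about how Googlebot handles the line internally. It only shows that the same file means different things to different consumers, which is exactly the ambiguity a standard should avoid.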

Standards should be clear and unambiguous. Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they’ve the power to fuck up the REP, but they also have “do no evil”.

Considering that Google is run by a bunch of smart engineers, I hope that they’ll do the right thing eventually. The right thing in this case is putting more power behind the REP’s evolution, before questionable and selfish anti-search initiatives like ACAP ruin both the robots.txt consensus and the robots meta tag standard.

My idea of more power for the REP’s evolution is:

Sensible implementation of crawler/indexer-directives adapted from REP tags in robots.txt. Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully nopreview) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header’s X-Robots-Tag.
Implementation of block-level directives in robots.txt. Allowing Webmasters to apply crawler instructions like “noindex” or “nofollow” to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eliminate many points of failure.
Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.
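To illustrate the first suggestion, here is a rough Python sketch of what applying REP tags to groups of URIs could look like from the engine’s side. Everything in it is an assumption: no engine has published such a syntax, the directive lines and paths are invented, and for brevity the sketch ignores User-agent groups and wildcards. The point is only that indexer directives get collected per path prefix and do not touch the decision whether a URL may be crawled.

```python
# Hypothetical indexer directives for robots.txt; not an existing standard.
INDEXER_TAGS = {"noindex", "nofollow", "noarchive", "nosnippet", "noodp", "noydir"}


def parse_indexer_rules(text):
    """Collect (path_prefix, tag) pairs from the hypothetical indexer lines."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in INDEXER_TAGS and value:
            rules.append((value, field))
    return rules


def tags_for(path, rules):
    """Return the sorted REP tags whose path prefix matches the given path."""
    return sorted({tag for prefix, tag in rules if path.startswith(prefix)})


# Made-up example file: crawling stays allowed, only indexing is steered.
EXAMPLE = """\
User-agent: *
Noindex: /archive/
Nofollow: /archive/
Noarchive: /news/
"""

rules = parse_indexer_rules(EXAMPLE)
print(tags_for("/archive/2007-11.html", rules))
print(tags_for("/news/today.html", rules))
print(tags_for("/", rules))
```

A crawler working this way would still fetch /archive/ pages, because nothing here says Disallow. It would merely treat them as if each carried the matching robots meta tag.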

The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the other bigger engines) talk with Webmasters and then launch the necessary REP extensions. The other engines will follow sooner or later. The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers.

¹⁾ Noindex: is not a plain Disallow:; there’s an interesting difference. In Google’s experiment both directives block crawling, but Disallow: allows URL-indexing based on 3rd party information, and Disallow’ed URLs can accumulate PageRank from internal as well as external links. Noindex’ed URLs on the other hand will not appear on SERPs as URL-only listings or with an ODP title and snippet, and I’m quite sure that they will gather neither PageRank nor other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google’s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that’s a neat idea, but its implementation is flawed because it doesn’t consider the implicit Follow: (that’s syntax Google doesn’t support in robots.txt). A better way to mark site areas which shall not gather PageRank without mangling the REP would be a Norank: directive or so. Noindex: without a Nofollow: must not block crawling. Googlebot must fetch those URLs to follow their links.


Sebastian | Crawler Directives, Robots Meta Tags, X-Robots-Tag, XML-Sitemaps, robots.txt, Microformats, Google, Yahoo, Nofollow | Related posts

8 Comments to "Google to change the Robots Exclusion Protocol again"

Michael VanDeMar on 30 November, 2007 #link

Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they’ve the power to fuck up the REP, but they also have “do no evil”.

Can we really still claim that we believe that “Don’t Be Evil” remains a part of their creed? Just because the phrase “absolute power corrupts absolutely” is a cliche does not mean that it doesn’t still hold true in many cases.

Currently the only thing that Google seems to care about, the only thing that is likely to curb them from doing things with a negative impact, is image, and there only if it is something that will cause a stir with the mainstream. With something as technical as this I doubt that enough of a buzz will occur for them to even consider not screwing with it.

david deangelo on 3 December, 2007 #link

I have always thought disallow: would not allow a page to be crawled and hence would not allow page rank to accumulate. Does that mean I should use noindex: too for the pagerank to go to those pages from links on my other pages?

Sebastian on 3 December, 2007 #link

“Disallow:” forbids crawling, but allows URLs to gather PageRank, anchor text and whatnot. That’s old news. The new “Noindex:” directive in robots.txt would block PageRank flow and indexing under anchor text or titles/descriptions of 3rd party links, but it’s experimental. As long as Google doesn’t announce it there’s no official support, and the behavior described in my post can change without notice. Currently there’s no sure-fire way to prevent Disallow’ed URLs from getting indexed or from accumulating link juice of any kind.

david deangelo on 3 December, 2007 #link

Thanks Sebastian. I think I will not use noindex: in the robots.txt until it’s become official. I only use the disallow: on redirected URLs e.g. affiliate links.

Matt on 3 December, 2007 #link

I guess people reacted the same way when the no follow was added, but now some people think it’s a great thing…I’m not so sure this or the no follow are actual benefits to the internet…seems Google is impeding the growth of the net with all these stipulations

Search Engine Land: News About Search Engines & Search Marketing on 3 December, 2007 #link

SearchCap: The Day In Search, December 3, 2007…

Below is what happened in search today, as reported on Search Engine Land and from other places across the web…

Utah SEO on 24 December, 2007 #link

yeah, block-level directives would kick ass

My plea to Google - Please sanitize your REP revamps on 3 January, 2008 #link

[…] Google experiments with new robots.txt directives, that is REP tags like “noindex” adapted for robots.txt. That’s a welcomed and […]

Sebastian’s Pamphlets
