Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), evolved since 1994. Nowadays we’ve to deal with a conglomerate of not binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey “the standard”, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives, differently from other players as a rule.
Google is the search engine putting the most efforts into Robots Exclusion Protocol (REP) evolvements. Their XML Sitemaps handling submissions instead of crawl restrictions changed the REP to a wider scope, the X-Robots-Tag brought us robots meta tags for non-HTML resources like PDF documents, images or video clips, and with Unavailable_after Google made a few clueless news sites happy. With the rel-nofollow microformat on the other hand, respectively its sneaky morphing from a spam fighting tool to its current shape, Google made nobody happy. Yahoo contributed the well meant but half-assed “robots-nocontent” class name, and of course “noydir” (it’s unlikely that any other engine will support those).
Google supports a “Noindex:” directive in robots.txt, which is treated as “Disallow:”1). Of course that’s an experiment, but if this behavior doesn’t change we’ll get a beast that is –with regard to the confusion it will produce– way more evil than the rel-nofollow fiasco.
- A noindex-alias for disallow makes no sense, even when such syntax errors are out there.
- Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It’s bad enough that most Webmasters still believe that “Googlebot ranks their stuff”. (Actually, in some cases it can make sense. For example “nofollow” in robots meta tags (or at least for Google in REL attributes too) is both a crawler instruction as well as an indexer directive.)
- Noindex and disallow are completely different commands. The REP’s noindex directive means “crawl it, follow its links, but don’t list it on the SERPs”. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.
Standards should be clear and unambiguous. Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they’ve the power to fuck up the REP, but they also have “do no evil”.
Considering that Google is run by a bunch of smart engineers, I hope that they’ll do the right thing eventually. The right thing in this case is giving more power to REP evolvements, before questionable and selfish anti-search initiatives like ACAP ruin both the robots.txt consensus as well as the robots meta tag standard.
My idea of more power to REP evolvements is:
- Sensible implementation of crawler/indexer-directives adapted from REP tags in robots.txt. Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully nopreview) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header’s X-Robots-Tag.
- Implementation of block-level directives in robots.txt. Allowing Webmasters to apply crawler instructions like “noindex” or “nofollow” to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eleminate many points of failure.
- Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.
The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the bigger engines) talk with Webmasters and then launch the necessary REP extensions. The other engines will follow sooner or later. The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers.
1) Noindex: is not a plain Disallow:, there’s an interesting difference. In Google’s experiment both directives block crawling, but Disallow: allows URL-indexing based on 3rd party information, and Disallow:‘ed URLs can accumulate PageRank from internal as well as external links. Noindex:‘ed URLs on the other hand will not appear on SERPs as URL-only listing or with an ODP title and snippet, and I’m quite sure that they will not gather PageRank nor other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google’s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that’s a neat idea, but its implementation is flawed because it doesn’t consider the implicit Follow: (that’s syntax Google doesn’t support in robots.txt). A better way to mark site areas which shall not gather PageRank without raping the REP would be a Norank: directive or so. Noindex: without a Nofollow: must not block crawling. Googlebot must fetch those URLs to follow their links.
Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to Entries Comments All Comments