Standardization of REP tags as robots.txt directives
This draft is kinda request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains their own REP standard). It’s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported by all major search engines. Furthermore it considers work in progress leaked out from Google.
In the following I’ll try to define a few robots.txt directives that Webmasters really need.
Unfortunately, they got it totally wrong, again. (Skip the longish explanation of the rel-nofollow fiasco and my rant on Google’s current robots.txt experiments.)
Google’s last try to enhance the REP by adapting a REP tag’s value in another level was a miserable failure. Not because crawler directives on link-level are a bad thing, the opposite is true, but because the implementation of rel-nofollow confused the hell out of Webmasters, and still does.
Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes
Don’t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster’s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.
Recap: When Google launched rel-nofollow in 2005, they promoted it as a tool to fight comment spam.
From now on, when Google sees the attribute (rel=”nofollow”) on hyperlinks, those links won’t get any credit when we rank websites in our search results. This isn’t a negative vote for the site where the comment was posted; it’s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.
Technically spoken, this translates to “search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, nor other link juice originating from rel-nofollow links”. Rel=”nofollow” meant rel=”pass-no-reputation”.
All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn’t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that didn’t update to a software version that supported rel-nofollow.
Experiment failed, move on to better solutions like Akismet, captchas or ajax’ed comment forms? Nope, it’s not that easy. Google had a hidden agenda. Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular band wagon. In 2005 Google had mastered the guestbook spam problem already. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more.
Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren’t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. Google actually invented rel-nofollow to get a grip on paid links. Recently they announced that Googlebot no longer follows condomized links (pre-Bigdaddy Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold war on paid links became hot.
Of course the sneaky morphing of rel-nofollow from “pass no reputation” to a full blown “nofollow” is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links.
Rel-nofollow is settled now. However, I don’t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:
Google’s “Noindex: in robots.txt” experiment
Google supports a Noindex: directive in robots.txt. It seems Google’s Noindex: blocks crawling like Disallow:, but additionally prevents URLs blocked with Noindex: both from accumulating PageRank as well as from indexing based on 3rd party signals like inbound links.
This functionality would be nice to have, but accomplishing it with “Noindex” is badly wrong. The REP’s “Noindex” value without an explicit “Nofollow” means “crawl it, follow its links, but don’t list it on SERPs”. With pagel-level directives (robots meta tags and X-Robots-Tags) Google handles “Noindex” exactly as defined, that means with an implicit “Follow”. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the “Follow” out of the game, because a search engine can’t follow links from uncrawled documents.
Webmasters will not understand that “Nofollow” means totally different things in robots.txt and meta tags. Also, this approach steals granularity that we need, for example for use with technically structured sitemap pages and other hubs.
According to Google their current interpretation of Noindex: in robots.txt is not yet set in stone. That means there’s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters.
Dear Googlers, don’t take the above said as Google bashing. I know, and often wrote, that Google is the search engine that puts the most efforts in boring tasks like REP evolvement. I just think that a dog company like Google needs to take real-world Webmasters into the boat when playing with standards like the REP, for the sake of the cats.
Recap: Existing robots.txt directives
The /path example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server’s root. Patterns can be useful to set crawler directives for a bunch of URIs:
- *: any string in path or query string, including the query string delimiter “?”, multiple wildcards should be allowed.
- $: end of URI
- Trailing /: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectorie’s files etc., for example
matches /path/index.html but not /path.html
matches both /path/index.html and /path.html, as well as /path_1.html. It’s a pretty common mistake to “forget” the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.
Please note that patterns aren’t supported by all search engines, for example MSN supports only file extensions (yet?).
User-agent: [crawler name]
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the
User-agent: * section that addresses all Web robots. Each User-agent: section must be terminated with at least one empty line.
Prevents from crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow’ed URLs can gather PageRank.
Refines previous Disallow: statements. For example
tells crawlers that they may fetch http://example.com/scripts/page.php or http://example.com/scripts/page.php?article=1, but not any other URL in http://example.com/scripts/.
Sitemap: [absolute URL]
Announces XML sitemaps to search engines. Example:
points all search engines that support Google’s Sitemaps Protocol to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn’t replace sitemap submissions. Google, Yahoo and MSN provide Webmaster Consoles where you not only can submit your sitemaps, but follow the indexing process (wishful thinking WRT particular SEs). In some cases it might be a bright idea to avoid the default file name “sitemap.xml” and keep the sitemap URLs out of robots.txt, sitemap autodiscovery is not for everyone.
Recap: Existing REP tags
REP tags are values that you can use in a page’s robots meta tag and X-Robots-Tag. Robots meta tags go to the HTML document’s HEAD section
<meta name="robots" content="noindex, follow, noarchive" />
whereas X-Robots-Tags supply the same information in the HTTP header
X-Robots-Tag: noindex, follow, noarchive
and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.
- Widely supported REP tags are:
- INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not
- FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not
- ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
- NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.
- NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.
- NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
- NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
- UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes an URL from Google’s search index a day after the given date/time
Problems with REP tags in robots.txt
REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but also comes with more complexity and a few pitfalls as well.
Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can’t obey REP tags without crawling the resource that supplies them. That means that not a single REP tag put as robots.txt statement shall be misunderstood as crawler directive.
For example Noindex: /path must not block crawling, not even in combination with Nofollow: /path, because there’s still the implicit “archive” (= absence of Noarchive: /path). Providing a cached copy even of a not indexed page makes sense for toolbar users.
Whether or not a search engine actually crawls a resource that’s tagged with “noindex, nofollow, noarchive, nosnippet” or so is up to the particular SE, but none of those values implies a Disallow: /path.
Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says “follow” for all links on a page, the crawler will not follow a link that is condomized with rel=”nofollow”.
Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Of course not in any case. Robots.txt is the gatekeeper, and so to say the “highest REP instance”. Actually, to this question there’s no absolute answer that satisfies everybody.
A Webmaster sitting on a huge conglomerate of legacy code may want to totally switch to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page’s “index,follow” from 1998 most probably doesn’t fly with 2008’s duplicate content filters and high sophisticated link pattern analytics.
The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.
REP tags used in robots.txt might refine crawler directives. For example a disallow’ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank caluculation and indexing for uncrawlable resources (see below under Noodp:/Noydir:, Noindex: and Norank:).
Google’s current approach to handle this with the Noindex: directive alone is not acceptable, we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.
In theory it makes no sense to maintain a directive that describes a default behavior. But why has the REP “follow” although the absence of “nofollow” perfectly expresses “follow”? Because of the way non-geeks think (try to explain why the value nil/null doesn’t equal empty/zero/blank to a non-geek. Not!).
Implicit directives that aren’t explicitely named and described in the rules don’t exist for the masses. Even in the 10 commandments someone had to write “thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack…” instead of a no-brainer like “publish unique and compelling content for people and make your stuff crawlable”. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow: respectively the absence of Disallow:) folks will whine, rant and argue when search engines ignore their stuff.
Obviously we need at least Index:, Follow: and Archive to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like
Whether or not indexer directives that require crawling can overrule the crawler directive Disallow: is open for discussion. I vote for “not”.
Applying REP tags on site-level would be great, but it doesn’t solve other problems like the need of directives on block and element level. Both Google’s section targeting as well as Yahoo’s robots-nocontent class name aren’t acceptable tools capable to instruct search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).
Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that’s doable, Webmasters will make use of such tools to improve their rankings, and not alone to comply to the ever changing search engine policies that cost the Webmaster community billions of man hours each year.
I consider these robots.txt statements sexy:
Nofollow a.advertising, div#adblock, span.cross-links: /path
Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path
but that’s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.
Remember the rel-nofollow fiasco where a REP tag was used on HTML element level producing so much confusion and conflicts. Lets learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it’s clear and unambiguous.
The REP’s command hierarchy must be well defined:
- Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.
- [Future block level directives]
- Element level directives like rel-nofollow
That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt. As long as the Webmaster doesn’t revert the latter:
Default behavior, directives in robots meta tags overrule robots.txt statements. Necessary to reset previous Priority-site-level: statements.
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.
Priority-site-level All: /path
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The “All” parameter would even dofollow nofollow’ed links when the robots.txt lacks corresponding Nofollow: statements.
Follow outgoing links, archive the page, but don’t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.
[Currently Google doesn’t crawl Noindex’ed URLs and most probably those can’t accumulate PageRank, hence URLs in /path can’t distribute PageRank. That’s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there’s no explicit Nofollow:, nor a “nofollow” meta tag respectively X-Robots-Tag.]
Prevents URLs from accumulating PageRank, anchor text, and whatever link juice.
Makes sense to refine Disallow: statements in company with Noindex: and Noodp:/Noydir:, or to prevent TOS/contact/privacy/… pages and alike from sucking PageRank (nofollow’ing TOS links and stuff like that to control PageRank flow is fault-prone).
The uber-link-condom. Don’t use outgoing links, not even internal links, for discovery crawling. Don’t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).
Don’t make a cached copy of the resource available to searchers.
List the resource with linked page title on SERPs, but don’t create a text snippet, and don’t reprint the description meta tag.
[Why don’t we have a REP tag saying “use my description meta tag or nothing”?]
Don’t create/link an HTML preview of this resource. That’s interesting for subscriptions sites and applies mostly to PDFs, Word documents, spread sheets, presentations, and other non-HTML resources. More information here.
Don’t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn’t supply its own title/meta description.
I’m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: “Don’t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.”
Unavailable_after [date]: /path
Deindex the resource the day after [date]. The parameter [date] is put in any date or date/time format, if it lacks a timezone then GMT is assumed.
[Google’s RFC 850 obsession is somewhat weird. There are many ways to put a timestamp other than “25-Aug-2007 15:00:00 EST”.]
Truncate-variable [string|pattern]: /path
Truncate-value [string|pattern]: /path
In the search index remove the unwanted variable/value pair(s) from the URL’s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this “bare URL” redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.
Regardless whether the variable name or the variable’s value matches the pattern, “Truncate_*” statements remove a complete argument from the query string, that is
&variable=value. If after the (last) truncate operation the query string is empty, the querystring delimiter “?” (questionmark) must be removed too.
Order-arguments [charset]: /path
Sort the query strings of all dynamic URLs by variable name, then within the ordered variables by their values. Pick the first URL from each set of identical results as canonical URL. Transfer PageRank etcetera from all dupes to the canonical URL.
Lots of sites out there were developed by coders who are utterly challenged by all things SEO. Most Web developers don’t even know what URL canonicalization means. Those sites suffer from tons of URLs that all serve identical contents, just because the query string arguments are put in random order, usually inventing a new sequence for each script, function, or include file. Of course most search engines run high sophisticated URL canonicalization routines to prevent their indexes from too much duplicate content, but those algos can fail because every Web site is different.
I totally can resist to suggest a
Canonical-uri /: /Default.asp statement that gathers all IIS default-document-URI maladies. Also, case issues shouldn’t get fixed with
Case-insensitive-uris: / but by the clueless developers in Redmond.
Will all this come true?
Well, Google has silently started to support REP tags in robots.txt, it totally makes sense both for search engines as well as for Webmasters, and Joe Webmaster’s life would be way more comfortable having REP tags for robots.txt.
A better question would be “will search engines implement REP tags for robots.txt in a way that Webmasters can live with it?”. Although Google launched the sitemaps protocol without significant help from the Webmaster community, I strongly feel that they desperately need our support with this move.
Currently it looks like they will fuck up the REP, respectively the robots.txt standard, hence go grab your AdWords rep and choke her/him until s/he promises to involve Larry, Sergey, Matt, Adam, John, and the whole Webmaster Support Team for the sake of common sense and the worldwide Webmaster community. Thank you!
Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to Entries Comments All Comments