Robots Meta Tags

Archived posts from the 'Robots Meta Tags' Category

How do Majestic and LinkScape get their raw data?

Posted on 21 January, 2010

LinkScape data acquisition Does your built-in bullshit detector cry in agony when you read announcements of link analysis tools claiming to have crawled Web pages in the trillions? Can a tiny SEO shop, or a remote search engine in its early stages running on donated equipment, build an index of that size? It took Google a decade to reach these figures, and Google’s webspam team alone outnumbers the staff of SEOmoz and Majestic, not to speak of infrastructure.

Well, it’s not as shady as you might think, although there’s some serious bragging and willy whacking involved.

First of all, both SEOmoz and Majestic do not own an indexed copy of the Web. They process markup just to extract hyperlinks. That means they parse Web resources, mostly HTML pages, to store linkage data. Once each link and its attributes (HREF and REL values, anchor text, …) are stored under a Web page’s URI, the markup gets discarded. That’s why you can’t search these indexes for keywords. There’s no full text index necessary to compute link graphs.

Majestic index size The storage requirements for the Web’s link graph are way smaller than for a full text index that major search engines have to handle. In other words, it’s plausible.

Majestic clearly describes this process, and openly tells that they index only links.

With SEOmoz that’s a completely different story. They obfuscate information about the technology behind LinkScape to a level that could be described as near-snake-oil. Of course one could argue that they might be totally clueless, but I don’t buy that. You can’t create a tool like LinkScape being a moron with an IQ slighly below an amoeba. As a matter of fact, I do know that LinkScape was developed by extremely bright folks, so we’re dealing with a misleading sales pitch:

Linkscape index size

Let’s throw in a comment at Sphinn, where a SEOmoz rep posted “Our bots, our crawl, our index“.

Of course that’s utter bullshit. SEOmoz does not have the resources to accomplish such a task. In other words, if -and that’s a big IF- they do work as described above, they’re operating something extremely sneaky that breaks Web standards and my understanding of fairness and honesty. Actually, that’s not so, but because it is not so, LinkScape and OpenSiteExplorer in its current shape must die (see below why).

They do insult your intelligence as well as mine, and that’s obviously not the right thing to do, but I assume they do it solely for marketing purposes. Not that they need to cover up their operation with a smokescreen like that. LinkScape could succeed with all facts on the table. I’d call it a neat SEO tool, if it just would be legit.

So what’s wrong with SEOmoz’s statements above, and LinkScape at all?

Let’s start with “Crawled in the past 45 days: 700 billion links, 55 billion URLs, 63 million root domains”. That translates to “crawled … 55 billion Web pages, including 63 million root index pages, carrying 700 billion links”. 13 links per page is plausible. Crawling 55 billion URIs requires sending out HTTP GET requests to fetch 55 billion Web resources within 45 days, that’s roughly 30 terabyte per day. Plausible? Perhaps.

True? Not as is. Making up numbers like “crawled 700 billion links” suggests a comprehensive index of 700 billion URIs. I highly doubt that SEOmoz did ‘crawl’ 700 billion URIs.

When SEOmoz would really crawl the Web, they’d have to respect Web standards like the Robots Exclusion Protocol (REP). You would find their crawler in your logs. An organization crawling the Web must

do that with a user agent that identifies itself as crawler, for example “Mozilla/5.0 (compatible; Seomozbot/1.0; +http://www.seomoz.com/bot.html)”,
fetch robots.txt at least daily,
provide a method to block their crawler with robots.txt,
respect indexer directives like “noindex” or “nofollow” both in META elements as well as in HTTP response headers.

SEOmoz obeys only <META NAME="SEOMOZ" CONTENT="NOINDEX" />, according to their sources page. And exactly this page reveals that they purchase their data from various services, including search engines. They do not crawl a single Web page.

Savvy SEOs should know that crawling, parsing, and indexing are different processes. Why does SEOmoz insist on the term “crawling”, taking all the flak they can get, when they obviously don’t crawl anything?

Two claims out of three in “Our bots, our crawl, our index” are blatant lies. If SEOmoz performs any crawling, in addition to processing bought data, without following and communicating the procedure outlined above, that would be sneaky. I really hope that’s not happening.

As a matter of fact, I’d like to see SEOmoz crawling. I’d be very, very happy if they would not purchase a single byte of 3rd party crawler results. Why? Because I could block them in robots.txt. If they don’t access my content, I don’t have to worry whether they obey my indexer directives (robots meta ‘tag’) or not.

As a side note, requiring a “SEOMOZ” robots META element to opt out of their link analysis is plain theft. Adding such code bloat to my pages takes a lot of time, and that’s expensive. Also, serving an additional line of code in each and every HEAD section sums up to a lot of wasted bandwidth -$$!- over time. Am I supposed to invest my hard earned bucks just to prevent me from revealing my outgoing links to my competitors? For that reason alone I should report SEOmoz to the FTC requesting them to shut LinkScape down asap.

They don’t obey the X-Robots-Tag (”noindex”/”nofollow”/… in the HTTP header) for a reason. Working with purchased data from various sources they can’t guarantee that they even get those headers. Also, why the fuck should I serve MSNbot, Slurp or Googlebot an HTTP header addressing SEOmoz? This could put my search engine visibility at risk.

If they’d crawl themselves, serving their user agent a “noindex” X-Robots-Tag and a 403 might be doable, at least when they pay for my efforts. With their current setup that’s technically impossible. They could switch to 80legs.com completely, that’ll solve the problem, provided 80legs works 100% by the REP and crawls as “SEOmozBot” or so.

With MajesticSEO that’s not an issue, because I can block their crawler withUser-agent: MJ12bot Disallow: /

Yahoo’s site explorer also delivers too much data. I can’t block it without losing search engine traffic. Since it will probably die when Microsoft overtakes search.yahoo.com, I don’t rant much about it. Google and Bing don’t reveal my linkage data to everyone.

I have an issue with SEOmoz’s LinkScape, and OpenSiteExplorer as well. It’s serious enough that I say they have to close it, if they’re not willing to change their architecture. And that has nothing to do with misleading sales pitches, or arrogant behavior, or sympathy (respectively, a possibly lack of sympathy).

The competitive link analysis OpenSiteExplorer/LinkScape provides, without giving me a real chance to opt out, puts my business at risk. As much as I appreciate an opportunity to analyze my competitors, vice versa it’s downright evil. Hence just kill it.

Is my take too extreme? Please enlighten me in the comments.

Update: A follow-up post from Michael VanDeMar and its Sphinn discussion, the first LinkScape thread at Sphinn, and Sphinn comments to this pamphlet.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

56 comments Sebastian | Tools, X-Robots-Tag, Robots Meta Tags, Crawler Directives, robots.txt, SEO

How brain-amputated developers created the social media plague

Posted on 12 January, 2010

The bot playground commonly refered to as “social media” is responsible for shitloads of absurd cretinism.

Twitter Bot Playground For example Twitter, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate 27% valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand for cheap viagra).

Bazillions of other bots [type C] retweet bot [type B] generated crap and create lists of bots [type A, B, C]. In rare cases when a non-bot tries to participate in Twitter, the uber-bot [type T] prevents the whole bot network from negative impacts by serving a 503 error to the homunculus’ browser.

This pamphlet is about the idiocy of a particular subclass of bots [type S] that sneakily work in the underground stealing money from content producers, and about their criminal (though brain-dead) creators. May they catch the swine flu, or at least pox or cholera, for the pest they’ve brought to us.

The Twitter pest that costs you hard earned money

WTF I’m ranting about? The technically savvy reader, familiar with my attitude, has already figured out that I’ve read way too many raw logs. For the sake of a common denominator, I encourage you to perform a tiny real-world experiment:

Publish a great and linkworthy piece of content.
Tweet its URI (not shortened - message incl. URI ≤ 139 characters!) with a compelling call for action.
Watch your server logs.
Puke. Vomit increases with every retweet.

So what happens on your server? A greedy horde of bots pounces on every tweet containing a link, requesting its content. That’s because on Twitter all URIs are suspected to be shortened (learn why Twitter makes you eat shit). This uncalled-for -IOW abusive- bot traffic burns your resources, and (with a cheap hosting plan) it can hinder your followers to read your awesome article and prevent them from clicking on your carefully selected ads.

Those crappy bots not only cost you money because they keep your server busy and increase your bandwidth bill, they actively decrease your advertising revenue because your visitors hit the back button when your page isn’t responsive due to the heavy bot traffic. Even if you’ve great hosting, you probably don’t want to burn money, not even pennies, right?

Bogus Twitter apps and their modus operandi

If only every Twitter&Crap-mashup would lookup each URI once, that wouldn’t be such a mess. Actually, some of these crappy bots request your stuff 10+ times per tweet, and again for each and every retweet. That means, as more popular your content becomes, as more bot traffic it attracts.

Most of these bots don’t obey robots.txt, that means you can’t even block them applying Web standards (learn how to block rogue bots). Topsy, for example, does respect the content producer — so morons using “Python-urllib/1.17″ or “AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)” could obey the Robots Exclusion Protocol (REP), too. Their developers are just too fucking lazy to understand such protocols that every respected service on the Web (search engines…) obeys.

Some of these bots even provide an HTTP_REFERER to lure you into viewing the website operated by their shithead of developer when you’re viewing your referrer stats. Others fake Web browsers in their user agent string, just in case you’re not smart enough to smell shit that really stinks (IOW browser-like requests that don’t fetch images, CSS files, and so on).

One of the worst offenders is outing itself as “ThingFetcher” in the user agent string. It’s hosted by Rackspace, which is a hosting service that obviously doesn’t care much about its reputation. Otherwise these guys would have reacted to my various complaints WRT “ThingFetcher”. By the way, Robert Scoble represents Rackspace, you could drop him a line if ThingFetcher annoys you, too.

ThingFetcher sometimes requests a (shortened) URI 30 times per second, from different IPs. It can get worse when a URI gets retweeted often. This malicious piece of code doesn’t obey robots.txt, and doesn’t cache results. Also, it’s too dumb to follow chained redirects, by the way. It doesn’t even publish its results anywhere, at least I couldn’t find the fancy URIs I’ve feeded it with in Google’s search index.

In ThingFetcher’s defense, its developer might say that it performs only HEAD requests. Well, it’s true that HEAD request provoke only an HTTP response header. But: the script invoked gets completely processed, just the output is trashed.

That means, the Web server has to deal with the same load as with a GET request, it just deletes the content portion (the compelety formatted HTML page) when responding, after counting its size to send the Content-Length response header. Do you really believe that I don’t care about machine time? For each of your ~~utterly useless~~ bogus requests I could have my server deliver ads to a human visitor, who pulls the plastic if I’m upselling the right way (I do, usually).

Unfortunately, ThingFetcher is not the only bot that does a lookup for each URI embedded in a tweet, per tweet processed. Probably the overall number of URIs that appear only once is bigger than the number of URIs that appear quite often while a retweet campaign lasts. That means that doing HTTP requests is cheaper for the bot’s owner, but on the other hand that’s way more expensive for the content producer, and the URI shortening services involved as well.

ThingFetcher update: The owners of ThingFetcher are now aware of the problem, and will try to fix it asap (more information). Now that I know who’s operating the Twitter app owning ThingFetcher, ~~I take back the insults above~~ I’ve removed some insults from above, because they’d no longer address an anonymous developer, but bright folks who’ve just failed once. Too sad that Brizzly didn’t reply earlier to my attempts to identify ThingFetcher’s owner.

As a content producer I don’t care about the costs of any Twitter application that processes Tweets to deliver anything to its users. I care about my costs, and I can perfecly live without such a crappy service. Liberally, I can allow one single access per (shortened) URI to figure out its final destination, but I can’t tolerate such thoughtless abuse of my resources.

Every Twitter related “service” that does multiple requests per (shortened) URI embedded in a tweet is guilty of theft and pilferage. Actually, that’s an understatement, because these raids cost publishers an enormous sum across the Web.

These fancy apps shall maintain a database table storing the destination of each redirect (chain) acessible by its short URI. Or leave the Web, respectively pay the publishers. And by the way, Twitter should finally end URI shortening. Not only it breaks the Internet, it’s way too expensive for all of us.

A few more bots that need a revamp, or at least minor tweaks

I’ve added this section to express that besides my prominent example above, there’s more than one Twitter related app running not exactly squeaky clean bots. That’s not a “worst offenders” list, it’s not complete (I don’t want to reprint Twitter’s yellow pages), and bots are listed in no particular order (compiled from requests following the link in a test tweet, evaluating only a snapshot of less than 5 minutes, backed by historized logs.)

Skip examples

Tweetmeme’s TweetmemeBot coming from eagle.favsys.net doesn’t fetch robots.txt. On their site they don’t explain why they don’t respect the robots exclusion protocol (REP). Apart from that it behaves.

OneRiot’s bot OneRiot/1.0 totally proves that this real time search engine has chosen a great name for itself. Performing 5+ GET as well as HEAD requests per link in a tweet (sometimes more) certainly counts as rioting. Requests for content come from different IPs, the host name pattern is flx1-ppp*.lvdi.net, e.g. flx1-ppp47.lvdi.net. From the same IPs comes another bot: Me.dium/1.0, me.dium.com redirects to oneriot.com. OneRiot doesn’t respect the REP.

Microsoft/Bing runs abusive bots following links in tweets, too. They fake browsers in the user agent, make use of IPs that don’t obviously point to Microsoft (no host name, e.g. 65.52.19.122, 70.37.70.228 …), send multiple GET requests per processed tweet, and don’t respect the REP. If you need more information, I’ve ranted about deceptive M$-bots before. Just a remark in case you’re going to block abusive MSN bot traffic:

MSN/Bing reps ask you not to block their spam bots when you’d like to stay included in their search index (that goes for real time search, too), but who really wants that? Their search index is tiny -compared to other search engines like Yahoo and Google-, their discovery crawling sucks -to get indexed you need to submit your URIs at their webmaster forum-, and in most niches you can count your yearly Bing SERP referrers using not even all fingers of your right hand. If your stats show more than that, check your raw logs. You’ll soon figure out that MSN/Bing spam bots fake SERP traffic in the HTTP_REFERER (guess where their “impressive” market share comes from).

FriendFeed’s bot FriendFeedBot/0.1 is well explained, and behaves. Its bot page even lists all its IPs, and provides you with an email addy for complaints (I never had a reason to use it). The FriendFeedBot made it on this list just because of its lack of REP support.

PostRank’s bot PostRank/2.0 comes from Amazon IPs. It doesn’t respect the REP, and does more than one request per URI found in one single tweet.

MarkMonitor operates a bot faking browser requests, coming from *.embarqhsd.net (va-71-53-201-211.dhcp.embarqhsd.net, va-67-233-115-66.dhcp.embarqhsd.net, …). Multiple requests per URI, no REP support.

Cuil’s bot provides an empty user agent name when following links in tweets, but fetches robots.txt like Cuil’s offical crawler Twiceler. I didn’t bother to test whether this Twitter bot can be blocked following Cuil’s instructions for webmasters or not. It got included in this list for the supressed user agent.

Twingly’s bot Twingly Recon coming from *.serverhotell.net doesn’t respect the REP, doesn’t name its owner, but does only few HEAD requests.

Many bots mimicking browsers come from Amazon, Rackspace, and other cloudy environments, so you can’t get hold of their owners without submitting a report-abuse form. You can identify such bots by sorting your access logs by IP addy. Those “browsers” which don’t request your images, CSS files, and so on, are most certainly bots. Of course, a human visitor having cached your images and CSS matches this pattern, too. So block only IPs that solely request your HTML output over a longer period of time (problematic with bots using DSL providers, AOL, …).

Blocking requests (with IPs belonging to consumer ISPs, or from Amazon and other dynamic hosting environments) with a user agent name like “LWP::Simple/5.808″, “PycURL/7.18.2″, “my6sense/1.0″, “Firefox” (just these 7 characters), “Java/1.6.0_16″ or “libwww-perl/5.816″ is sound advice. By the way, these requests sum up to an amount that would lead a “worst offenders” listing.

Then there are students doing research. I’m not sure I want to waste my resources on requests from Moscow’s “Institute for System Programming RAS”, which fakes unnecessary loads of human traffic (from efrate.ispras.ru, narva.ispras.ru, dvina.ispras.ru …), for example.

When you analyze bot traffic following a tweet with many retweets, you’ll gather a way longer list of misbehaving bots. That’s because you’ll catch more 3rd party Twitter UIs when many Twitter users view their timeline. Not all Twitter apps route their short URI evaluation through their servers, so you might miss out on abusive requests coming from real users via client sided scripts.

Developers might argue that such requests “on behalf of the user” are neither abusive, nor count as bot traffic. I assure you, that’s crap, regardless a particular Twitter app’s architecture, when you count more than one evaluation request per (shortened) URI. For example Googlebot acts on behalf of search engine users too, but it doesn’t overload your server. It fetches each URI embedded in tweets only once. And yes, it processes all tweets out there.

How to do it the right way

Here is what a site owner can expect from a Twitter app’s Web robot:

A meaningful user agent

A Web robot must provide a user agent name that fulfills at least these requirements:

A unique string that identifies the bot. The unique part of this string must not change when the version changes (”somebot/1.0″, “somebot/2.0″, …).
A URI pointing to a page that explains what the bot is all about, names the owner, and tells how it can be blocked in robots.txt (like this or that).
A hint on the rendering engine used, for example “Mozilla/5.0 (compatible; …”.

A method to verify the bot

All IP addresses used by a bot should resolve to server names having a unique pattern. For example Googlebot comes only from servers named "crawl" + "-" + replace($IP, ".", "-") + ".googlebot.com", e.g. “crawl-66-249-71-135.googlebot.com”. All major search engines follow this standard that enables crawler detection not solely relying on the easily spoofable user agent name.

Obeying the robots.txt standard

Webmasters must be able to steer a bot with crawler directives in robots.txt like “Disallow:”. A Web robot should fetch a site’s /robots.txt file before it launches a request for content, when it doesn’t have a cached version from the same day.

Obeying REP indexer directives

Indexer directives like “nofollow”, “noindex” et cetera must be obeyed. That goes for HEAD requests just chasing for a 301/302/307 redirect response code and a “location” header, too.

Indexer directives can be served in the HTTP response header with an X-Robots-Tag, and/or in META elements like the robots meta tag, as well as in LINK elements like rel=canonical and its corresponding headers.

Responsible behavior

As outlined above, requesting the same resources over and over doesn’t count as responsible behavior. Fetching or “HEAD’ing” a resource no more than once a day should suffice for every Twitter app’s needs.

Respecting copyrights

Reprinting a page’s content, or just large quotes, doesn’t count as fair use. It’s Ok to grab the page title and a summary from a META element like “description” (or up to 250 characters from an article’s first paragraph) to craft links, for example - but not more! Also, showing images or embedding videos from the crawled page violates copyrights.

Conclusion, and call for action

If you suffer from rogue Twitter bot traffic, use the medium those bots live in to make their sins public knowledge. Identify the bogus bot’s owners and tweet the crap out of them. Lookup their hosting services, find the report-abuse form, and submit your complaints. Most of these apps make use of the Twitter-API, there are many spam report forms you can creatively use to ruin their reputation at Twitter. If you’ve an account at such a bogus Twitter app, then cancel it and encourage your friends to follow suit.

Don’t let the assclowns of the Twitter universe get away with theft!

I’d like to hear about particular offenders you’re dealing with, and your defense tactics as well, in the comments. Don’t be shy. Go rant away. Thanks in advance!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

10 comments Sebastian | Social Web, X-Robots-Tag, Redirects, URI shortening, Trolling, Web development, Crap, Crawler Directives, Robots Meta Tags, Twitter, robots.txt

About the bad taste of shameless ego food

Posted on 30 January, 2009

Seems I’ve made it on the short-list at the SEMMY 2009 Awards in the Search Tech category. Great ego food. I’m honored. Thanks for nominating me that often! And thanks to John Andrews and Todd Mintz for the kind judgement!

Now that you’ve read the longish introduction, why not click here and vote for my pamphlet?

Ok Ok Ok, it’s somewhat technically and you perhaps even consider it plain geek food. However, it’s hopefully / nevertheless useful for your daily work. BTW … I wish more search engine engineers would read it. It could help them to tidy up their flawed REP support.

Does this post smell way too selfish? I guess it does, but I’ll post it nonetheless coz I’m fucking keen on your votes. Thanks in advance!

Wow, I won! Thank you all!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

2 comments Sebastian | X-Robots-Tag, Ego Food, URL removal, Robots Meta Tags, robots.txt, SEO

Crawling vs. Indexing

Posted on 20 October, 2008

Sigh. I just have to throw in my 2 cents.

Crawling means sucking content without processing the results. Crawlers are rather dumb processes that fetch content supplied by Web servers answering (HTTP) requests of requested URIs, delivering those contents to other processes, e.g. crawling caches or directly to indexers. Crawlers get their URIs from a crawling engine that’s feeded from different sources, including links extracted from previously crawled Web documents, URI submissions, foreign Web indexes, and whatnot.

Indexing means making sense out of the retrieved contents, storing the processing results in a (more or less complex) document index. Link analysis is a way to measure URI importance, popularity, trustworthiness and so on. Link analysis is often just a helper within the indexing process, sometimes the end in itself, but traditionally a task of the indexer, not the crawler (high sophisticated crawling engines do use link data to steer their crawlers, but that has nothing to do with link analysis in document indexes).

A crawler directive like “disallow” in robots.txt can direct crawlers, but means nothing to indexers.

An indexer directive like “noindex” in an HTTP header, an HTML document’s HEAD section, or even a robots.txt file, can direct indexers, but means nothing to crawlers, because the crawlers have to fetch the document in order to enable the indexer to obey those (inline) directives.

So when a Web service offers an indexer directive like <meta name="SEOservice" content="noindex" /> to keep particular content out of its index, but doesn’t offer a crawler directive like User-agent: SEOservice Disallow: /, this Web service doesn’t crawl.

That’s not about semantics, that’s about Web standards.

Whether or not such a Web service can come with incredible values squeezed out of its index gathered elsewhere, without crawling the Web itself, is a completely different story.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

15 comments Sebastian | X-Robots-Tag, Robots Meta Tags, Crawler Directives, robots.txt, SEO

@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!

Posted on 25 February, 2008

Dear Google, please respect NOINDEX Matt Cutts asks us How should Google handle NOINDEX? That’s a tough question worth thinking twice before you submit a comment to Matt’s post. Here is Matt’s question, all the background information you need, and my opinion.

What is NOINDEX?

Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in robots meta tags. Putting a NOINDEX value in a page’s robots meta tag or X-Robots-Tag tells search engines that they shall not index the page content, but may follow links provided on the page.

To get a grip on NOINDEX’s role in the REP please read my Robots Exclusion Protocol summary at SEOmoz. Also, Google experiments with NOINDEX as crawler directive in robots.txt, more on that later.

How major search engines treat NOINDEX

Of course you could read a ton of my pamphlets to extract this information, but Matt’s summary is still accurate and easier to digest:

[Matt Cutts on August 30, 2006]
Google doesn’t show the page in any way.

Ask doesn’t show the page in any way.

MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.

Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.

Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]

Google’s experimental support of NOINDEX as crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion the DISALLOW behavior of NOINDEX in robots.txt is completely wrong, and without any doubt in no way compliant to the Robots Exclusion Protocol.

Matt’s question: How should Google handle NOINDEX in the future?

To simplify Matt’s poll, lets assume he’s talking about NOINDEX as indexer directive, regardless where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

Here are the arguments, or pros and cons, for each variant:

Google should completely drop a NOINDEX’ed page from their search results

Obviously that’s what most Webmasters would prefer:

This is the behavior that we’ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way — in fact, one of the only ways — to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters.

NOINDEX means don’t index, search engines must respect such directives, even when the content isn’t password protected or cloaked away (redirected or hidden for crawlers but not for visitors).

The corner case that Google discovers a link and lists it on their SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn’t crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which is requested by search engines quite frequently. Ok, maybe Google’s BlitzCrawler™ has to request robots.txt more often then.

Google should show a reference to NOINDEX’ed pages on their SERPs

Search quality and user experience are strong arguments:

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).

Search quality and searchers’ user experience is also a strong argument for totally delisting NOINDEX’ed pages, because most Webmasters use this indexer directive to keep stuff that doesn’t provide value for searchers out of the search indexes. <polemic>I mean, how much weight have a few Korean sites when it comes to decisions that affect the whole Web?</polemic>

If a Webmaster puts a NOINDEX directive by accident, that’s easy to spot in the site’s stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.

The reference to the URI removal tools is out of context, because these tools remove an URI only for a short period of time and all removal requests have to be resubmitted repeatedly every few weeks. NOINDEX on the other hand is a way to keep an URI out of the index as long as this crawler directive is provided.

I’d say the sole argument for listing references to NOINDEX’ed pages that counts is misleading navigational searches. Of course that does not mean that Google may ignore the NOINDEX directive to show -with a linked reference- that they know a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.

Something in between, Google should find a reasonable way to please both Webmasters and searchers

Quoting Matt again:

The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.

Whether or not a site is popular must not impact a search engine’s respect for a Webmaster’s decision to keep search engines, and their users, out of her realm. That reads like “Hey, Google is popular, so we’ve the right to go to Mountain View to pillage the Googleplex, acquiring everything we can steal for the public domain”. Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google’s idea of (link) popularity should rule the Web.

Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web’s content wasn’t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I’m not bashing DMOZ here, but as a matter of fact it’s not comprehensive enough to serve as indicator for anything, especially not importance and popularity.

I strongly believe that there’s no such thing as a criterion suitable to mark out a two class Web.

My take: Yes, No, Depends

Google could enhance navigational queries -and even “I feel lucky” queries- that lead to a NOINDEX’ed page with a message like “The best matching result for this query was blocked by the site”. I wouldn’t mind if they mention the URI as long as it’s not linked.

In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we’ve more REP indexer directives that could play a role in this discussion.

NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site’s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX’ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.

Of course that’s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it’s way better than ignoring NOINDEX in its current meaning.

A fair solution: NOREFERENCE

If I’d make the decision at Google and couldn’t live with a best matching search result blocked message, I’d go for a new REP tag:

“NOINDEX, NOREFERENCE” in a robots meta tag -respectively Googlebot meta tag- or X-Robots-Tag forbids search engines to show a reference on their SERPs. In robots.txt this would look like NOINDEX: / NOINDEX: /blog/ NOINDEX: /members/ … NOREFERENCE: / NOREFERENCE: /blog/ NOREFERENCE: /members/ …
Search engines would crawl these URIs, and follow their links as long as there’s no NOFOLLOW directive either in robots.txt or a page specific instruction.

NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but allows references on SERPs. Supporting this indexer directive both in robots.txt as well as on-the-page (respectively in the HTTP header for X-Robots-Tags) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like NOINDEX=NOREFERENCE for robots.txt could tell search eniges how they have to treat NOINDEX statements on site level, or even on site area level.

Even more appealing would be NOINDEX=REFERENCE, because only the very few Webmasters that would like to see their NOINDEX’ed URIs on Google’s SERPs would have to add a directive to their robots.txt at all. Unfortunately, that’s not doable for Google unless they can convice three well known Korean sites to edit their robots.txt.

By the way, don’t miss out on my draft asking for REP tag support in robots.txt!

Anyway: Dear Google, please don’t touch NOINDEX!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

6 comments Sebastian | URL removal, Search Quality, X-Robots-Tag, Robots Meta Tags, Crawler Directives, SEO, robots.txt, Google

Get a grip on the Robots Exclusion Protocol (REP)

Posted on 17 January, 2008

Thanks to the very nice folks over at SEOmoz I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round up:

Robots Exclusion Protocol 101

My REP 101 links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of each REP component. It explains the difference between crawler directives and indexer directives, and which command hierarchy search engines follow when REP directives put in different levels conflict.

Educate yourself on the REP Why do I think that solid REP knowledge is important right now? Not only because of the confusion that exists thanks to the volume of crappy advice provided at every Webmaster hangout. Of course understanding the REP makes webmastering easier, thus I’m glad when my REP related pamphlets are considered somewhat helpful.

I’ve a hidden agenda, though. I predict that the REP is going to change shortly. As usual, its evolvement is driven by a major search engine, since the W3C and such organizations don’t bother with the conglomerate of quasi standards and RFCs known as the Robots Exclusion Protocol. In general that’s not a bad thing. Search engines deal with the REP every day, so they have a legitimate interest.

Unfortunately not every REP extension that search engines have invented so far is useful for Webmasters, some of them are plain crap. Learning from fiascos and riots of the past, the engines are well advised to ask Webmasters for feedback before they announce further REP directives.

I’ve a feeling that shortly a well known search engine will launch a survey regarding particular REP related ideas. I want that Webmasters are well aware of the REP’s complexity and functionality when they contribute their take on REP extensions. So please educate yourself.

My pamphlet discussing a possible standardization of REP tags as robots.txt directives could be a useful reference, also please watch the great video here.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

2 comments Sebastian | URL removal, Web development, X-Robots-Tag, Robots Meta Tags, Crawler Directives, SEO, robots.txt, XML-Sitemaps, Microformats

Getting URLs outta Google - the good, the popular, and the definitive way

Posted on 14 January, 2008

Keep out Google There’s more and more robots.txt talk in the SEOsphere lately. That’s a good thing in my opinion, because the good old robots.txt’s power is underestimated. Unfortunately it’s quite often misused or even abused too, usually because folks don’t fully understand the REP (by following “advice” from forums instead of reading the real thing, or at least my stuff ).

The good way: If the major search engines would support new robots.txt directives that Webmasters really need, removing even huge chunks of content from Google’s SERPs -without collateral damage- via robots.txt would be a breeze.
The popular way: Shamelessly stealing Matt’s official advice [Source: Remove your content from Google by Matt Cutts]. To obscure the blatant plagiarism, I’ll add a few thoughts.
The definitive way: Of course that’s not the ultimate way, but that’s the way Google’s cookies crumble, currently. In other words: Google is working on a leaner approach, but that’s not yet announced, thus you can’t use it; you still have to jump through many hoops.

The good way

Caution: Don’t implement code from this section, the robots.txt directives discussed here are not (yet/fully) supported by search engines!

Currently all robots.txt statements are crawler directives. That means that they can tell behaving search engines how to crawl a site (fetching contents), but they’ve no impact on indexing (listing contents on SERPs). I’ve recently published a draft discussing possible REP tags for robots.txt. REP tags are indexer directives known from robots meta tags and X-Robots-Tags, which -as on-page respectively per-URL directives- require crawling.

The crux is that REP tags must be assigned to URLs. Say you’ve a gazillion of printer friendly pages in various directories that you want to deindex at Google, putting the “noindex,follow,noarchive” tags comes with a shitload of work.

How cool would be this robots.txt code instead: Noindex: /*printable Noarchive: /*printable
Search engines would continue to crawl, but deindex previously indexed URLs respectively not index new URLs from /articles/printable/*.htm /manuals/printable/*.pdf /products/descriptions/*.php?format=printable&product=* ...
provided those URLs aren’t disallow’ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn’t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can’t accumulate PageRank from internal or external links, you’d add Norank: /*printable
to the robots.txt code block above.

If you don’t like that search engines index stuff you’ve disallow’ed in your robots.txt from 3rd party signals like inbound links, and that Google accumulates even PageRank for disallow’ed URLs, you’d put: Disallow: /unsearchable/ Noindex: /unsearchable/ Norank: /unsearchable/

To fix URL canonicalization issues with PHP session IDs and other tracking variables you’d write for example Truncate-variable sessionID: /
and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs.

Unfortunately, robots.txt is not yet that powerful, so please link to the REP tags for robotx.txt “RFC” to make it popular, and proceed with what you have at the moment.

The popular way

Matt Cutts was kind enough to discuss Google’s take on contents excluded from search engine indexing in 10 minutes or less here:

You really should listen, the video isn’t that long.

Don’t link (very weak): Although Google usually doesn’t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link from the cold.
.htaccess / .htpasswd (Matt’s first recommendation): Since Google cannot crawl password protected contents, Matt declares this method to prevent content from indexing safe. I’m not sure what will happen when I spread a few strong links to somebody’s favorite smut collection, perhaps I’ll test some day whether Google and other search engines list such a reference on their SERPs.
robots.txt (weak): Matt rightly points out that Google’s cool robots.txt validator in the Webmaster Console is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is, that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google’s smart capabilities to figure out suiteble references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn’t prevent content from showing up in Google and other engines as well. Matt didn’t mention Google’s experiments with Noindex: support in robots.txt, which need improvement but could resolve this dilemma.
Robots meta tags (Google only, weak with MSN/Yahoo): The REP tag “noindex” in a robots meta element prevents from indexing, and, once spotted, deindexes previously listed stuff - at least at Google. According to Matt Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys “noindex” totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too. Also, search engines must crawl a page to discover this indexer directive. Matt adds that robots meta tags are problematic, because they’re buried on the pages and sometimes tend to get forgotten when no longer needed (Webmasters ~~might~~ do forget to take the tag down, respectively add it later on when search engines policies change, or work in progress gets released respectively outdated contents are taken down). Matt forgot to mention the neat X-Robots-Tags that can be used to apply REP tags in the HTTP header of non-HTML resources like images or PDF documents. Google’s X-Robots-Tag is supported by Yahoo too.
Rel-nofollow (kind of weak): Although condoms totally remove links from Google’s link graphs, Matt says that rel-nofollow should not be used as crawler or indexer directive. Rel-nofollow is for condomizing links only, also other search engines do follow nofollow’ed links and even Google can discover the link destination from other links they gather on the Web, or grab from internal links inadvertently lacking a link condom. Finally, rel-nofollow requires crawling too.
URL removal tool in GWC (Matt’s second recommendation): Taking Matt’s enthusiasm while talking about Google’s neat URL terminator into account, this one should be considered his first recommendation. Google provides tools to remove URLs from their search index since five years at least (way longer IIRC). Recently the Webmaster Central team has integrated those, as well as new functionality, into the Webmaster Console, donating it a very nice UI. The URL removal tools come with great granularity, and because the user’s site ownership is verified, it’s pretty powerful, safe, and shows even the progress for each request (the removal process lasts a few days). Its UI is very flexible and allows even revoking of previous removal requests. The wonderful little tool’s sole weak point is that it can’t remove URLs from the search index forever. After 90 days or possibly six months the erased stuff can pop up again.

Summary: If your site isn’t password protected, and you can’t live with indexing of disallow’ed contents, you must remove unwanted URLs from Google’s search index periodically. However, there are additional procedures that can support -but not guarantee!- deindexing. With other search engines it’s even worse, because those don’t respect the REP like Google, and don’t provide such handy URL removal tools.

The definitive way

Actually, I think Matt’s advice is very good. As long as you don’t need a permanent solution, and if you lack the programming skills to develop such a beast that works with all (major) search engines. I mean everybody can insert a robots meta tag or robots.txt statement, and everybody can semiyearly repeat URL removal requests with the neat URL terminator, but most folks are scared when it comes to conditional manipulation of HTTP headers to prevent stuff from indexing. However, I’ll try to explain quite safe methods that actually work (with Apache, not IIS) in the following examples.

First of all, if you really want that search engines don’t index your stuff, you must allow them to crawl it. And no, that’s not an oxymoron. At the moment there’s no such thing as an indexer directive on site-level. You can’t forbid indexing in robots.txt. All indexer directives require crawling of the URLs that you want to keep out of the SERPs. Of course that doesn’t mean you should serve search engine crawlers a book from each forbidden URL.

Lets start with robots.txt. You put User-agent: * Disallow: /images/ Disallow: /movies/ Disallow: /unsearchable/ User-agent: Googlebot Disallow: Allow: / User-agent: Slurp Disallow: Allow: /
The first section is just a fallback.

(Here comes a rather brutal method that you can use to keep search engines out of particular directories. It’s not suitable to deal with duplicate content, session IDs, or other URL canonicalization. More on that later.)

Next edit your .htaccess file. <IfModule mod_rewrite.c> RewriteEngine On RewriteCond %{REQUEST_URI} ^/unsearchable/ RewriteCond %{REQUEST_URI} !\.php RewriteRule . /unsearchable/output-content.php [L] </IfModule>
If you’ve .php pages in /unsearchable/ then remove the second rewrite condition, put output-content.php into another directory, and edit my PHP code below so that it includes the PHP scripts (don’t forget to pass the query string).

Now grab the PHP code to check for search engine crawlers here and include it below. Your script /unsearchable/output-content.php looks like: <?php @include("crawler-stuff.php"); // defines variables and functions $isSpider = checkCrawlerIP ($requestUri); if ($isSpider) { @header("HTTP/1.1 403 Thou shalt not index this", TRUE, 403); @header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir"); exit; } $arr = explode("#", $requestUri); $outputFileName = $arr[0]; $arr = explode("?", $outputFileName); $outputFileName = $_SERVER["DOCUMENT_ROOT"] .$arr[0]; if (substr($outputFileName, -1, 1) == "/") { $outputFileName .= "index.html"; } if (file_exists($outputFileName)) { // send the content type header $contentType = "text/plain"; if (stristr($outputFileName, ".html")) $contentType ="text/html"; if (stristr($outputFileName, ".css")) $contentType ="text/css"; if (stristr($outputFileName, ".js")) $contentType ="text/javascript"; if (stristr($outputFileName, ".png")) $contentType ="image/png"; if (stristr($outputFileName, ".jpg")) $contentType ="image/jpeg"; if (stristr($outputFileName, ".gif")) $contentType ="image/gif"; if (stristr($outputFileName, ".xml")) $contentType ="application/xml"; if (stristr($outputFileName, ".pdf")) $contentType ="application/pdf"; @header("Content-type: $contentType"); @header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir"); readfile($outputFileName); exit; } // That’s not the canonical way to call the 404 error page. Don’t copy, adapt: @header("HTTP/1.1 307 Oups, I displaced $outputFileName", TRUE, 307); @header("Location: http://sebastians-pamphlets.com/404/"); exit; ?>

What does the gibberish above do? In .htaccess we rewrite all requests for resources stored in /unsearchable/ to a PHP script, which checks whether the request is from a search engine crawler or not.

If the requestor is a verified crawler (known IP or IP and host name belong to a major search engine’s crawling engine), we return an unfriendly X-Robots-Tag and an HTTP response code 403 telling the search engine that access to our content is forbidden. The search engines should assume that a human visitor receives the same response, hence they aren’t keen on indexing these URLs. Even if a search engine lists an URL on the SERPs by accident, it can’t tell the searcher anything about the uncrawled contents. That’s unlikely to happen actually, because the X-Robots-Tag forbids indexing (Ask and MSN might ignore these directives).

If the requestor is a human visitor, or an unknown Web robot, we serve the requested contents. If the file doesn’t exist, we call the 404 handler.

With dynamic content you must handle the query string and (expected) cookies yourself. PHP’s readfile() is binary safe, so the script above works with images or PDF documents too.

If you’ve an original search engine crawler coming from a verifiable server feel free to test it with this page (user agent spoofing doesn’t qualify as crawler, come back in a week or so to check whether the engines have indexed the unsearchable stuff linked above).

The method above is not only brutal, it wastes all the juice from links pointing to the unsearchable site areas. To rescue the PageRank, change the script as follows: … $urlThatDesperatelyNeedsPageRank = "http://sebastians-pamphlets.com/about/"; if ($isSpider) { @header("HTTP/1.1 301 Moved permanently", TRUE, 301); @header("Location: $urlThatDesperatelyNeedsPageRank"); exit; } …
This redirects crawlers to the URL that has won your internal PageRank lottery. Search engines will/shall transfer the reputation gained from inbound links to this page. Of course page by page redirects would be your first choice, but when you block entire directories you can’t accomplish this kind of granularity.

By the way, when you remove the offensive 403-forbidden stuff in the script above and change it a little more, you can use it to apply various X-Robots-Tags to your HTML pages, images, videos and whatnot. When a search engine finds an X-Robots-Tag in the HTTP header, it should ignore conflicting indexer directives in robots meta tags. That’s a smart way to steer indexing of bazillions of resources without editing them.

Ok, this was the cruel method; now lets discuss cases where telling crawlers how to behave is a royal PITA, thanks to the lack of indexer directives in robots.txt that provide the required granularity (Truncate-variable, Truncate-value, Order-arguments, …).

Say you’ve session IDs in your URLs. That’s one (not exactly elegant) way to track users or affiliate IDs, but strictly forbidden when the requestor is a search engine’s Web robot.

In fact, a site with unprotected tracking variables is a spider trap that would produce infinite loops in crawling, because spiders following internal links with those variables discover new redundant URLs with each and every fetch of a page. Of course the engines found suitable procedures to dramatically reduce their crawling of such sites, what results in less indexed pages. Besides joyless index penetration there’s another disadvantage - the indexed URLs are powerless duplicates that usually rank beyond the sonic barrier at 1,000 results per search query.

Smart search engines perform high sophisticated URL canonicalization to get a grip on such crap, but Webmasters can’t rely on Google & Co to fix their site’s maladies.

Ok, we agree that you don’t want search engines to index your ugly URLs, duplicates, and whatnot. To properly steer indexing, you can’t just block the crawlers’ access to URLs/contents that shouldn’t appear on SERPs. Search engines discover most of those URLs when following links, and that means that they’re ready to assign PageRank or other scoring of link popularity to your URLs. PageRank / linkpop is a ranking factor you shouldn’t waste. Every URL known to search engines is an asset, hence handle it with care. Always bother to figure out the canonical URL, then do a page by page permanent redirect (301).

For your URL canonicalization you should have an include file that’s available at the very top of all your scripts, executed before PHP sends anything to the user agent (don’t hack each script, maintaining so many places handling the same stuff is a nightmare, and fault-prone). In this include file put the crawler detection code and your individual routines that handle canonicalization and other search engine friendly cloaking routines.

View a Code example (stripping useless query string variables).<?php @include("crawler-stuff.php"); // defines variables and functions $isSpider = checkCrawlerIP ($requestUri); if ($isSpider) { $canonicalServerName = "sebastians-pamphlets.com"; $doRedirect = FALSE; $killVars = array("sessionID", "affiliateID", "format", "nav"); $arr = explode("#", $requestUri); $uri = $arr[0]; $arr = explode("?", $uri); $uri = $arr[0]; if ($queryString) { $qs = str_replace("&", "&", $queryString) ."&"; $qsArr = explode("&", $qs); $qs = ""; foreach ($qsArr as $qsArg) { $arr = explode("=", $qsArg); $var = $arr[0]; $val = $arr[1]; if (empty($val)) { $doRedirect = TRUE; } foreach ($killVars as $killVar) { if ("$killVar" == "$var") { $var = ""; $doRedirect = TRUE; } } if (!empty($var) && !empty($val)) { if (!empty($qs)) $qs .= "&"; $qs .= $var ."=" .$val; } } if (!empty($qs)) { $uri .= "?" .$qs; } } if ($doRedirect) { $canonicalUrl = "http://" .$canonicalServerName .$uri; @header("HTTP/1.1 301 Moved permanently", TRUE, 301); @header("Location: $canonicalUrl"); exit; } } // isSpider ?>

How you implement the actual canonicalization routines depends on your individual site. I mean, if you’ve not the coding skills necessary to accomplish that you wouldn’t read this entire section, wouldn’t you?

Here are a few examples of pretty common canonicalization issues:

Session IDs and other stuff used for user tracking
Affiliate IDs and IDs used to track the referring traffic source
Empty values of query string variables
Query string arguments put in different order / not checking the canonical sequence of query string arguments (ordering them alphabetically is always a good idea)
Redundant query string arguments
URLs longer than 255 bytes
Server name confusion, e.g. subdomains like “www”, “ww”, “random-string” all serving identical contents from example.com
Case issues (IIS/clueless code monkeys handling GET-variables/values case-insensitive)
Spaces, punctuation, or other special characters in URLs
Different scripts outputting identical contents
Flawed navigation, e.g. passing the menu item to the linked URL
Inconsistent default values for variables expected from cookies
Accepting undefined query string variables from GET requests
Contentless pages, e.g. outputted templates when the content pulled from the database equals whitespace or is not available
…

Summary

Hiding contents from all search engines requires programming skills that many sites can’t afford. Even leading search engines like Google don’t provide simple and suitable ways to deindex content -respectively to prevent content from indexing- without collateral damage (lost/wasted PageRank). We desperately need better tools. Maybe my robots.txt extensions are worth an inspection.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

26 comments Sebastian | URL removal, Web development, Duplicate Content, X-Robots-Tag, Robots Meta Tags, Crawler Directives, SEO, robots.txt, Webmaster Central, Google

My plea to Google - Please sanitize your REP revamps

Posted on 3 January, 2008

Standardization of REP tags as robots.txt directives

Google is confules on REP standards and robots.txt This draft is kinda request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains their own REP standard). It’s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported by all major search engines. Furthermore it considers work in progress leaked out from Google.

In the following I’ll try to define a few robots.txt directives that Webmasters really need.

Show Table of Contents

Currently Google experiments with new robots.txt directives, that is REP tags like “noindex” adapted for robots.txt. That’s a welcomed and brilliant move.

Unfortunately, they got it totally wrong, again. (Skip the longish explanation of the rel-nofollow fiasco and my rant on Google’s current robots.txt experiments.)

Google’s last try to enhance the REP by adapting a REP tag’s value in another level was a miserable failure. Not because crawler directives on link-level are a bad thing, the opposite is true, but because the implementation of rel-nofollow confused the hell out of Webmasters, and still does.

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Don’t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster’s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.

Recap: When Google launched rel-nofollow in 2005, they promoted it as a tool to fight comment spam.

From now on, when Google sees the attribute (rel=”nofollow”) on hyperlinks, those links won’t get any credit when we rank websites in our search results. This isn’t a negative vote for the site where the comment was posted; it’s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.

Technically spoken, this translates to “search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, nor other link juice originating from rel-nofollow links”. Rel=”nofollow” meant rel=”pass-no-reputation”.

All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn’t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that didn’t update to a software version that supported rel-nofollow.

Experiment failed, move on to better solutions like Akismet, captchas or ajax’ed comment forms? Nope, it’s not that easy. Google had a hidden agenda. Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular band wagon. In 2005 Google had mastered the guestbook spam problem already. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more.

Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren’t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. Google actually invented rel-nofollow to get a grip on paid links. Recently they announced that Googlebot no longer follows condomized links (pre-Bigdaddy Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold war on paid links became hot.

Of course the sneaky morphing of rel-nofollow from “pass no reputation” to a full blown “nofollow” is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links.

To stay fair, Danny Sullivan said twice that rel-nofollow is Dave Winer’s fault, and Google as the victim is not to blame.

Rel-nofollow is settled now. However, I don’t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:

Google’s “Noindex: in robots.txt” experiment

Google supports a Noindex: directive in robots.txt. It seems Google’s Noindex: blocks crawling like Disallow:, but additionally prevents URLs blocked with Noindex: both from accumulating PageRank as well as from indexing based on 3rd party signals like inbound links.

This functionality would be nice to have, but accomplishing it with “Noindex” is badly wrong. The REP’s “Noindex” value without an explicit “Nofollow” means “crawl it, follow its links, but don’t list it on SERPs”. With pagel-level directives (robots meta tags and X-Robots-Tags) Google handles “Noindex” exactly as defined, that means with an implicit “Follow”. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the “Follow” out of the game, because a search engine can’t follow links from uncrawled documents.

Webmasters will not understand that “Nofollow” means totally different things in robots.txt and meta tags. Also, this approach steals granularity that we need, for example for use with technically structured sitemap pages and other hubs.

According to Google their current interpretation of Noindex: in robots.txt is not yet set in stone. That means there’s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters.

Dear Googlers, don’t take the above said as Google bashing. I know, and often wrote, that Google is the search engine that puts the most efforts in boring tasks like REP evolvement. I just think that a dog company like Google needs to take real-world Webmasters into the boat when playing with standards like the REP, for the sake of the cats.

Recap: Existing robots.txt directives

The /path example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server’s root. Patterns can be useful to set crawler directives for a bunch of URIs:

*: any string in path or query string, including the query string delimiter “?”, multiple wildcards should be allowed.
$: end of URI
Trailing /: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectorie’s files etc., for example
- Disallow: /path/
  matches /path/index.html but not /path.html
- /path
  matches both /path/index.html and /path.html, as well as /path_1.html. It’s a pretty common mistake to “forget” the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.

Please note that patterns aren’t supported by all search engines, for example MSN supports only file extensions (yet?).

User-agent: [crawler name]
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the User-agent: * section that addresses all Web robots. Each User-agent: section must be terminated with at least one empty line.

Disallow: /path
Prevents from crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow’ed URLs can gather PageRank.

Allow: /path
Refines previous Disallow: statements. For example Disallow: /scripts/ Allow: /scripts/page.php
tells crawlers that they may fetch http://example.com/scripts/page.php or http://example.com/scripts/page.php?article=1, but not any other URL in http://example.com/scripts/.

Sitemap: [absolute URL]
Announces XML sitemaps to search engines. Example: Sitemap: http://example.com/sitemap.xml Sitemap: http://example.com/video-sitemap.xml
points all search engines that support Google’s Sitemaps Protocol to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn’t replace sitemap submissions. Google, Yahoo and MSN provide Webmaster Consoles where you not only can submit your sitemaps, but follow the indexing process (wishful thinking WRT particular SEs). In some cases it might be a bright idea to avoid the default file name “sitemap.xml” and keep the sitemap URLs out of robots.txt, sitemap autodiscovery is not for everyone.

Recap: Existing REP tags

REP tags are values that you can use in a page’s robots meta tag and X-Robots-Tag. Robots meta tags go to the HTML document’s HEAD section <meta name="robots" content="noindex, follow, noarchive" />
whereas X-Robots-Tags supply the same information in the HTTP header X-Robots-Tag: noindex, follow, noarchive
and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.

Widely supported REP tags are:

INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not
FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not
ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.
NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.
NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes an URL from Google’s search index a day after the given date/time

Problems with REP tags in robots.txt

REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but also comes with more complexity and a few pitfalls as well.

Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can’t obey REP tags without crawling the resource that supplies them. That means that not a single REP tag put as robots.txt statement shall be misunderstood as crawler directive.

For example Noindex: /path must not block crawling, not even in combination with Nofollow: /path, because there’s still the implicit “archive” (= absence of Noarchive: /path). Providing a cached copy even of a not indexed page makes sense for toolbar users.

Whether or not a search engine actually crawls a resource that’s tagged with “noindex, nofollow, noarchive, nosnippet” or so is up to the particular SE, but none of those values implies a Disallow: /path.
Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says “follow” for all links on a page, the crawler will not follow a link that is condomized with rel=”nofollow”.

Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Of course not in any case. Robots.txt is the gatekeeper, and so to say the “highest REP instance”. Actually, to this question there’s no absolute answer that satisfies everybody.

A Webmaster sitting on a huge conglomerate of legacy code may want to totally switch to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page’s “index,follow” from 1998 most probably doesn’t fly with 2008’s duplicate content filters and high sophisticated link pattern analytics.

The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.
REP tags used in robots.txt might refine crawler directives. For example a disallow’ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank caluculation and indexing for uncrawlable resources (see below under Noodp:/Noydir:, Noindex: and Norank:).

Google’s current approach to handle this with the Noindex: directive alone is not acceptable, we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.
In theory it makes no sense to maintain a directive that describes a default behavior. But why has the REP “follow” although the absence of “nofollow” perfectly expresses “follow”? Because of the way non-geeks think (try to explain why the value nil/null doesn’t equal empty/zero/blank to a non-geek. Not!).

Implicit directives that aren’t explicitely named and described in the rules don’t exist for the masses. Even in the 10 commandments someone had to write “thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack…” instead of a no-brainer like “publish unique and compelling content for people and make your stuff crawlable”. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow: respectively the absence of Disallow:) folks will whine, rant and argue when search engines ignore their stuff.

Obviously we need at least Index:, Follow: and Archive to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like Disallow: /cgi-bin/ Disallow: /content/ ... Follow: /cgi-bin/redirect.php Follow: /content/links/ ... Index: /content/articles/
without Allow: /content/links/, Allow: /content/articles/ and Allow: /cgi-bin/redirect.

Whether or not indexer directives that require crawling can overrule the crawler directive Disallow: is open for discussion. I vote for “not”.
Applying REP tags on site-level would be great, but it doesn’t solve other problems like the need of directives on block and element level. Both Google’s section targeting as well as Yahoo’s robots-nocontent class name aren’t acceptable tools capable to instruct search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).

Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that’s doable, Webmasters will make use of such tools to improve their rankings, and not alone to comply to the ever changing search engine policies that cost the Webmaster community billions of man hours each year.

I consider these robots.txt statements sexy: Nofollow a.advertising, div#adblock, span.cross-links: /path Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path
but that’s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.

Remember the rel-nofollow fiasco where a REP tag was used on HTML element level producing so much confusion and conflicts. Lets learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it’s clear and unambiguous.

Priority settings

The REP’s command hierarchy must be well defined:

robots.txt
Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.
[Future block level directives]
Element level directives like rel-nofollow

That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt. As long as the Webmaster doesn’t revert the latter:

Priority-page-level: /path
Default behavior, directives in robots meta tags overrule robots.txt statements. Necessary to reset previous Priority-site-level: statements.

Priority-site-level: /path
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.

Priority-site-level All: /path
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The “All” parameter would even dofollow nofollow’ed links when the robots.txt lacks corresponding Nofollow: statements.

Noindex: /path

Follow outgoing links, archive the page, but don’t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.

[Currently Google doesn’t crawl Noindex’ed URLs and most probably those can’t accumulate PageRank, hence URLs in /path can’t distribute PageRank. That’s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there’s no explicit Nofollow:, nor a “nofollow” meta tag respectively X-Robots-Tag.]

Norank: /path

Prevents URLs from accumulating PageRank, anchor text, and whatever link juice.

Makes sense to refine Disallow: statements in company with Noindex: and Noodp:/Noydir:, or to prevent TOS/contact/privacy/… pages and alike from sucking PageRank (nofollow’ing TOS links and stuff like that to control PageRank flow is fault-prone).

Nofollow: /path

The uber-link-condom. Don’t use outgoing links, not even internal links, for discovery crawling. Don’t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).

Noarchive: /path

Don’t make a cached copy of the resource available to searchers.

Nosnippet: /path

List the resource with linked page title on SERPs, but don’t create a text snippet, and don’t reprint the description meta tag.

[Why don’t we have a REP tag saying “use my description meta tag or nothing”?]

Nopreview: /path

Don’t create/link an HTML preview of this resource. That’s interesting for subscriptions sites and applies mostly to PDFs, Word documents, spread sheets, presentations, and other non-HTML resources. More information here.

Noodp: /path

Don’t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn’t supply its own title/meta description.

Noydir: /path

I’m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: “Don’t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.”

Unavailable_after [date]: /path

Deindex the resource the day after [date]. The parameter [date] is put in any date or date/time format, if it lacks a timezone then GMT is assumed.

[Google’s RFC 850 obsession is somewhat weird. There are many ways to put a timestamp other than “25-Aug-2007 15:00:00 EST”.]

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

In the search index remove the unwanted variable/value pair(s) from the URL’s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this “bare URL” redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.

Regardless whether the variable name or the variable’s value matches the pattern, “Truncate_*” statements remove a complete argument from the query string, that is &variable=value. If after the (last) truncate operation the query string is empty, the querystring delimiter “?” (questionmark) must be removed too.

Order-arguments [charset]: /path

Sort the query strings of all dynamic URLs by variable name, then within the ordered variables by their values. Pick the first URL from each set of identical results as canonical URL. Transfer PageRank etcetera from all dupes to the canonical URL.

Lots of sites out there were developed by coders who are utterly challenged by all things SEO. Most Web developers don’t even know what URL canonicalization means. Those sites suffer from tons of URLs that all serve identical contents, just because the query string arguments are put in random order, usually inventing a new sequence for each script, function, or include file. Of course most search engines run high sophisticated URL canonicalization routines to prevent their indexes from too much duplicate content, but those algos can fail because every Web site is different.

I totally can resist to suggest a Canonical-uri /: /Default.asp statement that gathers all IIS default-document-URI maladies. Also, case issues shouldn’t get fixed with Case-insensitive-uris: / but by the clueless developers in Redmond.

Will all this come true?

Well, Google has silently started to support REP tags in robots.txt, it totally makes sense both for search engines as well as for Webmasters, and Joe Webmaster’s life would be way more comfortable having REP tags for robots.txt.

A better question would be “will search engines implement REP tags for robots.txt in a way that Webmasters can live with it?”. Although Google launched the sitemaps protocol without significant help from the Webmaster community, I strongly feel that they desperately need our support with this move.

Currently it looks like they will fuck up the REP, respectively the robots.txt standard, hence go grab your AdWords rep and choke her/him until s/he promises to involve Larry, Sergey, Matt, Adam, John, and the whole Webmaster Support Team for the sake of common sense and the worldwide Webmaster community. Thank you!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

21 comments Sebastian | Robots Meta Tags, Web development, X-Robots-Tag, MSN, Crawler Directives, XML-Sitemaps, Microformats, Google, Yahoo, robots.txt, Nofollow

Google to change the Robots Exclusion Protocol again

Posted on 30 November, 2007

Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), evolved since 1994. Nowadays we’ve to deal with a conglomerate of not binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey “the standard”, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives, differently from other players as a rule.

Google is the search engine putting the most efforts into Robots Exclusion Protocol (REP) evolvements. Their XML Sitemaps handling submissions instead of crawl restrictions changed the REP to a wider scope, the X-Robots-Tag brought us robots meta tags for non-HTML resources like PDF documents, images or video clips, and with Unavailable_after Google made a few clueless news sites happy. With the rel-nofollow microformat on the other hand, respectively its sneaky morphing from a spam fighting tool to its current shape, Google made nobody happy. Yahoo contributed the well meant but half-assed “robots-nocontent” class name, and of course “noydir” (it’s unlikely that any other engine will support those).

Now Google is working on new robots.txt syntax, and I am, politely put, not amused. Here is why I fear that Google is going to totally mess up the REP:

Google supports a “Noindex:” directive in robots.txt, which is treated as “Disallow:”¹⁾. Of course that’s an experiment, but if this behavior doesn’t change we’ll get a beast that is -with regard to the confusion it will produce- way more evil than the rel-nofollow fiasco.

A noindex-alias for disallow makes no sense, even when such syntax errors are out there.
Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It’s bad enough that most Webmasters still believe that “Googlebot ranks their stuff”. (Actually, in some cases it can make sense. For example “nofollow” in robots meta tags (or at least for Google in REL attributes too) is both a crawler instruction as well as an indexer directive.)
Noindex and disallow are completely different commands. The REP’s noindex directive means “crawl it, follow its links, but don’t list it on the SERPs”. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.

Standards should be clear and unambiguous. Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they’ve the power to fuck up the REP, but they also have “do no evil”.

Considering that Google is run by a bunch of smart engineers, I hope that they’ll do the right thing eventually. The right thing in this case is giving more power to REP evolvements, before questionable and selfish anti-search initiatives like ACAP ruin both the robots.txt consensus as well as the robots meta tag standard.

My idea of more power to REP evolvements is:

Sensible implementation of crawler/indexer-directives adapted from REP tags in robots.txt. Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully nopreview) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header’s X-Robots-Tag.
Implementation of block-level directives in robots.txt. Allowing Webmasters to apply crawler instructions like “noindex” or “nofollow” to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eleminate many points of failure.
Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.

The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the bigger engines) talk with Webmasters and then launch the necessary REP extensions. The other engines will follow sooner or later. The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers.

¹⁾ Noindex: is not a plain Disallow:, there’s an interesting difference. In Google’s experiment both directives block crawling, but Disallow: allows URL-indexing based on 3rd party information, and Disallow:‘ed URLs can accumulate PageRank from internal as well as external links. Noindex:‘ed URLs on the other hand will not appear on SERPs as URL-only listing or with an ODP title and snippet, and I’m quite sure that they will not gather PageRank nor other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google’s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that’s a neat idea, but its implementation is flawed because it doesn’t consider the implicit Follow: (that’s syntax Google doesn’t support in robots.txt). A better way to mark site areas which shall not gather PageRank without raping the REP would be a Norank: directive or so. Noindex: without a Nofollow: must not block crawling. Googlebot must fetch those URLs to follow their links.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

9 comments Sebastian | Crawler Directives, Robots Meta Tags, X-Robots-Tag, XML-Sitemaps, robots.txt, Microformats, Google, Yahoo, Nofollow

NOPREVIEW - The missing X-Robots-Tag

Posted on 7 August, 2007

Google provides previews of non-HTML resources listed on their SERPs:
These “view as text” and “view as HTML” links are pretty useful when you for example want to scan a PDF document before you clutter your machine’s RAM with 30 megs of useless digital rights management (aka Adobe Reader). You can view contents even when the corresponding application is not installed, Google’s transformed previews should not stuff your maiden box with unwanted malware, etcetera. However, under some circumstances it would make sound sense to have a NOPREVIEW X-Robots-Tag, but unfortunately Google forgot to introduce it yet.

Google is rightfully proud of their capability to transform various file formats to readable HTML or plain text: Adobe Portable Document Format (pdf), Adobe PostScript (ps), Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku), Lotus WordPro (lwp), MacWrite (mw), Microsoft Excel (xls), Microsoft PowerPoint (ppt), Microsoft Word (doc), Microsoft Works (wks, wps, wdb), Microsoft Write (wri), Rich Text Format (rtf), Shockwave Flash (swf), of course Text (ans, txt) plus a couple of “unrecognized” file types like XML. New formats are added from time to time.

According to Adam Lasnik currently there is no way for Webmasters to tell Google not to include the “View as HTML” option. You can try to fool Google’s converters by messing up the non-HTML resource in a way that a sane parser can’t interpret it. Actually, when you search a few minutes you’ll find e.g. PDF files without the preview links on Google’s SERPs. I wouldn’t consider this attempt a bullet-proof nor future-proof tactic though, because Google is pretty intent on improving their conversion/interpretation process.

I like the previews not only because sometimes they allow me to read documents behind a login screen. That’s a loophole Google should close as soon as possible. When for example PDF documents or Excel sheets are crawlable but not viewable for searchers (at least not with the second click) that’s plain annoying both for the site as well as for the search engine user.

With HTML documents the Webmaster can apply a NOARCHIVE crawler directive to prevent non paying visitors from lurking via Google’s cached page copies. Thanks to the newish REP header tags one can do that with non-HTML resources too, but neither NOARCHIVE nor NOSNIPPET etch away the “view-as HTML” link.

<speculation>Is the lack of a NOPREVIEW crawler directive just an oversight, or is it stuck in the pipeline because Google is working on supplemental components and concepts? Google’s yet inconsistent handling of subscription content comes to mind as an ideal playground for such a robots directive in combination with a policy change.</speculation>

Anyways, there is a need for a NOPREVIEW robots tag, so why not implement it now? Thanks in advance.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

4 comments Sebastian | X-Robots-Tag, Search Quality, Robots Meta Tags, Crawler Directives, Google

1 | 2 Next Page »

Archived posts from the 'Robots Meta Tags' Category

The Twitter pest that costs you hard earned money

Bogus Twitter apps and their modus operandi

A few more bots that need a revamp, or at least minor tweaks

How to do it the right way

A meaningful user agent

A method to verify the bot

Obeying the robots.txt standard

Obeying REP indexer directives

Responsible behavior

Respecting copyrights

Conclusion, and call for action

What is NOINDEX?

How major search engines treat NOINDEX

Matt’s question: How should Google handle NOINDEX in the future?

My take: Yes, No, Depends

A fair solution: NOREFERENCE

The good way

The popular way

The definitive way

Summary

Standardization of REP tags as robots.txt directives

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Google’s “Noindex: in robots.txt” experiment

Recap: Existing robots.txt directives

Recap: Existing REP tags

Problems with REP tags in robots.txt

Priority settings

Noindex: /path

Norank: /path

Nofollow: /path

Noarchive: /path

Nosnippet: /path

Nopreview: /path

Noodp: /path

Noydir: /path

Unavailable_after [date]: /path

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

Order-arguments [charset]: /path

Will all this come true?

Categories

Monthly Archives

Links

RSS Feeds