<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; Microformats</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Get a grip on the Robots Exclusion Protocol (REP)</title>
		<link>http://sebastians-pamphlets.com/robots-exclusion-protocol-round-up-2008-01/</link>
		<comments>http://sebastians-pamphlets.com/robots-exclusion-protocol-round-up-2008-01/#comments</comments>
		<pubDate>Thu, 17 Jan 2008 14:26:25 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[URL removal]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[XML-Sitemaps]]></category>

		<category><![CDATA[Microformats]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/robots-exclusion-protocol-round-up-2008-01/</guid>
		<description><![CDATA[Thanks to the very nice folks over at SEOmoz I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round up:
Robots Exclusion Protocol 101
My REP 101&#160; links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/rep-command-hierarchy.png" width="250" height="200" align="right" style="margin-left:4px;" alt="REP command hierarchy" title="The REP's command hierarchy"  />Thanks to <a href="http://www.seomoz.org/blog/">the very nice folks over at SEOmoz</a> I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round up:</p>
<p><a href="http://www.seomoz.org/blog/robots-exclusion-protocol-101" style="font-size:125%; font-weight:bold;">Robots Exclusion Protocol 101</a></p>
<p>My <em>REP 101</em> links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of each REP component. It explains the difference between crawler directives and indexer directives, and which command hierarchy search engines follow when REP directives set at different levels conflict.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/educate-yourself-on-the-rep.png" width="250" height="223" align="left" style="margin-right:4px;" alt="Educate yourself on the REP" title="Educate yourself on the Robots Exclusion Protocol"  />Why do I think that solid REP knowledge is important right now? Not only because of the confusion that exists thanks to the volume of crappy advice provided at every Webmaster hangout. Of course understanding the REP makes webmastering easier, thus I&#8217;m glad when my REP related pamphlets are considered somewhat helpful.  </p>
<p>I&#8217;ve a hidden agenda, though. I predict that the REP is going to change shortly. As usual, its evolution is driven by a major search engine, since the W3C and such organizations don&#8217;t bother with the conglomerate of quasi standards and RFCs known as the Robots Exclusion Protocol. In general that&#8217;s not a bad thing. Search engines deal with the REP every day, so they have a legitimate interest. </p>
<p>Unfortunately not every REP extension that search engines have invented so far is useful for Webmasters; some of them are plain crap. Learning from fiascos and riots of the past, the engines are well advised to ask Webmasters for feedback before they announce further REP directives. </p>
<p>I&#8217;ve a feeling that shortly a well known search engine will launch a survey regarding particular REP related ideas. I want Webmasters to be well aware of the REP&#8217;s complexity and functionality when they contribute their take on REP extensions. <b>So please educate yourself</b>. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>My pamphlet discussing a possible <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">standardization of REP tags as robots.txt directives</a> could be a useful reference, also please watch the great video <a href="http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/">here</a>. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/robots-exclusion-protocol-round-up-2008-01/feed/</wfw:commentRss>
		</item>
		<item>
		<title>My plea to Google - Please sanitize your REP revamps</title>
		<link>http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/</link>
		<comments>http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#comments</comments>
		<pubDate>Fri, 04 Jan 2008 00:02:46 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[XML-Sitemaps]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/</guid>
		<description><![CDATA[Standardization of REP tags as robots.txt directives
This draft is kinda request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains their own REP standard). It&#8217;s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported [...]]]></description>
			<content:encoded><![CDATA[<h3 id="rep-revamp-serious-post-title">Standardization of REP tags as robots.txt directives</h3>
<p><img src="http://sebastians-pamphlets.com/img/google/google-confused-on-rep-robots.txt.png" width="234" height="250" align="right" style="margin-left:4px;" alt="Google is confused on REP standards and robots.txt" title="Please help Google to get it!" />This draft is a kind of request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains its own REP standard). It&#8217;s based on/extends the robots.txt specifications from <a href="http://www.robotstxt.org/orig.html">1994</a> and <a href="http://www.robotstxt.org/norobots-rfc.txt">1996</a>, as well as <a href="http://sitemaps.org/protocol.php#submit_robots">additions</a> supported by all major search engines. Furthermore it considers work in progress leaked out from Google. </p>
<p>In the following I&#8217;ll try to define a few robots.txt directives that Webmasters really need.</p>
<p><a onclick="hideContent('show-rep-draft-toc'); showContent('rep-draft-toc'); return false;" id="show-rep-draft-toc">Show Table of Contents</a>
<div id="rep-draft-toc" style="display:none;">
<ul><b style="margin-left:-20px;"><big>Jump station:</big></b></p>
<li><a href="#rel-nofollow-rep-abuse">Rel-Nofollow or how Google abused standardization of crawler directives for selfish goals</a></li>
<li><a href="#google-noindex-robots-txt-experiment">Google&#8217;s &#8220;<em>Noindex:</em> in robots.txt&#8221; experiment</a></li>
<li><a href="#existing-robots-txt-statements">Recap: Existing robots.txt directives</a></li>
<li style="padding-left:10px;"><a href="#robots-txt-patterns">URIs, patterns, &#8230;</a></li>
<li style="padding-left:10px;"><a href="#robots-txt-user-agent">User-agent: [name]</a></li>
<li style="padding-left:10px;"><a href="#robots-txt-disallow">Disallow: /path</a></li>
<li style="padding-left:10px;"><a href="#robots-txt-allow">Allow: /path</a></li>
<li style="padding-left:10px;"><a href="#robots-txt-sitemap">Sitemap: [absolute URL]</a></li>
<li><a href="#existing-rep-tags">Recap: Existing REP tags</a></li>
<li><a href="#probs-with-rep-tags-in-robots-txt">Problems with REP tags in robots.txt</a></li>
<li><a href="#rep-command-priority">Priority settings</a></li>
<li><a href="#robots-txt-noindex">Noindex: /path</a></li>
<li><a href="#robots-txt-norank">Norank: /path</a></li>
<li><a href="#robots-txt-nofollow">Nofollow: /path</a></li>
<li><a href="#robots-txt-noarchive">Noarchive: /path</a></li>
<li><a href="#robots-txt-nosnippet">Nosnippet: /path</a></li>
<li><a href="#robots-txt-nopreview">Nopreview: /path</a></li>
<li><a href="#robots-txt-noodp">Noodp: /path</a></li>
<li><a href="#robots-txt-noydir">Noydir: /path</a></li>
<li><a href="#robots-txt-unavail-after">Unavailable_after [date]: /path</a></li>
<li><a href="#robots-txt-truncate-variable">Truncate-variable [string|pattern]: /path</a></li>
<li><a href="#robots-txt-truncate-value">Truncate-value [string|pattern]: /path</a></li>
<li><a href="#robots-txt-order-arguments">Order-arguments [charset]: /path</a></li>
<li><a href="#will-my-dreams-come-true">Will all this come true?</a></li>
<li><a href="http://sphinn.com/story/21269">Sphinn this pamphlet to make my plea popular</a></li>
</ul>
<p>&nbsp;</p>
</div>
<p>Currently <a href="http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/">Google experiments with new robots.txt directives</a>, that is <a href="http://www.robotstxt.org/metabof.html">REP tags</a> like &#8220;noindex&#8221; adapted for robots.txt. That&#8217;s a welcome and brilliant move. </p>
<p>Unfortunately, they got it totally wrong, again. (<a href="#existing-robots-txt-statements">Skip the longish explanation of the rel-nofollow fiasco and my rant on Google&#8217;s current robots.txt experiments</a>.) </p>
<p>Google&#8217;s last try to enhance the REP by adapting a REP tag&#8217;s value in another level was a miserable failure. Not because crawler directives on link-level are a bad thing, the opposite is true, but because the implementation of <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">rel-nofollow</a> <a href="http://sebastians-pamphlets.com/dear-search-engines-please-bury-the-relnofollow-fiasko/">confused the hell out of Webmasters</a>, and still does. </p>
<h3 id="rel-nofollow-rep-abuse">Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes</h3>
<p style="margin-left:10px;">Don&#8217;t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster&#8217;s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.</p>
<p><b>Recap:</b> When Google launched rel-nofollow in 2005, they promoted it as <a href="http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html">a tool to fight comment spam</a>.<br />
<blockquote>From now on, when Google sees the attribute (rel=&#8221;nofollow&#8221;) on hyperlinks, those links won&#8217;t get any credit when we rank websites in our search results. This isn&#8217;t a negative vote for the site where the comment was posted; it&#8217;s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.</p></blockquote>
<p> Technically speaking, this translates to &#8220;search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, or other link juice originating from rel-nofollow links&#8221;. <b>Rel=&#8221;nofollow&#8221; meant rel=&#8221;pass-no-reputation&#8221;.</b></p>
<p>All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn&#8217;t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that didn&#8217;t update to a software version that supported rel-nofollow. </p>
<p>Experiment failed, move on to better solutions like Akismet, captchas or ajax&#8217;ed comment forms? Nope, it&#8217;s not that easy. <b>Google had a hidden agenda.</b> Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular band wagon. In 2005 Google had mastered the guestbook spam problem already. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more. </p>
<p>Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren&#8217;t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. <b>Google actually invented rel-nofollow to get a grip on paid links.</b> Recently they announced that Googlebot no longer follows condomized links (pre-<a href="http://www.mattcutts.com/blog/fall-weather-forecast/">Bigdaddy</a> Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold <a href="http://searchengineland.com/071231-101811.php">war on paid links</a> became hot.</p>
<p>Of course the sneaky morphing of rel-nofollow from &#8220;pass no reputation&#8221; to a full blown &#8220;nofollow&#8221; is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links. </p>
<p>To stay fair, <a href="http://searchengineland.com/">Danny Sullivan</a> <a href="http://sphinn.com/story/21269#c24519">said</a> <a href="http://sphinn.com/story.php?id=22633#c26052">twice</a> that rel-nofollow is Dave Winer&#8217;s fault, and Google as the victim is not to blame.</p>
<p><b>Rel-nofollow is settled now. However, I don&#8217;t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:</b> </p>
<h3 id="google-noindex-robots-txt-experiment">Google&#8217;s &#8220;<em>Noindex:</em> in robots.txt&#8221; experiment</h3>
<p><a href="http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/">Google supports a <em>Noindex:</em> directive in robots.txt</a>. It <a href="http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/#just-in-case-rant">seems</a> <a href="http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/#noindex-disallow-peculiarities">Google&#8217;s <em>Noindex:</em> blocks crawling like <em>Disallow:</em></a>, but additionally prevents URLs blocked with <em>Noindex:</em> both from accumulating PageRank as well as from indexing based on 3rd party signals like inbound links. </p>
<p>This functionality would be nice to have, but accomplishing it with &#8220;Noindex&#8221; is badly wrong. The REP&#8217;s &#8220;Noindex&#8221; value without an explicit &#8220;Nofollow&#8221; means &#8220;crawl it, follow its links, but don&#8217;t list it on SERPs&#8221;. With page-level directives (robots meta tags and X-Robots-Tags) Google handles &#8220;Noindex&#8221; exactly as defined, that means with an implicit &#8220;Follow&#8221;. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the &#8220;Follow&#8221; out of the game, because a search engine can&#8217;t follow links from uncrawled documents. </p>
<p><b>Webmasters will not understand that &#8220;Noindex&#8221; means totally different things in robots.txt and meta tags. Also, this approach steals granularity that we need</b>, for example for use with technically structured sitemap pages and other hubs. </p>
<p>According to Google their current interpretation of <em>Noindex:</em> in robots.txt is not yet set in stone. That means there&#8217;s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters. </p>
<p>Dear Googlers, don&#8217;t take the above as Google bashing. I know, and have often written, that Google is the search engine that puts the most effort into boring tasks like REP evolution. I just think that a dog company like Google needs to bring real-world Webmasters on board when playing with standards like the REP, for the sake of the cats. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<h3 id="existing-robots-txt-statements">Recap: Existing robots.txt directives</h3>
<p id="robots-txt-patterns">The <em>/path</em> example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server&#8217;s root. Patterns can be useful to set crawler directives for a bunch of URIs:</p>
<ul>
<li><a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35303"><b>*</b></a>: any string in path or query string, including the query string delimiter &#8220;?&#8221;. Multiple wildcards should be allowed.</li>
<li><a href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&#038;answer=35308"><b>$</b></a>: end of URI</li>
<li><b>Trailing /</b>: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectories&#8217; files etc., for example
<ul>
<li><code>Disallow: /path/</code><br />
matches /path/index.html but not /path.html</li>
<li><code>Disallow: /path</code><br />
matches both /path/index.html and /path.html, as well as /path_1.html. It&#8217;s a pretty common mistake to &#8220;forget&#8221; the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.</li>
</ul>
</li>
</ul>
<p>Please note that patterns aren&#8217;t supported by all search engines, for example MSN supports only file extensions (yet?).</p>
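<p>The pattern rules above can be sketched in a few lines of Python. This is a minimal illustration, not any search engine&#8217;s actual matcher: the function names <code>rep_pattern_to_regex</code> and <code>matches</code> are made up for this example, and it assumes that a pattern is matched from the start of the URI (path plus query string) and behaves as a plain prefix unless it ends with <code>$</code>.</p>

```python
import re


def rep_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    '*' matches any run of characters (including '?'), a trailing '$'
    anchors the end of the URI, everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, uri):
    """Return True when the URI (path plus query string) matches the pattern."""
    # re.match anchors at the start of the URI, which gives prefix semantics.
    return rep_pattern_to_regex(pattern).match(uri) is not None
```

<p>With this sketch, <code>matches("/path/", "/path.html")</code> is false while <code>matches("/path", "/path_1.html")</code> is true, which is exactly the forgotten trailing slash pitfall described above.</p>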
<p><b id="robots-txt-user-agent">User-agent: [crawler name]</b><br />
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the <code>User-agent: *</code> section that addresses all Web robots. Each <em>User-agent:</em> section must be terminated with at least one empty line.</p>
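<p>How a crawler picks its section could look like the following Python sketch. It assumes that an empty line (or a new <em>User-agent:</em> line after at least one rule) ends a section, and it matches crawler names exactly, which is a simplification. <code>parse_groups</code> and <code>rules_for</code> are hypothetical helper names.</p>

```python
def parse_groups(robots_txt):
    """Split robots.txt text into {crawler name: [(field, value), ...]}."""
    groups = {}
    current_agents = []   # crawler names addressed by the section being read
    in_rules = False      # True once the section has at least one rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            # An empty line terminates the current User-agent section.
            current_agents, in_rules = [], False
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new section.
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif current_agents:
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
        # Lines outside of any section are ignored in this sketch.
    return groups


def rules_for(groups, crawler_name):
    """A crawler that finds its own section ignores the '*' section."""
    name = crawler_name.lower()
    if name in groups:
        return groups[name]
    return groups.get("*", [])
```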
<p><b id="robots-txt-disallow">Disallow: /path</b><br />
Prevents crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow&#8217;ed URLs can gather PageRank.</p>
<p><b id="robots-txt-allow">Allow: /path</b><br />
Refines previous <em>Disallow:</em> statements. For example <code><br />
Disallow: /scripts/<br />
Allow: /scripts/page.php </code><br />
tells crawlers that they may fetch http://example.com/scripts/page.php or http://example.com/scripts/page.php?article=1, but not any other URL in http://example.com/scripts/.</p>
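<p>One way to model this refinement is a &#8220;most specific rule wins&#8221; check, sketched below in Python. That rule is an assumption chosen because it reproduces the example above; engines may resolve conflicts differently. The function name <code>may_fetch</code> is hypothetical, and wildcards are left out to keep the sketch short.</p>

```python
def may_fetch(rules, path):
    """Decide whether a path may be crawled under one User-agent section.

    rules is a list of (field, prefix) pairs. Assumption: the longest
    matching prefix wins, Allow wins a tie, and a path that matches no
    rule at all may be crawled.
    """
    best_length, allowed = -1, True
    for field, prefix in rules:
        field = field.lower()
        if field not in ("allow", "disallow"):
            continue   # other directives do not steer crawling
        if not prefix or not path.startswith(prefix):
            continue   # an empty value matches nothing
        is_allow = field == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length, allowed = len(prefix), is_allow
    return allowed
```

<p>Fed with the two rules from the example, the sketch allows /scripts/page.php and /scripts/page.php?article=1, and blocks every other URL under /scripts/.</p>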
<p><b id="robots-txt-sitemap">Sitemap: [absolute URL]</b><br />
Announces XML sitemaps to search engines. Example: <code><br />
Sitemap: http://example.com/sitemap.xml<br />
Sitemap: http://example.com/video-sitemap.xml </code><br />
points all search engines that support Google&#8217;s <a href="http://sitemaps.org/">Sitemaps Protocol</a> to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn&#8217;t replace sitemap submissions. <a href="http://google.com/webmasters/tools/">Google</a>, <a href="https://siteexplorer.search.yahoo.com/">Yahoo</a> and <a href="http://webmaster.live.com/">MSN</a> provide Webmaster Consoles where you can not only submit your sitemaps, but also follow the indexing process (wishful thinking WRT particular SEs). In some cases <a href="http://www.wolf-howl.com/case-study/using-google-sitemaps-for-competitive-intelligence/">it might be a bright idea</a> to avoid the default file name &#8220;sitemap.xml&#8221; and keep the sitemap URLs out of robots.txt; <a href="http://sebastians-pamphlets.com/is-xml-sitemap-autodiscovery-for-everyone/">sitemap autodiscovery is not for everyone</a>.</p>
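<p>Autodiscovery itself is simple, as this small Python sketch suggests. It scans the whole file because <em>Sitemap:</em> lines are not tied to a <em>User-agent:</em> section; the function name <code>sitemap_urls</code> is made up for the example.</p>

```python
def sitemap_urls(robots_txt):
    """Collect the absolute URLs announced by Sitemap: lines."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        # Split at the first colon only, so the URL's own colons survive.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```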
<h3 id="existing-rep-tags">Recap: Existing REP tags</h3>
<p>REP tags are values that you can use in a page&#8217;s <a href="http://sebastians-pamphlets.com/links/categories/?cat=robots-meta-tags">robots meta tag</a> and <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tag</a>. Robots meta tags go to the HTML document&#8217;s HEAD section <code><br />
&lt;meta name="robots" content="noindex, follow, noarchive" /&gt;</code><br />
whereas X-Robots-Tags supply the same information in the HTTP header <code><br />
X-Robots-Tag: noindex, follow, noarchive </code><br />
and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.</p>
<ul><b style="margin-left:-20px;">Widely supported REP tags are:</b></p>
<li>INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not</li>
<li>FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not</li>
<li>ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW</li>
<li>NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.</li>
<li>NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.</li>
<li>NOARCHIVE - Google specific, used to prevent archiving (cached page copy)</li>
<li>NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs</li>
<li>UNAVAILABLE_AFTER: <a href="http://www.ietf.org/rfc/rfc0850.txt">RFC 850</a> formatted timestamp - Removes a URL from Google’s search index a day after the given date/time</li>
</ul>
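<p>To illustrate how the listed values combine, here is a minimal Python sketch that turns the content of a robots meta tag or X-Robots-Tag into a set of switches. The defaults (everything permitted unless a &#8220;no&#8230;&#8221; value says otherwise) follow the list above; the function name <code>parse_rep_tags</code> and the dictionary keys are assumptions made for this example, and UNAVAILABLE_AFTER is simply skipped.</p>

```python
def parse_rep_tags(content):
    """Turn a value like 'noindex, follow, noarchive' into a dict of booleans."""
    # Default behavior: absence of a 'no...' value means the positive one.
    flags = {"index": True, "follow": True, "archive": True,
             "snippet": True, "odp": True, "ydir": True}
    for token in content.lower().split(","):
        token = token.strip()
        if token == "all":
            flags["index"] = flags["follow"] = True
        elif token == "none":
            flags["index"] = flags["follow"] = False
        elif token.startswith("no") and token[2:] in flags:
            flags[token[2:]] = False      # noindex, nofollow, noarchive, ...
        elif token in flags:
            flags[token] = True           # explicit index, follow, ...
        # Unknown values (e.g. unavailable_after) are ignored in this sketch.
    return flags
```

<p>For the example value used above, <code>parse_rep_tags("noindex, follow, noarchive")</code> switches off &#8220;index&#8221; and &#8220;archive&#8221; and leaves &#8220;follow&#8221; and &#8220;snippet&#8221; on.</p>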
<h3 id="probs-with-rep-tags-in-robots-txt">Problems with REP tags in robots.txt</h3>
<p>REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but also comes with more complexity and a few pitfalls as well. </p>
<ul>
<li>
<p>Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can&#8217;t obey REP tags without crawling the resource that supplies them. That means that no REP tag used as a robots.txt statement may be misunderstood as a crawler directive.</p>
<p>For example <em>Noindex: /path</em> must not block crawling, not even in combination with <em>Nofollow: /path</em>, because there&#8217;s still the implicit &#8220;archive&#8221; (= absence of <em>Noarchive: /path</em>). Providing a cached copy even of a not indexed page makes sense for toolbar users.</p>
<p>Whether or not a search engine actually crawls a resource that&#8217;s tagged with &#8220;noindex, nofollow, noarchive, nosnippet&#8221; or so is up to the particular SE, but none of those values implies a <em>Disallow: /path</em>.</p>
</li>
<li>
<p>Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says &#8220;follow&#8221; for all links on a page, the crawler will not follow a link that is condomized with rel=&#8221;nofollow&#8221;.</p>
<p>Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Of course not in every case. Robots.txt is the gatekeeper, and so to speak the &#8220;highest REP instance&#8221;. Actually, to this question there&#8217;s no absolute answer that satisfies everybody.</p>
<p>A Webmaster sitting on a huge conglomerate of legacy code may want to totally switch to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page&#8217;s &#8220;index,follow&#8221; from 1998 most probably doesn&#8217;t fly with 2008&#8217;s duplicate content filters and highly sophisticated link pattern analysis.</p>
<p>The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.</p>
</li>
<li>
<p>REP tags used in robots.txt might refine crawler directives. For example a disallow&#8217;ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank calculation and indexing for uncrawlable resources (see below under <em>Noodp:/Noydir:</em>, <em>Noindex:</em> and <em>Norank:</em>).</p>
<p>Google&#8217;s current approach to handle this with the <em>Noindex:</em> directive alone is not acceptable, we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.</p>
</li>
<li>
<p>In theory it makes no sense to maintain a directive that describes a default behavior. But why has the REP &#8220;follow&#8221; although the absence of &#8220;nofollow&#8221; perfectly expresses &#8220;follow&#8221;? Because of the way non-geeks think (try to explain why the value nil/null doesn&#8217;t equal empty/zero/blank to a non-geek. Not!).</p>
<p>Implicit directives that aren&#8217;t explicitly named and described in the rules don&#8217;t exist for the masses. Even in the 10 commandments someone had to write &#8220;thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack&#8230;&#8221; instead of a no-brainer like &#8220;publish unique and compelling content for people and make your stuff crawlable&#8221;. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow:, or rather the absence of Disallow:) folks will whine, rant and argue when search engines ignore their stuff.</p>
<p>Obviously we need at least <em>Index:</em>, <em>Follow:</em> and <em>Archive:</em> to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like <code><br />
Disallow: /cgi-bin/<br />
Disallow: /content/<br />
...<br />
Follow: /cgi-bin/redirect.php<br />
Follow: /content/links/<br />
...<br />
Index: /content/articles/</code><br />
without <code>Allow: /content/links/</code>, <code>Allow: /content/articles/</code> and <code>Allow: /cgi-bin/redirect</code>.</p>
<p>Whether or not indexer directives that require crawling can overrule the crawler directive <em>Disallow:</em> is open for discussion. I vote for &#8220;not&#8221;.</p>
</li>
<li>
<p>Applying REP tags on site-level would be great, but it doesn&#8217;t solve other problems like the need for directives on block and element level. Neither Google&#8217;s section targeting nor Yahoo&#8217;s robots-nocontent class name is an acceptable tool for instructing search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).</p>
<p>Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that&#8217;s doable, Webmasters will make use of such tools to improve their rankings, and not only to comply with the ever changing search engine policies that cost the Webmaster community billions of man hours each year. </p>
<p>I consider these robots.txt statements sexy: <code><br />
Nofollow a.advertising, div#adblock, span.cross-links: /path<br />
Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path </code><br />
but that&#8217;s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.</p>
</li>
</ul>
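<p>The alphabetically ordered robots.txt example in the list above suggests a simple sanity check, sketched here in Python: flag every indexer directive whose path is blocked by a <em>Disallow:</em> that no more specific <em>Allow:</em> reopens. <em>Index:</em>, <em>Follow:</em> and <em>Archive:</em> are directives proposed in this draft, not existing ones, and the &#8220;longest prefix wins&#8221; rule as well as the names <code>longest_match</code> and <code>thwarted_directives</code> are assumptions of the sketch.</p>

```python
INDEXER_FIELDS = ("index", "follow", "archive")   # proposed, require crawling


def longest_match(prefixes, path):
    """Length of the longest prefix that matches path, or -1 if none does."""
    return max((len(p) for p in prefixes if p and path.startswith(p)), default=-1)


def thwarted_directives(rules):
    """Return the (field, path) pairs that a Disallow: renders useless.

    rules is a list of (field, path) pairs from one User-agent section.
    """
    disallowed = [p for f, p in rules if f.lower() == "disallow"]
    allowed = [p for f, p in rules if f.lower() == "allow"]
    return [
        (field, path)
        for field, path in rules
        if field.lower() in INDEXER_FIELDS
        # Blocked when the best Disallow match is more specific than any Allow.
        and longest_match(disallowed, path) > longest_match(allowed, path)
    ]
```

<p>Run against the example file, the check reports all three <em>Follow:</em>/<em>Index:</em> lines; once the three <em>Allow:</em> statements are added it reports nothing.</p>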
<p>Remember the <a href="#rel-nofollow-rep-abuse">rel-nofollow fiasco</a> where a REP tag was used on HTML element level, producing so much confusion and so many conflicts. Let&#8217;s learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it&#8217;s clear and unambiguous.</p>
<h3 id="rep-command-priority">Priority settings</h3>
<p>The REP&#8217;s command hierarchy must be well defined:
<ol>
<li>robots.txt</li>
<li>Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.</li>
<li>[Future block level directives]</li>
<li>Element level directives like rel-nofollow</li>
</ol>
<p>That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt. That holds as long as the Webmaster doesn&#8217;t reverse this order:</p>
<p><b>Priority-page-level: /path</b><br />
Default behavior, directives in robots meta tags overrule robots.txt statements. Necessary to reset previous <em>Priority-site-level:</em> statements.</p>
<p><b>Priority-site-level: /path</b><br />
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.</p>
<p><b>Priority-site-level All: /path</b><br />
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The &#8220;All&#8221; parameter would even dofollow nofollow&#8217;ed links when the robots.txt lacks corresponding <em>Nofollow:</em> statements.</p>
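<p>The proposed priority rules can be expressed as a small resolver, sketched below in Python. It is only one reading of the three <em>Priority-&#8230;</em> statements above: the function name <code>resolve</code>, its parameters and the priority labels are hypothetical, and the sketch assumes that element level directives still win under <em>Priority-site-level</em> (only the &#8220;All&#8221; variant ignores them).</p>

```python
def resolve(robots_txt=None, meta=None, x_robots=None, element=None,
            priority="page-level", default=True):
    """Resolve one REP value (say 'follow') across the command hierarchy.

    Each level is True, False, or None when it states nothing.
    priority is 'page-level' (default), 'site-level' or 'site-level-all'.
    """
    # X-Robots-Tag values overrule conflicting meta tag values.
    page = x_robots if x_robots is not None else meta
    if priority == "site-level-all":
        # Everything outside robots.txt is completely ignored.
        candidates = [robots_txt]
    elif priority == "site-level":
        # robots.txt overrules page level tags.
        candidates = [element, robots_txt, page]
    else:
        # Default: the more specific level overrules the broader one.
        candidates = [element, page, robots_txt]
    for value in candidates:
        if value is not None:
            return value
    return default   # nothing stated anywhere: the permissive default applies
```

<p>For instance, <code>resolve(element=False, priority="site-level-all")</code> yields the default (follow), which mirrors the remark that the &#8220;All&#8221; parameter would even dofollow nofollow&#8217;ed links when robots.txt lacks a corresponding <em>Nofollow:</em> statement.</p>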
<h3 id="robots-txt-noindex">Noindex: /path</h3>
<p>Follow outgoing links, archive the page, but don&#8217;t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.</p>
<p>[Currently Google doesn&#8217;t crawl Noindex&#8217;ed URLs and most probably those can&#8217;t accumulate PageRank, hence URLs in /path can&#8217;t distribute PageRank. That&#8217;s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there&#8217;s no explicit <em>Nofollow:</em>, nor a &#8220;nofollow&#8221; meta tag respectively X-Robots-Tag.]</p>
<h3 id="robots-txt-norank">Norank: /path</h3>
<p>Prevents URLs from accumulating PageRank, anchor text, and whatever link juice. </p>
<p>Makes sense to refine <em>Disallow:</em> statements in combination with <em>Noindex:</em> and <em>Noodp:/Noydir:</em>, or to prevent TOS/contact/privacy/&#8230; pages and the like from sucking PageRank (nofollow&#8217;ing TOS links and stuff like that to control PageRank flow is error-prone).</p>
<h3 id="robots-txt-nofollow">Nofollow: /path</h3>
<p>The uber-<a href="http://link-condom.com/">link-condom</a>. Don&#8217;t use outgoing links, not even internal links, for discovery crawling. Don&#8217;t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).</p>
<h3 id="robots-txt-noarchive">Noarchive: /path</h3>
<p>Don&#8217;t make a cached copy of the resource available to searchers.</p>
<h3 id="robots-txt-nosnippet">Nosnippet: /path</h3>
<p>List the resource with linked page title on SERPs, but don&#8217;t create a text snippet, and don&#8217;t reprint the description meta tag.</p>
<p>[Why don&#8217;t we have a REP tag saying &#8220;use my description meta tag or nothing&#8221;?]</p>
<h3 id="robots-txt-nopreview">Nopreview: /path</h3>
<p>Don&#8217;t create/link an HTML preview of this resource. That&#8217;s interesting for subscription sites and applies mostly to PDFs, Word documents, spreadsheets, presentations, and other non-HTML resources. <a href="http://sebastians-pamphlets.com/nopreview-the-missing-x-robots-tag/">More information here</a>. </p>
<h3 id="robots-txt-noodp">Noodp: /path</h3>
<p>Don&#8217;t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn&#8217;t supply its own title/meta description.</p>
<h3 id="robots-txt-noydir">Noydir: /path</h3>
<p>I&#8217;m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: &#8220;Don&#8217;t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.&#8221;</p>
<h3 id="robots-txt-unavail-after">Unavailable_after [date]: /path</h3>
<p>Deindex the resource the day after [date]. The parameter [date] can be given in any date or date/time format; if it lacks a timezone, GMT is assumed.</p>
<p>[Google&#8217;s <a href="http://www.ietf.org/rfc/rfc0850.txt">RFC 850</a> obsession is somewhat weird. There are many ways to put a timestamp other than &#8220;25-Aug-2007 15:00:00 EST&#8221;.]</p>
<h3 id="robots-txt-truncate-variable">Truncate-variable [string|pattern]: /path</h3>
<h3 id="robots-txt-truncate-value">Truncate-value [string|pattern]: /path</h3>
<p>In the search index remove the unwanted variable/value pair(s) from the URL&#8217;s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this &#8220;bare URL&#8221; redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.</p>
<p>Regardless of whether the variable name or the variable&#8217;s value matches the pattern, &#8220;Truncate-*&#8221; statements remove a complete argument from the query string, that is <code>&amp;variable=value</code>. If the query string is empty after the (last) truncate operation, the query string delimiter &#8220;?&#8221; (question mark) must be removed too. </p>
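<p>The truncate operation could look roughly like the following sketch. The function name and its parameters are hypothetical, and it assumes that a pattern must match the whole variable name or the whole value:</p>

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def truncate_arguments(url, name_pattern=None, value_pattern=None):
    """Drop every query string argument whose name or value matches a pattern."""
    parts = urlsplit(url)
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name_pattern and re.fullmatch(name_pattern, name):
            continue  # matching variable name: remove the complete argument
        if value_pattern and re.fullmatch(value_pattern, value):
            continue  # matching value: remove the complete argument
        kept.append((name, value))
    # urlunsplit() omits the "?" delimiter when the remaining query string is empty.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(truncate_arguments("http://example.com/page?id=7&sessionid=abc123",
                         name_pattern=r"sessionid"))
```

<p>An engine would then index the returned &#8220;bare URL&#8221; and credit it with the link juice of the longer variants.</p>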
<h3 id="robots-txt-order-arguments">Order-arguments [charset]: /path</h3>
<p>Sort the query strings of all dynamic URLs by variable name, then within the ordered variables by their values. Pick the first URL from each set of identical results as canonical URL. Transfer PageRank etcetera from all dupes to the canonical URL. </p>
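<p>A rough sketch of that canonicalization, with made-up function names. It ignores the [charset] parameter and simply uses Python&#8217;s default string ordering, and it treats the first URL seen in each group as the canonical one:</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def order_arguments(url):
    """Sort the query string by variable name, then by value."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )


def group_duplicates(urls):
    """Map each canonical URL to the list of its duplicates.

    URLs whose ordered form is identical end up in the same group;
    the first URL seen in a group is picked as the canonical one.
    """
    canonical_for = {}  # ordered form -> canonical URL
    dupes = {}          # canonical URL -> duplicates found later
    for url in urls:
        key = order_arguments(url)
        if key in canonical_for:
            dupes[canonical_for[key]].append(url)
        else:
            canonical_for[key] = url
            dupes[url] = []
    return dupes
```

<p>PageRank and other link juice from every URL in a duplicates list would then be transferred to the key it is filed under.</p>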
<p>Lots of sites out there were developed by coders who are utterly challenged by all things SEO. Most Web developers don&#8217;t even know what URL canonicalization means. Those sites suffer from tons of URLs that all serve identical content, just because the query string arguments are put in random order, usually inventing a new sequence for each script, function, or include file. Of course most search engines run highly sophisticated URL canonicalization routines to protect their indexes from too much duplicate content, but those algos can fail because every Web site is different. </p>
<p>I can totally resist suggesting a <code>Canonical-uri /: /Default.asp</code> statement that gathers all IIS default-document-URI maladies. Also, case issues shouldn&#8217;t get fixed with <code>Case-insensitive-uris: /</code> but by the clueless developers in Redmond.</p>
<h3 id="will-my-dreams-come-true">Will all this come true?</h3>
<p>Well, Google has silently started to support REP tags in robots.txt, it totally makes sense both for search engines as well as for Webmasters, and Joe Webmaster&#8217;s life would be way more comfortable having REP tags for robots.txt. </p>
<p>A better question would be &#8220;will search engines implement REP tags for robots.txt in a way that Webmasters can live with?&#8221;. Although Google launched the sitemaps protocol without significant help from the Webmaster community, I strongly feel that they desperately need our support with this move. </p>
<p>Currently it looks like they will fuck up the REP, respectively the robots.txt standard, hence go grab your AdWords rep and choke her/him until s/he promises to involve <a href="http://www.google.com/corporate/execs.html#larry">Larry</a>, <a href="http://www.google.com/corporate/execs.html#sergey">Sergey</a>, <a href="http://mattcutts.com/blog/">Matt</a>, <a href="http://www.bladam.com/">Adam</a>, <a href="http://johnmu.com/">John</a>, and the whole <a href="http://googlewebmastercentral.blogspot.com/2007/12/festivus-for-webmasterus.html">Webmaster Support Team</a> for the sake of common sense and the worldwide Webmaster community. <b>Thank you!</b></p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google to change the Robots Exclusion Protocol again</title>
		<link>http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/</link>
		<comments>http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/#comments</comments>
		<pubDate>Fri, 30 Nov 2007 18:33:22 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[XML-Sitemaps]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/</guid>
		<description><![CDATA[Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), have evolved since 1994. Nowadays we have to deal with a conglomerate of non-binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey &#8220;the standard&#8221;, but they refer to their very own REP implementation. In [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/rep-experiments-google-jumping-the-shark.png" width="250" height="309" align="right" style="margin-left:4px;" alt="Google jumping the shark" title="Google jumping the shark" />Web crawler directives, partly standardized in the Robots Exclusion Protocol (<a href="http://www.robotstxt.org/">REP</a>), have evolved since 1994. Nowadays we have to deal with a <a href="http://sebastians-pamphlets.com/in-need-of-a-web-robot-directives-standard/">conglomerate of non-binding de facto standards and microformats</a>, all of them extended by various organizations. All search engines claim that they obey &#8220;the standard&#8221;, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives that, as a rule, differs from the other players&#8217; sets.</p>
<p>Google is the search engine putting the most effort into Robots Exclusion Protocol (REP) evolvements. Their <a href="http://sitemaps.org/">XML Sitemaps</a>, which handle submissions instead of crawl restrictions, widened the REP&#8217;s scope, the <a href="http://googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html">X-Robots-Tag</a> brought us robots meta tags for non-HTML resources like PDF documents, images or video clips, and with <a href="http://sebastians-pamphlets.com/unavailable_after-is-totally-and-utterly-useless/">Unavailable_after</a> Google made a few clueless news sites happy. With the <a href="http://microformats.org/wiki/rel-nofollow#open_issues">rel-nofollow microformat</a> on the other hand, respectively its <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">sneaky morphing</a> from a <a href="http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html">spam fighting tool</a> to its <a href="http://sebastians-pamphlets.com/googles-5-sure-fire-steps-to-safer-indexing/">current shape</a>, Google made nobody happy. Yahoo contributed the well meant but <a href="http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/">half-assed &#8220;robots-nocontent&#8221; class name</a>, and of course &#8220;noydir&#8221; (it&#8217;s unlikely that any other engine will support those).</p>
<p>Now <a href="http://sebastians-pamphlets.com/about-noindex-crawler-directives-in-robots-txt/">Google is working on new robots.txt syntax</a>, and I am, politely put, <a href="http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/#robots-txt-test-results-2007-11-28">not amused</a>. <a href="http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/#just-in-case-rant">Here is why</a> <b>I fear that Google is going to totally mess up the REP</b>:</p>
<p>Google supports a &#8220;Noindex:&#8221; directive in robots.txt, which is treated as &#8220;Disallow:&#8221;<sup><a href="#noindex-disallow-peculiarities">1)</a></sup>. Of course that&#8217;s an experiment, but if this behavior doesn&#8217;t change we&#8217;ll get a beast that is &#8211;with regard to the confusion it will produce&#8211; way more evil than the <a href="http://sebastians-pamphlets.com/dear-search-engines-please-bury-the-relnofollow-fiasko/">rel-nofollow fiasco</a>.
<ul>
<li>A noindex-alias for disallow makes no sense, even when such syntax errors are out there.</li>
<li>Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It&#8217;s bad enough that most Webmasters still believe that &#8220;Googlebot ranks their stuff&#8221;. (Actually, in some cases it can make sense. For example &#8220;nofollow&#8221; in robots meta tags (or at least for Google in REL attributes too) is both a crawler instruction as well as an indexer directive.)</li>
<li>Noindex and disallow are completely different commands. The REP&#8217;s noindex directive means &#8220;crawl it, follow its links, but don&#8217;t list it on the SERPs&#8221;. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.</li>
</ul>
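<p>To make the difference tangible, here is a small sketch of the two commands&#8217; semantics as described in the list above. The class and field names are invented for illustration; they don&#8217;t reflect any engine&#8217;s actual implementation:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handling:
    fetch: bool          # may the crawler request the URL?
    follow_links: bool   # can links found on the page be used?
    list_on_serps: bool  # may the URL show up in search results?


# REP semantics as described above, not Google's experimental behavior.
REP_SEMANTICS = {
    # "crawl it, follow its links, but don't list it on the SERPs"
    "noindex": Handling(fetch=True, follow_links=True, list_on_serps=False),
    # No fetching, hence no links to follow, but the URL itself can still
    # get indexed based on inbound links or directory listings.
    "disallow": Handling(fetch=False, follow_links=False, list_on_serps=True),
}

for command, handling in REP_SEMANTICS.items():
    print(command, handling)
```

<p>The two rows disagree in every column, which is why an alias makes no sense.</p>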
<p><b>Standards should be clear and unambiguous.</b> Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they have the power to fuck up the REP, but they also have &#8220;don&#8217;t be evil&#8221;.</p>
<p>Considering that Google is run by a bunch of smart engineers, I hope that they&#8217;ll do the right thing eventually. The right thing in this case is giving more power to REP evolvements, before questionable and selfish anti-search initiatives like <a href="http://www.the-acap.org/project_documents/ACAP-TF-CrawlerCommunications-Part1-V1.0.pdf" rel="nofollow crap" title="ACAP's robots.txt RFC">ACAP ruin both the robots.txt consensus</a> <a href="http://www.the-acap.org/project_documents/ACAP-TF-CrawlerCommunications-Part2-V1.0.pdf" rel="nofollow crap" title="ACAP's robots meta tag RFC">as well as the robots meta tag standard</a>.</p>
<p>My idea of <b>more power to REP evolvements</b> is:
<ul>
<li><b>Sensible implementation of crawler/indexer-directives adapted from REP <em>tags</em>&nbsp; in robots.txt.</b> Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully <a href="http://sebastians-pamphlets.com/nopreview-the-missing-x-robots-tag/">nopreview</a>) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header&#8217;s X-Robots-Tag.</li>
<li><b>Implementation of block-level directives in robots.txt.</b> Allowing Webmasters to apply crawler instructions like &#8220;noindex&#8221; or &#8220;nofollow&#8221; to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eliminate many points of failure.</li>
<li>Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.</li>
</ul>
<p>The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the bigger engines) talk with Webmasters and then launch the necessary REP extensions. The other engines will follow sooner or later.  The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers. </p>
<hr color="silver" width="128" align="left" />
<sup id="noindex-disallow-peculiarities">1)</sup>&nbsp;<em>Noindex:</em> is not a plain <em>Disallow:</em>, there&#8217;s an interesting difference. In Google&#8217;s experiment both directives block crawling, but <em>Disallow:</em> allows URL-indexing based on 3rd party information, and <em>Disallow:</em>&#8216;ed URLs can accumulate PageRank from internal as well as external links. <em>Noindex:</em>&#8216;ed URLs on the other hand will not appear on SERPs as URL-only listing or with an ODP title and snippet, and I&#8217;m quite sure that they will not gather PageRank nor other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google&#8217;s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that&#8217;s a neat idea, but its implementation is flawed because it doesn&#8217;t consider the implicit <em>Follow:</em> (that&#8217;s syntax Google doesn&#8217;t support in robots.txt). A better way to mark site areas which shall not gather PageRank without raping the REP would be a <em>Norank:</em> directive or so. <em>Noindex:</em> without a <em>Nofollow:</em> must not block crawling. Googlebot must fetch those URLs to follow their links.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Blogger abuses rel-nofollow due to ignorance</title>
		<link>http://sebastians-pamphlets.com/blogger-abuses-rel-nofollow-due-to-ignorance/</link>
		<comments>http://sebastians-pamphlets.com/blogger-abuses-rel-nofollow-due-to-ignorance/#comments</comments>
		<pubDate>Sat, 09 Jun 2007 00:09:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Blogger]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/blogger-abuses-rel-nofollow-due-to-ignorance/</guid>
		<description><![CDATA[I had planned a full upgrade of this blog to the newest blogger version this weekend. The one and only reason to do the upgrade was the idea that I perhaps could disable the auto-nofollow functionality in the comments. Well, what I found was a way to dofollow the author&#8217;s link by editing the &#60;dl [...]]]></description>
			<content:encoded><![CDATA[<p>I had planned a full upgrade of this blog to the newest blogger version this weekend. The one and only reason to do the upgrade was the idea that I perhaps could disable the auto-nofollow functionality in the comments. Well, what I found was a way to <a href="http://betabloggerfordummies.blogspot.com/2007/03/remove-nofollow-attribute-on-comments.html">dofollow the author&#8217;s link</a> by editing the <code style="color:black;">&lt;dl id='comments-block'&gt;</code> block, but I couldn&#8217;t figure out how to disable the auto-nofollow in embedded links. </p>
<p>Considering the hassles of converting all the template hacks into the new format, and the risk of most probably losing the ability to edit code my way, I decided to stick with the old template. It just makes no sense for me to dofollow the author&#8217;s link, when a comment author&#8217;s links within the content get nofollow&#8217;ed automatically. <a href="http://andybeard.eu/2007/03/how-to-remove-nofollow-on-new-blogger-dofollow-on-blogspot.html">Andy Beard</a> and others will hate me now, so let me explain why I don&#8217;t move this blog to my own domain using less insane software like WordPress.
<ul>
<li>I own or write for various WordPress blogs. Google&#8217;s time to index for posts and updates from this blogspot thingy is 2-3 hours (Web search, <b>not</b> blog search). My WordPress blogs, even with higher PageRank, suffer from a way longer time to index.</li>
<li>I can&#8217;t afford the time to convert and redirect 150 posts to another blog.</li>
<li>I hope that Google/Blogger can implement reasonable change requests (most probably that&#8217;s just wishful thinking).</li>
</ul>
<p>That said, WordPress is way better software than Blogger. I&#8217;ll have to move this blog if Blogger is not able to fulfill at least my basic needs. I&#8217;ll explain below why I think that Blogger lacks <b>any</b> understanding of the rel-nofollow semantics. In fact, they throw nofollow crap on everything they get a hand on. It seems to me that they won&#8217;t stop jeopardizing the integrity of the Blogosphere (at least where they control the linkage) until they get bashed really hard by a Googler who understands what rel-nofollow is all about. I nominate Matt Cutts, who invented and evolved it, and who does not tolerate BS.</p>
<p>So here is my wishlist. I want (regardless of the template type!)
<ul>
<li>A checkbox &#8220;apply rel=nofollow to comment author links&#8221;</li>
<li>A checkbox &#8220;apply rel=nofollow to links within comment text&#8221;</li>
<li>To edit comments, for example to nofollow links myself, or to remove offensive language</li>
<li>A checkbox &#8220;apply rel=nofollow to links to label/search pages&#8221;</li>
<li>A checkbox &#8220;apply a robots meta tag &#8216;noindex,follow&#8217; to label/search pages&#8221;</li>
<li>A checkbox &#8220;apply rel=nofollow to links to archive pages&#8221;</li>
<li>A checkbox &#8220;apply a robots meta tag &#8216;noindex,follow&#8217; to archive pages&#8221;</li>
<li>A checkbox &#8220;apply rel=nofollow to backlink listings&#8221;</li>
</ul>
<p>As for the comments functionality, I&#8217;d understand when these options get disabled when comment moderation is set to off.</p>
<p>And here are the nofollow-bullshit examples.
<ul>
<li>When comment moderation and captchas are activated, why are comment author links as well as links within the comments nofollow&#8217;ed? Does blogger think their bloggers are minor retards? I mean, when I approve a comment, then I do vouch for it. But wait! I can&#8217;t edit the comment, so a low-life link might slip through. Ok, then let me edit the comments.</li>
<p>
<li>When I&#8217;ve submitted a comment, the link to the post is nofollowed. <a href="http://www.smart-it-consulting.com/img/misc/nofollow-insane-submitted-post.png"><img src="http://www.smart-it-consulting.com/img/misc/nofollow-insane-submitted-post.png" width="99%" border="0" alt="Nofollow insane II." title="Nofollow insane II." /></a>This page belongs to the blog, so why the fudge does Blogger nofollow navigational links? And if it makes sense for a weird reason not understandable by a simple webmaster like me, why is the link to the blog&#8217;s main page as well as the link to the post one line below not nofollow&#8217;ed? Linking to the same URL with <b>and</b> without rel-nofollow on the same page deserves a bullshit award.</li>
<p>
<li><a href="http://www.smart-it-consulting.com/img/misc/nofollow-insane-dashboard-blogs-of-note.png"><img src="http://www.smart-it-consulting.com/img/misc/nofollow-insane-dashboard-blogs-of-note.png"  width="101" height="194" border="0" alt="Nofollow insane III. (dashboard)" title="Nofollow insane III. (dashboard)" align="left" /></a>On my <a href="http://www.blogger.com/home">dashboard</a> Blogger features a few blogs as &#8220;Blogs Of Note&#8221;, all links nofollow&#8217;ed. These are blogs recommended by the Blogger crew. That means they have reviewed them and the links are clearly editorial content. They&#8217;re <a href="http://buzz.blogger.com/2007/04/blogs-of-note-1000-and-counting.html">proud of it</a>: &#8220;we&#8217;ve done a pretty good job of publishing a new one each day&#8221;. Blogger&#8217;s very own <a href="http://blogsofnote.blogspot.com/">Blogs Of Note</a> blog does not nofollow the links, and that&#8217;s correct. </p>
<p>So why the heck are these recommended blogs nofollow&#8217;ed on the dashboard? <a href="http://www.smart-it-consulting.com/img/misc/nofollow-insane-blogspot-blogs-of-note.png"><img src="http://www.smart-it-consulting.com/img/misc/nofollow-insane-blogspot-blogs-of-note.png" width="99%" border="0" alt="Nofollow insane III. (blogspot)" title="Nofollow insane III. (blogspot)" /></a></li>
<p>
<li>Blogger inserted robots meta tags &#8220;nofollow,noindex&#8221; on each and every blog hosted outside the controlled blogspot.com domain <a href="http://www.seroundtable.com/archives/012446.html">earlier this year</a>.</li>
<p>
<li>Blogger inserted robots meta tags &#8220;nofollow,noindex&#8221; on Google blogs <a href="http://sebastianx.blogspot.com/2007/06/google-nofollows-itself.html">a few days ago</a>.</li>
<p></ul>
<p>If Blogger&#8217;s <a href="http://www.blogger.com/about">recommendation</a> &#8220;Check google.com. (Also good for searching.)&#8221; is an honest one, why don&#8217;t they invest a few minutes to educate themselves on rel-nofollow? I mean, it&#8217;s a Google-block/avoid-indexing/ranking-thingy they use to prevent Google.com users from finding valuable content hosted on their own domains. And they annoy me. And they insult their users. They shouldn&#8217;t do that. That&#8217;s not smart. That&#8217;s not Google-ish.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/blogger-abuses-rel-nofollow-due-to-ignorance/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google nofollow&#8217;s itself</title>
		<link>http://sebastians-pamphlets.com/google-nofollows-itself/</link>
		<comments>http://sebastians-pamphlets.com/google-nofollows-itself/#comments</comments>
		<pubDate>Mon, 04 Jun 2007 21:55:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Fun]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-nofollows-itself/</guid>
		<description><![CDATA[Awesome. Nofollow-insane at its best. Check the source of Google&#8217;s Webmaster Blog. In HEAD you&#8217;ll find an insane meta tag:&#60;meta name=&#8221;ROBOTS&#8221; content=&#8221;NOINDEX,NOFOLLOW&#8221; /&#62;
Well, that&#8217;s one of many examples. Read the support forums. Another case of Google nofollow&#8217;ing herself: 
Matt thought that all teams understood the syntax and semantics of rel-nofollow. It seems to me that&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Awesome. Nofollow-insane at its best. Check the source of <a href="http://googlewebmastercentral.blogspot.com/">Google&#8217;s Webmaster Blog</a>. In HEAD you&#8217;ll find an insane meta tag:<br />&lt;meta name=&#8221;ROBOTS&#8221; content=&#8221;NOINDEX,NOFOLLOW&#8221; /&gt;</p>
<p>Well, that&#8217;s one of many examples. Read the support forums. Another case of Google nofollow&#8217;ing herself: <a href="http://www.smart-it-consulting.com/img/misc/google-webmaster-help-fofollows-itself.png"><img src="http://www.smart-it-consulting.com/img/misc/google-webmaster-help-fofollows-itself.png" border="0" alt="Google fun" title="Google nofollow'ing herself" width="98%"></a></p>
<p>Matt thought that all teams understood the syntax and semantics of rel-nofollow. It seems to me that&#8217;s not the case. I really can&#8217;t blame Googlers applying rel-nofollow or even nofollow/noindex meta tags to everything they get a hand on. It is not understandable. It&#8217;s not useable. It&#8217;s misleading. It&#8217;s confusing. It should get <a href="http://sebastianx.blogspot.com/2007/01/dear-search-engines-please-bury.html">buried</a> asap. </p>
<p>Hat tip to <a href="http://www.jlh-design.com/">John</a> (<a href="http://www.jlh-design.com/2007/06/34499998-pages-went-supplemental/">JLH&#8217;s post</a>).</p>
<p>Update 1: A friendly Googler just told me that a Blogger glitch (affecting <b>only</b> Google blogs) inserted the crawler-unfriendly meta element; it should be fixed soon. I thought this bug was fixed months ago  <code>... if page.isPrivate == <b>true</b> by mistake then insert &#8220;&lt;meta content=&#8217;NOINDEX,NOFOLLOW&#8217; name=&#8217;ROBOTS&#8217; /&gt;&#8221; &#8230; (made up)</code></p>
<p>Update 2: The &#8216;noindex,nofollow&#8217; robots meta tag is gone now, and the <a href="http://googlewebmastercentral.blogspot.com/">Webmaster Central Blog</a> got a neat new logo: <br /><img src="http://photos1.blogger.com/x/blogger2/6495/3914/1600/z/104716/gse_multipart40542.png" width="99%" border="0" alt="Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index" title="Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index" /> (I&#8217;d add <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769" title="Make sure that your TITLE and ALT tags are descriptive and accurate">ALT and TITLE text</a>: <code>alt="Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index" title="Official news on crawling and indexing sites for the Google index"</code>)</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-nofollows-itself/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Yahoo! search going to torture Webmasters</title>
		<link>http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/</link>
		<comments>http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/#comments</comments>
		<pubDate>Wed, 02 May 2007 21:56:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Copy+Paste-Penalties]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[Microformats]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/</guid>
		<description><![CDATA[According to Danny Yahoo! supports a multi-class nonsense called robots-nocontent tag. CRAP ALERT!
Can you senseless and cruel folks at Yahoo!-search imagine how many of my clients who&#8217;d like to use that feature have copied and pasted their pages? Do you have a clue how many sites out there don&#8217;t make use of SSI, PHP or ASP [...]]]></description>
			<content:encoded><![CDATA[<p>According to <a href="http://searchengineland.com/070502-132315.php">Danny</a> Yahoo! supports a <a href="http://www.ysearchblog.com/archives/000444.html">multi-class nonsense called <em>robots-nocontent</em> tag</a>. CRAP ALERT!</p>
<p>Can you senseless and cruel folks at Yahoo!-search imagine how many of my clients who&#8217;d like to use that feature have copied and pasted their pages? Do you have a clue how many sites out there don&#8217;t make use of SSI, PHP or ASP includes, and how many sites never heard of dynamic content delivery, respectively how many sites can&#8217;t use proper content delivery techniques because they have to deal with legacy systems and ancient business processes? Did you ask how common templated Web design is, and I mean the weird static variant, where a new page gets built from a randomly selected source page saved as new-page.html?</p>
<p>It&#8217;s great that you came out with a bastardized copy of Google&#8217;s somewhat hapless (in the sense of cluttering structured code) section targeting, because we dreadfully need that functionality across all engines. And I admit that your approach is a little better than AdSense section targeting because you don&#8217;t mark payload by paydirt in comments. But why the heck did you design it that crappy? The unthoughtful <a href="http://microformats.org/wiki/robots-exclusion">draft of a microformat</a> from what you&#8217;ve &#8220;stolen&#8221; that unfortunate idea didn&#8217;t become a standard for <a href="http://microformats.org/wiki/robots-exclusion#Issues">very good reasons</a>. Because it&#8217;s crap. Assigning multiple class names to markup elements for the sole purpose of setting crawler directives is as crappy as inline style assignments.</p>
<p>Well, due to my zero-bullshit tolerance I&#8217;m somewhat upset, so I repeat: Yahoo&#8217;s robots-nocontent class name is crap by design. Don&#8217;t use it, boycott it, because if you make use of it you&#8217;ll change gazillions of files for each and every proprietary syntax supported by a single search engine in the future. When the united search geeks can agree on flawed standards like rel-nofollow, they should be able to talk about a sensible evolvement of robots.txt. </p>
<p>There&#8217;s a way easier solution, which doesn&#8217;t require editing tons of source files, that is standardizing CSS-like syntax to assign crawler directives to existing classes and DOM-IDs. For example <a href="http://sebastianx.blogspot.com/2007/04/in-need-of-web-robot-directives.html">extent robots.txt syntax</a> like:</p>
<p><code>A.advertising { rel: nofollow; } /* devalue aff links */</code></p>
<p><code>DIV.hMenu, TD#bNav { content:noindex; rel:nofollow; } /* make site wide links unsearchable */</code></p>
<p><b>Unsupported robots.txt syntax does no harm, proprietary attempts do!</b></p>
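<p>To make the idea a bit more tangible, here&#8217;s a minimal Python sketch of how a crawler could read such rules. Mind that the syntax is only the proposal from above and no engine supports it; the function name <code>parse_directives</code> and the regular expressions are made up for illustration, and a real parser would need proper selector handling.</p>
<pre>
```python
import re

# Hypothetical CSS-like crawler directives as proposed in this post.
# No search engine supports this syntax; the names below are illustrative only.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def parse_directives(text):
    """Return a dict mapping each selector to its {property: value} directives."""
    rules = {}
    # Strip /* ... */ comments first, then read "selectors { declarations }" blocks.
    for selectors, body in RULE.findall(COMMENT.sub("", text)):
        props = {}
        for declaration in body.split(";"):
            if ":" in declaration:
                name, value = declaration.split(":", 1)
                props[name.strip().lower()] = value.strip().lower()
        # A comma separated selector list shares the same declarations.
        for selector in selectors.split(","):
            rules.setdefault(selector.strip(), {}).update(props)
    return rules


example = """
A.advertising { rel: nofollow; } /* devalue aff links */
DIV.hMenu, TD#bNav { content:noindex; rel:nofollow; } /* make site wide links unsearchable */
"""

for selector, props in parse_directives(example).items():
    print(selector, props)
```
</pre>
<p>Each selector ends up with its own set of directives, so a crawler could look them up while parsing a page, and nobody would have to touch the markup itself.</p>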
<p>Dear search engines, get together and define something useful, before each of you comes out with different half-baked workarounds like section targeting or robots-nocontent class values. Thanks!</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How Google &#38; Yahoo handle the link condom</title>
		<link>http://sebastians-pamphlets.com/how-google-yahoo-handle-the-link-condom/</link>
		<comments>http://sebastians-pamphlets.com/how-google-yahoo-handle-the-link-condom/#comments</comments>
		<pubDate>Fri, 27 Apr 2007 19:51:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-google-yahoo-handle-the-link-condom/</guid>
		<description><![CDATA[Loren Baker over at SEJ got a few official statements on use and abuse of the rel-nofollow microformat by the major players: How Google, Yahoo &#038; Ask treat NoFollow&#8217;ed links. Great job, thanks!
Ask doesn&#8217;t &#8220;officially&#8221; support nofollow, whatever that means. Loren didn&#8217;t ask MSN, probably because he didn&#8217;t expect that they&#8217;ve even noticed that they [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.bumpzee.com/users/view/lorenbaker/">Loren Baker</a> over at <a href="http://www.searchenginejournal.com/">SEJ</a> got a few official statements on <a href="http://sebastianx.blogspot.com/2007/01/dear-search-engines-please-bury.html">use and abuse</a> of the <a href="http://microformats.org/wiki/rel-nofollow">rel-nofollow microformat</a> by the major players: <a href="http://www.searchenginejournal.com/how-google-yahoo-askcom-treat-the-no-follow-link-attribute/4801/">How Google, Yahoo &#038; Ask treat NoFollow&#8217;ed links</a>. Great job, thanks!</p>
<p>Ask doesn&#8217;t &#8220;officially&#8221; support nofollow, whatever that means. Loren didn&#8217;t ask MSN, probably because he didn&#8217;t expect them to have even noticed that they <a href="http://blogs.msdn.com/livesearch/archive/2005/01/18/nofollow_tags.aspx">have officially supported nofollow since 2005</a>, <a href="http://blogs.msdn.com/livesearch/archive/2007/04/11/discovering-sitemaps.aspx">same procedure</a> with <a href="http://blogs.msdn.com/livesearch/archive/2006/11/15/microsoft-google-yahoo-unite-to-support-sitemaps.aspx">sitemaps</a> by the way. Yahoo implemented it according to the specs, and <a href="http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html">Google</a> stepped way over the line the norm sets. So here is the difference:</p>
<p>1. Do you follow a nofollow&#8217;ed link?<br />Google: No (longer)<br />Yahoo: Yes</p>
<p>2. Do you index the linked page following a nofollow&#8217;ed link?<br />Google: Obsolete, see 1.<br />Yahoo: Yes</p>
<p>3. Do your ranking algos factor in reputation, anchor/alt/title text or whichever link love sourced from a nofollow&#8217;ed link?<br />Google: Obsolete, see 1.<br />Yahoo: No</p>
<p>4. Do you show nofollow&#8217;ed links in reverse citation results?<br />Google: Yes (in link: searches by accident, in Webmaster Central if the source page didn&#8217;t make it into the supplemental index)<br />Yahoo: Yes (Site Explorer)</p>
<p>Q&#038;A#4 is made up but accurate. I think it&#8217;s safe to assume that MSN handles the link condom like Yahoo. (Update: As Loren clarifies in the <a href="http://www.searchenginejournal.com/how-google-yahoo-askcom-treat-the-no-follow-link-attribute/4801/#comment-454472">comments</a>, he asked MSN search but they didn&#8217;t answer in a timely fashion.)</p>
<p>And here&#8217;s a remarkable statement from Google&#8217;s search evangelist Adam Lasnik, <a href="http://sebastianx.blogspot.com/2007/03/does-adam-lasnik-like-relnofollow-or.html">who may like nofollow or not</a>:</p>
<blockquote><p>On a related note, though, and echoing Matt&#8217;s earlier sentiments &#8230; we hope and expect that more and more sites &#8212; <b>including Wikipedia</b> &#8212; will adopt a less-absolute approach to no-follow &#8230; expiring no-follows, not applying no-follows to trusted contributors, and so on.</p></blockquote>
<p>Bravo!</p>
<p>Related link: <a href="http://www.seo-blog.com/rel-nofollow.php">rel=&#8221;nofollow&#8221; Google, Yahoo and MSN</a></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-google-yahoo-handle-the-link-condom/feed/</wfw:commentRss>
		</item>
		<item>
		<title>In need of a &#34;Web-Robot Directives Standard&#34;</title>
		<link>http://sebastians-pamphlets.com/in-need-of-a-web-robot-directives-standard/</link>
		<comments>http://sebastians-pamphlets.com/in-need-of-a-web-robot-directives-standard/#comments</comments>
		<pubDate>Thu, 12 Apr 2007 11:56:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[XML-Sitemaps]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Microformats]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/in-need-of-a-web-robot-directives-standard/</guid>
		<description><![CDATA[The Robots Exclusion Protocol from 1994 gets used and abused, best described by Lisa Barone citing Dan Crow from Google: &#8220;everyone uses it but everyone uses a different version of it&#8221;. De facto we have a Robots Exclusion Standard covering crawler directives in robots.txt and robots meta tags as well, said Dan Crow. Besides non-standardized directives [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.robotstxt.org/wc/exclusion.html">Robots Exclusion Protocol</a> from 1994 gets used and abused, best described by <a href="http://www.bruceclay.com/blog/archives/2007/04/robotstxt_summi.html">Lisa Barone</a> citing Dan Crow from Google: &#8220;everyone uses it but everyone uses a different version of it&#8221;. De facto we have a Robots Exclusion Standard covering crawler directives in <a href="http://www.smart-it-consulting.com/article.htm?node=140&#038;page=46">robots.txt</a> and <a href="http://www.smart-it-consulting.com/article.htm?node=140&#038;page=47">robots meta tags</a> as well, said Dan Crow. Besides non-standardized directives like &#8220;Allow:&#8221;, <a href="http://www.smart-it-consulting.com/article.htm?node=133&#038;page=38">Google&#8217;s Sitemaps Protocol</a> adds more inclusion to the mix, now even closely <a href="http://sebastianx.blogspot.com/2007/04/xml-sitemap-auto-discovery.html">bundled with robots.txt</a>. There are more ways to place crawler directives: unstructured (in the sense of independence from markup elements) as with <a href="https://www.google.com/adsense/support/bin/answer.py?answer=23168">Google&#8217;s section targeting</a>, on link level by applying the <a href="http://sebastianx.blogspot.com/2007/01/dear-search-engines-please-bury.html">commonly disliked</a> <a href="http://microformats.org/wiki/rel-nofollow">rel-nofollow microformat</a> or <a href="http://gmpg.org/xfn/">XFN</a>, and related <a href="http://microformats.org/wiki/robots-exclusion">thoughts on block level directives</a>.</p>
<p>All in all that&#8217;s a pretty confusing conglomerate of inclusion and exclusion, utilizing many formats and markup elements, and lots of places to put crawler directives. Not really the sort of norm the webmaster community can successfully work with. No wonder that over 75,000 robots.txt files have pictures in them, that less than 35 percent of servers have a robots.txt file, that the average robots.txt file is 23 characters (&#8220;User-agent: * Disallow:&#8221;), and that gazillions of Web pages carry useless and unsupported meta tags like &#8220;<a href="http://code.google.com/webstats/2005-12/metadata.html">revisit-after</a>&#8221; &#8230; for more funny stats and valuable information see Lisa&#8217;s <a href="http://www.bruceclay.com/blog/archives/2007/04/robotstxt_summi.html">robots.txt summit coverage</a> (SES NY 2007), also covered by <a href="http://www.seroundtable.com/archives/013068.html">Tamar</a> (read both!).</p>
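<p>By the way, that average 23-character file excludes nothing at all. Here&#8217;s a small sketch using the <code>urllib.robotparser</code> module from Python&#8217;s standard library to show it; the user agent name &#8220;ExampleBot&#8221; and the example.com URLs are placeholders, and the parser only understands the classic directives, not the extensions discussed in this post.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser

# The 23-character robots.txt quoted above: an empty Disallow blocks nothing.
minimal = "User-agent: *\nDisallow:"
print(len(minimal))

parser = RobotFileParser()
parser.parse(minimal.splitlines())
print(parser.can_fetch("ExampleBot", "http://example.com/any/page.html"))

# For contrast, a file that actually excludes a directory.
restrictive = RobotFileParser()
restrictive.parse(["User-agent: *", "Disallow: /private/"])
print(restrictive.can_fetch("ExampleBot", "http://example.com/private/x.html"))
print(restrictive.can_fetch("ExampleBot", "http://example.com/public/x.html"))
```
</pre>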
<h3>How to structure a &#8220;Web-Robot Directives Standard&#8221;?</h3>
<p>To handle redundancies as well as cascading directives properly, we need a clear and understandable chain of command. The following is just a first idea off the top of my head, and will likely get updated soon:</p>
<ul style="margin-left:0px; padding-left:0px;">
<li>
<h4>Robots.txt</h4>
</li>
<li>
<ol>
<li>Disallows directories, files/file types, and URI fragments like query string variables/values by user agent.</li>
<li>Allows sub-directories, file names and URI fragments to refine Disallow statements.</li>
<li>Gives general directives like crawl frequency or volume per day, maybe even per folder, and restricts crawling in particular time frames.</li>
<li>References general XML sitemaps accessible to all user agents, and specific XML sitemaps addressing particular user agents as well.</li>
<li>Sets site-level directives like &#8220;noodp&#8221; or &#8220;noydir&#8221;.</li>
<li>Predefines page-level instructions like &#8220;nofollow&#8221;, &#8220;nosnippet&#8221; or &#8220;noarchive&#8221; by directory, document type or URL fragments.</li>
<li>Predefines block-level or element-level conditions like &#8220;noindex&#8221; or &#8220;nofollow&#8221; on class names or DOM-IDs by markup element. For example &#8220;DIV.hMenu,TD#bNav &#8216;noindex,nofollow&#8217;&#8221; could instruct crawlers to ignore the horizontal menu as well as navigation at the very bottom on all pages.</li>
<li>Predefines attribute-level conditions like &#8220;nofollow&#8221; on A elements. For example &#8220;A.advertising REL &#8216;nofollow&#8217;&#8221; could tell crawlers to ignore links in ads, or &#8220;P#tos &gt; A &#8216;nofollow&#8217;&#8221; could instruct spiders to ignore links in TOS excerpts found on every page in a P element with the DOM-ID &#8220;tos&#8221;.</li>
</ol>
<ul>
<li>
<h4>XML Sitemaps</h4>
</li>
<li>
<ol>
<li>Since robots.txt deals with inclusion now, why not add an optional URL specific &#8220;action&#8221; element allowing directives like &#8220;nocache&#8221; or &#8220;nofollow&#8221;? Also a &#8220;delete&#8221; directive to get outdated pages removed from search indexes would make sense.</li>
<li>To make XML sitemap data reusable, and to allow centralized maintenance of page meta data, a couple of new optional URL elements like &#8220;title&#8221;, &#8220;description&#8221;, &#8220;document type&#8221;, &#8220;language&#8221;, &#8220;charset&#8221;, &#8220;parent&#8221; and so on would be a neat addition. This way it would be possible to visualize XML sitemaps as native (and even hierarchical) site maps.</li>
</ol>
<p>Robots.txt exclusions overrule URLs listed for inclusion in XML sitemaps.</p>
<ul>
<li>
<h4>Meta Tags</h4>
</li>
<li>Page meta data overrule directives and information provided in robots.txt and XML sitemaps. Empty contents in meta tags suppress directives and values given in upper levels. Non-existent meta tags implicitly apply data and instructions from upper levels. The same goes for everything below.
<ul>
<li>
<h4>Body Sections</h4>
</li>
<li>Unstructured parenthesizing of parts of code is certainly not feasible in XMLish documents, but may be a pragmatic procedure to deal with legacy code. Paydirt in HTML comments may be allowed to <a href="https://www.google.com/adsense/support/bin/answer.py?answer=23168">mark payload for contextual advertising purposes</a>, but it&#8217;s hard to standardize. Let&#8217;s leave that for proprietary usage.
<ul>
<li>
<h4>Body Elements</h4>
</li>
<li>Implementing a new attribute for messages to machines should be avoided for several good reasons. Classes are additive, so multiple values can be specified for most elements. That would allow putting standardized directives into class names, for example class=&#8221;menu robots-noindex googlebot-nofollow slurp-index-follow&#8221; where the first class addresses CSS. Such inline robot directives come with the same disadvantages as inline style assignments and open a <a href="http://microformats.org/wiki/robots-exclusion#Issues">can of worms</a>, so to speak. Using classes and DOM-IDs just as a reference to user agent specific instructions given in robots.txt is surely the preferable procedure.
<ul>
<li>
<h4>Element Attributes</h4>
</li>
<li>More or less this level is a playground for <a href="http://microformats.org/">microformats</a> utilizing the A element&#8217;s REV and REL attributes.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>Besides the common values &#8220;nofollow&#8221;, &#8220;noindex&#8221;, &#8220;noarchive&#8221;/&#8221;nocache&#8221; etc. and their omissible positive defaults &#8220;follow&#8221; and &#8220;index&#8221; etc., we&#8217;d need a couple more, for example &#8220;unapproved&#8221;, &#8220;untrusted&#8221;, &#8220;ignore&#8221; or &#8220;skip&#8221; and so on. There&#8217;s a lot of work to do.</p>
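<p>The chain of command outlined above can be sketched in a few lines of Python. This is only one possible reading of the cascade, under the assumption that a more specific level replaces (rather than merges with) whatever it inherits; the level names and the <code>resolve</code> function are made up for illustration, and the special rule that robots.txt exclusions overrule sitemap inclusions is not modeled.</p>
<pre>
```python
# Levels in the proposed chain of command, from most general to most specific.
LEVELS = ["robots.txt", "sitemap", "meta", "section", "element", "attribute"]


def resolve(directives_by_level):
    """Cascade directives down the chain of command.

    A missing level (or None) inherits from the levels above it, an empty set
    suppresses everything inherited, and a non-empty set overrules it.
    Assumption: "overrule" means replace, not merge.
    """
    effective = frozenset()
    for level in LEVELS:
        given = directives_by_level.get(level)
        if given is not None:
            effective = frozenset(given)
    return effective


page = {
    "robots.txt": {"noodp", "noarchive"},
    "meta": set(),  # empty meta tag: suppress the inherited directives
    "element": {"noindex", "nofollow"},
}
print(sorted(resolve(page)))
print(sorted(resolve({"robots.txt": {"noodp"}})))
```
</pre>
<p>Whether lower levels should replace or extend the inherited directives is exactly the kind of question a standard would have to settle.</p>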
<p>In terms of complexity, a mechanism as outlined above should be as easy to use as CSS in combination with client-side scripting for visualization purposes.</p>
<p>However, whatever better ideas are out there, we need a widely accepted &#8220;Web-Robot Directives Standard&#8221; as soon as possible.</p>
<p><span style="font-family:arial;font-size:78%;">Tags: <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/Search+Engine+Optimization" rel="tag">Search Engine Optimization</a> (<a style="TEXT-DECORATION: none" href="http://technorati.com/tag/SEO" rel="tag">SEO</a>) <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/robots.txt" rel="tag">Robots.txt</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/xml+sitemaps" rel="tag">XML Sitemaps</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/meta-tags" rel="tag">Meta Tags</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/crawler+directives" rel="tag">Crawler Directives</a> </span></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/in-need-of-a-web-robot-directives-standard/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google going to revamp the rel=nofollow microformat?</title>
		<link>http://sebastians-pamphlets.com/google-going-to-revamp-the-relnofollow-microformat/</link>
		<comments>http://sebastians-pamphlets.com/google-going-to-revamp-the-relnofollow-microformat/#comments</comments>
		<pubDate>Sat, 17 Feb 2007 14:01:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Google]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-going-to-revamp-the-relnofollow-microformat/</guid>
		<description><![CDATA[I&#8217;ve asked Adam Lasnik, Google&#8217;s search evangelist:
Adam, what is Google&#8217;s take on extending the nofollow functionality by working out a microformat that covers the existing mechanism w/o being that unclear and confusing, and which takes care of similar needs like section targeting on element level and qualified votes as well?
and he answered
Sebastian, nothing&#8217;s set in [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve <a href="http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/7f1d9eba4852e319/e250d5bdb52a22b7" style="text-decoration:none;">asked</a> Adam Lasnik, Google&#8217;s search evangelist:</p>
<blockquote><p>Adam, what is Google&#8217;s take on extending the nofollow functionality by working out a microformat that covers the existing mechanism w/o being that unclear and confusing, and which takes care of similar needs like section targeting on element level and qualified votes as well?</p></blockquote>
<p style="display:inline;">and he answered</p>
<blockquote><p><strong>Sebastian, nothing&#8217;s set in stone. Stuff is likely to evolve <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </strong></p></blockquote>
<p>That&#8217;s an elating signal, thank you Adam. And it leads to a bunch of questions. </p>
<p>Will Google continue to cook nofollow in its secret sauce, revealing morphed semantics (affiliate links), unpopular areas of application (paid links) and changed functionality (no longer fetching the linked resource) every now and then? From my interpretation of Google&#8217;s ongoing move to candidness I guess not. </p>
<p>Will Google gather a couple search companies to work out a new standard? I hope not, it would be a mistake not to involve content providers, webmasters, publishers, CMS vendors, even SEOs and opinion makers again.</p>
<p>Will Google ask for input? Will the process of defining a <b>standard for micro crawler directives</b> be an open and public discussion? Are we talking about an extended microformat, limited to the A element&#8217;s rel and rev attributes, or does Google think of a broader approach covering for example section targeting and other crawler directives in class attributes on block level too? Will a new or more powerful <a href="http://microformats.org/wiki/rel-nofollow" rel="tag">standard</a> interfere with other norms like <a href="http://microformats.org/wiki/rel-nofollow" rel="tag">tag</a>, <a href="http://microformats.org/wiki/vote-links" rel="tag">vote links</a>, <a href="http://gmpg.org/xfn/" rel="tag">XFN</a>, or drafts like the not yet that comprehensive <a href="http://microformats.org/wiki/robots-exclusion" rel="tag">robots-exclusion</a> microformat (also badly named because it covers inclusion too)? By the way, the links above lead you to interesting thoughts on reach, functionality and implementation of an extended norm replacing nofollow, and I, like many of you, have a couple more ideas and concepts in mind.</p>
<p>I take Adam&#8217;s tidbit as a call for participation. Dear no-to-nofollow-sayers and nofollow-supporters out there, join the crowd at the white board! Throw in your thoughts, concepts, wishes and ideas.</p>
<p>In the meantime make use of this <a href="http://andybeard.eu/2007/02/ultimate-list-of-dofollow-plugins-banish-nofollow-from-comments-and-trackbacks.html">catalogue of do-follow plugins</a>.</p>
<p><span style="font-family:arial;font-size:78%;">Tags: <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/Search+Engine+Optimization" rel="tag">Search Engine Optimization</a> (<a style="TEXT-DECORATION: none" href="http://technorati.com/tag/SEO" rel="tag">SEO</a>) <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/link+condom" rel="tag">Link-Condom</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/rel=nofollow" rel="tag">rel=nofollow</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/google" rel="tag">Google</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/yahoo" rel="tag">Yahoo</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/msn" rel="tag">MSN</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-going-to-revamp-the-relnofollow-microformat/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Dear search engines, please bury the rel=nofollow-fiasko</title>
		<link>http://sebastians-pamphlets.com/dear-search-engines-please-bury-the-relnofollow-fiasko/</link>
		<comments>http://sebastians-pamphlets.com/dear-search-engines-please-bury-the-relnofollow-fiasko/#comments</comments>
		<pubDate>Wed, 31 Jan 2007 14:36:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/dear-search-engines-please-bury-the-relnofollow-fiasko/</guid>
		<description><![CDATA[The misuse of the rel=nofollow initiative is getting out of control. Invented to fight comment spam, nowadays it is applied to commercial links, biased editorial links, navigational links, links to worst enemies (funny example: Matt Cutts links to a SEO-Blackhat with rel=nofollow) and whatever else. Gazillions of publishers and site owners add it to their [...]]]></description>
			<content:encoded><![CDATA[<p>The misuse of the rel=nofollow initiative is getting out of control. Invented to fight comment spam, nowadays it is applied to commercial links, biased editorial links, navigational links, links to worst enemies (funny example: <a href="http://www.mattcutts.com/blog/watching-a-story-part-i/">Matt Cutts</a> links to a <a href="http://seoblackhat.com/2006/12/08/google-funding-terrorists/">SEO-Blackhat</a> with rel=nofollow) and whatever else. Gazillions of publishers and site owners add it to their links for the wrong reasons, simply because they don&#8217;t understand its intention, its mechanism, and especially not the ongoing morphing of its semantics. Even professional webmasters and search engine experts have a hard time following the nofollow-beast semantically. The more its initial usage gets diluted, the more folks suspect search engines cook their secret sauce with indigestible nofollow-ingredients.</p>
<p><strong>Not only was rel=nofollow unable to stop blog-spam-bots, it came with a built-in flaw: confusion.</strong></p>
<p>Good news is that currently the nofollow-debate gets stoked again. Threadwatch hosts a thread titled <a href="http://www.threadwatch.org/node/11538">Nofollow&#8217;s Historical Changes and Associated Hypocrisy</a>, folks are ranting on the questionable <a href="http://blog.outer-court.com/archive/2007-01-22-n21.html">Wikipedia decision to nofollow all outbound links</a>, <a href="http://web-professor.net/wp/2007/01/26/google-video-selling-page-rank/">Google video folks manipulated the PageRank algo</a> by plastering most of their links with <a href="http://web-professor.net/wp/2007/01/26/google-video-selling-page-rank/#comment-12560">rel=nofollow by mistake</a>, and even Yahoo&#8217;s top gun <a href="http://jeremy.zawodny.com/blog/archives/006800.html">Jeremy Zawodny has not been that happy with the nofollow-debacle</a> for a while now.</p>
<p><img src="http://www.smart-it-consulting.com/img/misc/say-no-to-nofollow.jpg" border="0" width="70" height="70" alt="Say NO to NOFOLLOW - copyright jlh-design.com" title="Enhance / replace NOFOLLOW now!" align="left" style="margin-top: 6px; margin-right:3px; margin-bottom:3px;" />I say that it is possible to replace the unsuccessful nofollow-mechanism with an understandable and reasonable functionality to allow search engine crawler directives on link level. It can be done although there are shitloads of rel=nofollow links out there. Here is why, and how:</p>
<p>The <b>value</b> &#8220;nofollow&#8221; in the link&#8217;s REL attribute creates misunderstandings, recently even in the inventor&#8217;s company, because it is, hmmm, hapless. </p>
<p>In fact, <a href="http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html">back then</a> it meant &#8220;<a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=100#the-links-load">passnoreputation</a>&#8221; and nothing more. That is, search engines shall follow those links, and they shall index the destination page, and they shall show those links in reverse citation results. They just must not pass any reputation or topical relevancy with that link.</p>
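<p>Expressed as code, that original meaning boils down to one switch. The following Python sketch is purely illustrative: the function <code>link_policy</code> and its result keys are made up here, and it describes the semantics explained above, not how any engine actually implements them.</p>
<pre>
```python
def link_policy(rel_value):
    """Model the original "passnoreputation" reading of rel=nofollow.

    rel_value is the raw content of a link's REL attribute (or None).
    REL holds space separated tokens that are compared case-insensitively.
    """
    tokens = set((rel_value or "").lower().split())
    nofollowed = "nofollow" in tokens
    return {
        "fetch": True,                       # the link still gets followed
        "index_target": True,                # the destination still gets indexed
        "show_in_backlinks": True,           # it still appears in reverse citations
        "pass_reputation": not nofollowed,   # only the vote is withheld
    }


print(link_policy("nofollow"))
print(link_policy("tag"))
```
</pre>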
<p>There were <a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=90#a-rel">microformats</a> better suited to achieve the goal, for example Technorati&#8217;s <a href="http://microformats.org/wiki/votelinks">votelinks</a>, but unfortunately the united search geeks have chosen a value adapted from the robots exclusion standard, which is plain misleading because it has absolutely nothing to do with its (intended) core functionality.</p>
<p>I can think of cases where a <b>real</b> nofollow-directive for spiders on link level makes perfect sense. It could tell the spider not to fetch a particular link destination, even if the page&#8217;s robots tag says &#8220;follow&#8221;, for example printer friendly pages. I&#8217;d use an &#8220;ignore this link&#8221; directive for example in crawlable horizontal popup menus to avoid theme dilution when every page of a section (or site) links to every other page. Actually, there is more need for spider directives on HTML element level, not only in links, for example to tag templated and/or navigational page areas like with <a href="https://www.google.com/adsense/support/bin/answer.py?answer=23168">Google&#8217;s section targeting</a>. </p>
<p>There is nothing wrong with a mechanism to neutralize links in user input. Just the value &#8220;nofollow&#8221; in the type-of-forward-relationship attribute is not suitable to label unchecked or not (yet) trusted links. If it is really necessary to adopt a well known value from the robots exclusion standard (and don&#8217;t misunderstand me, reusing familiar terms in the right context is a good idea in general), the &#8220;noindex&#8221; value would have been a better choice (although not perfect). &#8220;Noindex&#8221; describes way better what happens in a SE ranking algo: it doesn&#8217;t index (in its technical meaning) a vote for the target. Period.</p>
<p>It is not too late to replace the rel=nofollow-fiasco with a better solution which could take care of some similar use cases too. Folks at Technorati, the W3C and wherever have done the initial work already, so there&#8217;s just a tiny task left: extending an existing norm to enable a reasonable granularity of crawler directives on link level, or better, for HTML elements in general. Rel=nofollow would get deprecated, replaced by suitable and standardized values, and for a couple years the engines could interpret rel=nofollow in its primordial meaning.</p>
<p>Ever since the rel=nofollow thingy came into existence, it has confused gazillions of non-geeky site owners, publishers and editors on the net. Last year I got a new client who had added rel=nofollow to all his internal links because he saw nofollowed links on a popular and well ranked site in his industry and thought rel=nofollow could perhaps improve his own rankings. That&#8217;s just one example of many where I&#8217;ve seen intentional as well as mistaken misuse of the way too geeky nofollow-value. As Jill Whalen <a href="http://www.threadwatch.org/node/11538#comment-51111">points out</a> to Matt Cutts, that&#8217;s just the beginning of net-wide nofollow insanity.</p>
<p>Ok, we&#8217;ve learned that the &#8220;nofollow&#8221; value is a notional monster, so can we please have it removed from the search engine algos in favour of a well thought out solution, preferably asap? Thanks.</p>
<p><span style="font-family:arial;font-size:78%;">Tags: <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/Search+Engine+Optimization" rel="tag">Search Engine Optimization</a> (<a style="TEXT-DECORATION: none" href="http://technorati.com/tag/SEO" rel="tag">SEO</a>) <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/link+condom" rel="tag">Link-Condom</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/rel=nofollow" rel="tag">rel=nofollow</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/google" rel="tag">Google</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/yahoo" rel="tag">Yahoo</a> <a style="TEXT-DECORATION: none" href="http://technorati.com/tag/msn" rel="tag">MSN</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/dear-search-engines-please-bury-the-relnofollow-fiasko/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
