<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; Cloaking</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Nofollow still means don&#8217;t follow, and how to instruct Google to crawl nofollow&#8217;ed links nevertheless</title>
		<link>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/</link>
		<comments>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/#comments</comments>
		<pubDate>Sat, 23 Feb 2008 14:51:14 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Testing]]></category>

		<category><![CDATA[Anchor Text]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/</guid>
		<description><![CDATA[What was meant as a quick test of rel-nofollow once again (inspired by Michelle&#8217;s post stating that nofollow&#8217;ed comment author links result in rankings) led to some interesting observations:

Google uses sneaky JavaScript links (that mask nofollow&#8217;ed static links) for discovery crawling, and indexes the link destinations although there&#8217;s no hard coded link on any [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/painting-nofollow-dofollow.png" width="250" height="220" align="right" alt="painting a nofollow'ed link dofollow" style="margin-left:4px;" title="How to paint a nofollow'ed link dofollow" />What was meant as a quick test of <a href="http://sebastians-pamphlets.com/links/categories/&amp;cat=nofollow">rel-nofollow</a> once again (inspired by <a href="http://www.michellemacphearson.com/do-nofollow-links-count-redux/">Michelle&#8217;s post</a> stating that nofollow&#8217;ed comment author links result in rankings) led to some interesting observations:</p>
<ul>
<li>Google uses sneaky JavaScript links (that mask nofollow&#8217;ed static links) for discovery crawling, and indexes the link destinations although there&#8217;s no hard coded link on any page on the whole Web.</li>
<li>Google doesn&#8217;t crawl URIs found only in nofollow&#8217;ed links.</li>
<li>Google most probably doesn&#8217;t use anchor text output client-side in rankings for the page that carries the JavaScript link.</li>
<li>Google most probably doesn&#8217;t pass anchor text of JavaScript links to the link destination.</li>
<li>Google doesn&#8217;t pass anchor text of (hard coded) nofollow&#8217;ed links to the link destination.</li>
</ul>
<p>As for my inspiration, I guess not all links in Michelle&#8217;s test were truly nofollow&#8217;ed. However, she&#8217;s spot on stating that condomized author links aren&#8217;t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don&#8217;t pay too much attention to REL attributes when you spread your links.</p>
<p>As for my quick test explained below, please consider it an inspiration too. It&#8217;s not a full blown SEO test, because I&#8217;ve checked one single scenario for a short period of time. However, looking only at the results from the first 24 hours after uploading the test makes it fairly certain that the test isn&#8217;t influenced by external noise, for example scraped links and such stuff.</p>
<p>On 2008-02-22 06:20:00 I put a new nofollow&#8217;ed link on my sidebar: <a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-2-a" rel="nofollow"><span id="repstuff-something-2-b">Zilchish Crap</span></a> <script type="text/javascript"> handle=document.getElementById("repstuff-something-2-b"); handle.firstChild.data="Nillified, Nil"; handle=document.getElementById("repstuff-something-2-a"); handle.href="http://sebastians-pamphlets.com/repstuff/something.php?nil=js1"; handle.rel="dofollow"; </script><code><small><br />
&lt;a href=&quot;http://sebastians-pamphlets.com/repstuff/something.php&quot; id=&quot;repstuff-something-a&quot; rel=&quot;nofollow&quot;&gt;&lt;span id=&quot;repstuff-something-b&quot;&gt;Zilchish Crap&lt;/span&gt;&lt;/a&gt;<br />
&lt;script type=&quot;text/javascript&quot;&gt;<br />
handle=document.getElementById('repstuff-something-b');<br />
handle.firstChild.data='Nillified, Nil';<br />
handle=document.getElementById('repstuff-something-a');<br />
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js1';<br />
handle.rel='dofollow';<br />
&lt;/script&gt; </small></code><br />
(The JavaScript code changes the link&#8217;s HREF, REL and anchor text.)</p>
<p>The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow&#8217;ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.</p>
<h3>Google crawls URIs extracted from somewhat sneaky JavaScript code</h3>
<p>Less than half an hour later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element&#8217;s HREF: <code><br />
66.249.72.5 	2008-02-22 06:47:07 	200-OK 	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 	/repstuff/something.php?nil=js1</code></p>
<p>Roughly three hours after this visit Googlebot fetched a URI provided only in JS code on the test page: <code><small><br />
handle=document.getElementById('a1');<br />
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js2';<br />
handle.rel='dofollow'; </small></code><br />
From the log: <code><br />
66.249.72.5 	2008-02-22 09:37:11 	200-OK 	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 	/repstuff/something.php?nil=js2</code></p>
<p>So far Google has ignored the hidden JavaScript link to <code>/repstuff/something.php?nil=js3</code> on the test page. Its code doesn&#8217;t change a static link, so that makes sense in the context of repeated statements like &#8220;Google ignores JavaScript links / treats them like nofollow&#8217;ed links&#8221; by Google reps.</p>
<p class="excursus">Of course the JS code above is easy to analyze, but don&#8217;t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!</p>
<h3>Google indexes pages that have only JavaScript links pointing to them</h3>
<p>The next day I checked the search index, and the <a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=zilchish%7Cnillyfiable+site%3Asebastians-pamphlets.com">results</a> are interesting:</p>
<p><img src="http://sebastians-pamphlets.com/img/google/nofollow-zilchish-nullifable-google-serp-24h.png" width="498" height="421" alt="rel-nofollow-test search results" title="Google indexes JS manipulated anchor text and content referenced only in JS links" /></p>
<p>The first search result is the content of the URI with the query string parameter <code>?nil=js1</code>, which is output by a JavaScript statement on my sidebar, masking the hard coded URI <code>/repstuff/something.php</code> without query string. There&#8217;s not a single real link to this URI elsewhere.</p>
<p>The second search result is a post URI where Google recognized the hard coded anchor text &#8220;zilchish crap&#8221;, but not the JS code that overwrites it with &#8220;Nillified, Nil&#8221;. With the SERP-URI parameter &#8220;&amp;filter=0&#8221; Google shows more posts that are findable with the search term [zilchish]. (Hey <a href="http://mattcutts.com/blog/">Matt</a> and <a href="http://brianwhite.org/">Brian</a>, here&#8217;s room for improvement!)</p>
<h3>Google doesn&#8217;t pass anchor text of nofollow&#8217;ed links to the link destination</h3>
<p>A search for [<a href="http://www.google.com/search?q=zilchish+site:sebastians-pamphlets.com&#038;num=100&#038;hl=en&#038;filter=0&#038;safe=off">zilchish site:sebastians-pamphlets.com</a>] doesn&#8217;t show the test page, which doesn&#8217;t carry this term. In other words, so far the anchor text &#8220;zilchish crap&#8221; of the nofollow&#8217;ed sidebar link hasn&#8217;t impacted the test page&#8217;s rankings.</p>
<h3>Google doesn&#8217;t treat anchor text of JavaScript links as textual content</h3>
<p>A search for [<a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=nillified+site%3Asebastians-pamphlets.com">nillified site:sebastians-pamphlets.com</a>] doesn&#8217;t show any URIs that have &#8220;Nillified, Nil&#8221; as client-side anchor text on the sidebar, just the test page:</p>
<p><img src="http://sebastians-pamphlets.com/img/google/nofollow-nillified-google-serp-24h.png" width="498" height="277" alt="rel-nofollow-test search results" title="Google indexes content from JS manipulated URIs" /></p>
<h3>Results, conclusions, speculation</h3>
<p>This test wasn&#8217;t intended to evaluate whether JS-generated anchor text gets passed to the link destination or not. Unfortunately &#8220;nil&#8221; and &#8220;nillified&#8221; appear both in the JS anchor text and on the page, so that&#8217;s for another post. However, it seems the JS anchor text isn&#8217;t indexed for the pages carrying the JS code (at least they don&#8217;t appear in search results for the JS anchor text), so most likely it will not count toward the link destination&#8217;s relevancy for &#8220;nil&#8221; or &#8220;nillified&#8221; either.</p>
<p>Maybe Google&#8217;s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows. </p>
<p>However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I&#8217;m quite confident that condomized links still pass neither reputation nor topical relevance. My test page is unfindable for the nofollow&#8217;ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.</p>
<h3>How to safely strip a <a href="http://link-condom.com/">link condom</a></h3>
<p><b>And what&#8217;s the actual &#8220;news&#8221;?</b> Well, say you&#8217;ve links that you must condomize because they&#8217;re paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow&#8217;ed link server-side, and change it to a clean link with JavaScript. Google told us for ages that JS links don&#8217;t count, so that&#8217;s perfectly in line with Google&#8217;s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don&#8217;t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client-side version of the link. Isn&#8217;t that neat?</p>
<h3>Link condoms <strike>with juicy taste</strike> faking good karma</h3>
<p>Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they <i style="border: medium dotted firebrick; color:navy; background:pink;">look plain ugly</i> otherwise.</p>
<p>Here is how you can do this for a complete Web page. <a href="http://example.com/" rel="nofollow example" title="Nofollow'ed and unclickable link example, use 'view source' to check it out" onclick="return false;">This link is nofollow&#8217;ed</a>. The JavaScript code below changed its REL value to &#8220;dofollow&#8221;. When you put this code <em>at the bottom of your pages</em>, it will un-condomize all your nofollow&#8217;ed links. <code><br />
&lt;script type=&quot;text/javascript&quot;&gt;<br />
    if (document.getElementsByTagName) {<br />
        var aElements = document.getElementsByTagName(&quot;a&quot;);<br />
        for (var i=0; i&lt;aElements.length; i++) {<br />
            var relvalue = aElements[i].rel.toUpperCase();<br />
            // indexOf() returns -1 when the REL value does not contain NOFOLLOW<br />
            if (relvalue.indexOf(&quot;NOFOLLOW&quot;) != -1) {<br />
                aElements[i].rel = &quot;dofollow&quot;;<br />
            }<br />
        }<br />
    }<br />
&lt;/script&gt;   </code></p>
<p><script type="text/javascript">
    if (document.getElementsByTagName) {
        var aelements = document.getElementsByTagName("a");
        for (var i=0; i<aelements.length; i++) {
            var relvalue = aelements[i].rel.toUpperCase();
            // indexOf() returns -1 when the REL value does not contain NOFOLLOW
            if (relvalue.indexOf("NOFOLLOW") != -1) {
                aelements[i].rel = "dofollow";
            }
        }
    }
</script></p>
<p>(You&#8217;ll still find condomized links on this page. That&#8217;s because the JavaScript routine above changes only links placed above it.)</p>
<p>When you add JavaScript routines like that to your pages, you&#8217;ll increase their page loading time. IOW you slow them down. Also, you should add a note to your <a href="http://sebastians-pamphlets.com/links/full-disclosure/">linking policy</a> to avoid confused advertisers who chase toolbar PageRank.</p>
<p><b>Updates:</b> Obviously Google distrusts me, how come? Four days after the link discovery the <abbr title="Googlebot coming from another IP">search quality archangel</abbr> requested the nofollow&#8217;ed URI &#8211;without query string&#8211; possibly to check whether I serve different stuff to bots and people. As if I&#8217;d cloak, laughable. (Or an assclown linked the URI without condom.)<br />
Day five: Google&#8217;s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren&#8217;t interested in client-sided links at all?</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Update your crawler detection: MSN/Live Search announces msnbot/1.1</title>
		<link>http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/</link>
		<comments>http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/#comments</comments>
		<pubDate>Tue, 12 Feb 2008 18:41:28 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/</guid>
		<description><![CDATA[Fabrice Canel from Live Search announces significant improvements of their crawler today. The very much appreciated changes are:

HTTP compression

The revised msnbot supports gzip and deflate as defined by RFC 2616 (sections 14.11 and 14.39). Microsoft also provides a tool to check your server&#8217;s compression / conditional GET support. (Bear in mind that most dynamic pages [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/msnbot-1-1.png" width="250" height="180" align="right" alt="msnbot/1.1" style="margin-left:4px;" title="MSNBOT/1.1" />Fabrice Canel from <a href="http://blogs.msdn.com/webmaster/archive/2008/02/12/announcing-crawling-improvements-for-live-search.aspx">Live Search announces significant improvements of their crawler</a> today. The very much appreciated changes are:</p>
<dl>
<dt>HTTP compression</dt>
<dd>
<p>The revised msnbot supports <b>gzip</b> and <b>deflate</b> as defined by <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html">RFC 2616</a> (sections 14.11 and 14.39). Microsoft also provides a <a href="http://go.microsoft.com/?linkid=8272590">tool to check your server&#8217;s compression / conditional GET support</a>. (Bear in mind that most dynamic pages (blogs, forums, &#8230;) will fool such <a href="http://www.microsoft.com/search/Tools/">tools</a>; try it with a static page or use your robots.txt.)</p>
</dd>
<dt>No more crawling of unchanged contents</dt>
<dd>
<p>The new msnbot/1.1 will not fetch pages that didn&#8217;t change since the last request, as long as the Web server supports the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25">&#8220;If-Modified-Since&#8221; header</a> in conditional GET requests. If a page didn&#8217;t change since the last crawl, the server responds with 304 and the crawler moves on. In this case your Web server exchanges only a handful of short lines of text with the crawler, not the contents of the requested resource.</p>
<p>If your server isn&#8217;t configured for HTTP compression and conditional GETs, you really should ask your hosting service to enable both for the sake of your bandwidth bills.</p>
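<p>To illustrate the conditional GET handling described above, here is a minimal sketch in JavaScript. The function name <code>conditionalGetStatus</code> and its parameters are hypothetical and not part of any Web server&#8217;s API; the sketch only shows the decision a server makes, assuming it knows when the resource was last modified.</p>

```javascript
// Hypothetical helper: decide whether a conditional GET can be answered with 304.
// ifModifiedSince: raw value of the request's If-Modified-Since header (or undefined).
// lastModified: Date object holding the resource's last modification time.
function conditionalGetStatus(ifModifiedSince, lastModified) {
  if (!ifModifiedSince) return 200;           // unconditional request: send the content
  const since = Date.parse(ifModifiedSince);  // NaN if the header value is not a valid date
  if (Number.isNaN(since)) return 200;        // invalid header: treat the request as unconditional
  // HTTP dates have one-second resolution, so drop the milliseconds before comparing.
  const modified = Math.floor(lastModified.getTime() / 1000) * 1000;
  return modified <= since ? 304 : 200;       // 304 = not modified, no body is sent
}
```

<p>A crawler that sends the date of its last visit gets a 304 (headers only) for an unchanged page, while a changed page or a request without the header gets the full 200 response.</p>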
</dd>
<dt>New user agent name</dt>
<dd>
<p>From reading server log files we know the Live Search bot as &#8220;msnbot/1.0 (+http://search.msn.com/msnbot.htm)&#8221;, or &#8220;msnbot-media/1.0&#8221;, &#8220;msnbot-products/1.0&#8221;, and &#8220;msnbot-news/1.0&#8221;. From now on you&#8217;ll see &#8220;<b>msnbot/1.1</b>&#8221;. Nathan Buggia from Live Search clarifies: &#8220;<b>This update does not apply to all the other &#8216;msnbot-*&#8217; crawlers, just the main msnbot</b>. We will be updating those bots in the future&#8221;.</p>
<p>If you just check the user agent string for &#8220;msnbot&#8221; you&#8217;ve nothing to change, otherwise you should check the user agent string for both &#8220;msnbot/1.0&#8221; and &#8220;msnbot/1.1&#8221; before you do the reverse DNS lookup to identify bogus bots. MSN will not change the host name &#8220;.search.live.com&#8221; used by the crawling engine.</p>
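<p>The check described above could be sketched roughly like this in JavaScript for Node.js. All function names are hypothetical. The forward lookup that confirms the host name maps back to the requesting IP is my own addition and not something the announcement asks for, and the injectable <code>resolver</code> parameter exists only so the routine can be tried without real DNS traffic.</p>

```javascript
const dns = require("dns").promises;

// True for "msnbot/1.0" and "msnbot/1.1", but not for "msnbot-media/1.0" and friends.
function claimsToBeMsnbot(userAgent) {
  return /\bmsnbot\/1\.[01]\b/i.test(userAgent || "");
}

// True if the host name ends with ".search.live.com", the suffix named in the text.
function isLiveSearchHost(hostname) {
  return String(hostname).toLowerCase().endsWith(".search.live.com");
}

// Reverse lookup of the IP, host name suffix check, then a forward lookup
// (an assumption, see lead-in) to confirm the name resolves back to the same IP.
async function isGenuineMsnbot(ip, userAgent, resolver = dns) {
  if (!claimsToBeMsnbot(userAgent)) return false;
  try {
    const hostnames = await resolver.reverse(ip);
    for (const host of hostnames.filter(isLiveSearchHost)) {
      const addresses = await resolver.resolve4(host);
      if (addresses.includes(ip)) return true;
    }
  } catch (err) {
    // A failed lookup means the visitor could not be verified.
  }
  return false;
}
```

<p>Call <code>isGenuineMsnbot(remoteIp, userAgentHeader)</code> once per unknown IP and cache the result, because DNS lookups on every request would slow your pages down.</p>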
<p>The announcement didn&#8217;t tell us whether the new bot will utilize HTTP/1.1 or not (MS and Yahoo crawlers, like other Web robots, still perform, or rather fake, HTTP/1.0 requests).</p>
</dd>
</dl>
<p>It looks like it&#8217;s no longer necessary to <a href="http://searchengineland.com/080207-174632.php">charge Live Search for bandwidth their crawler has burned</a>. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  Jokes aside, instead of reporting crawler issues to msnbot@microsoft.com, you can post your questions or concerns at a forum dedicated to <a href="http://forums.microsoft.com/webmaster/ShowForum.aspx?ForumID=1984&#038;SiteID=79">MSN crawler feedback and discussions</a>.</p>
<p>I&#8217;m quite nosy, so I just had to investigate what &#8220;there are many more improvements&#8221; in the blog post meant. I&#8217;ve asked <a href="http://nathanbuggia.com/">Nathan Buggia</a> from Microsoft a few questions. </p>
<p class="question">Nate, thanks for the opportunity to <em>talk crawling</em>&nbsp; with you. Can you please reveal a few msnbot/1.1 secrets? <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p class="answer">I&#8217;m glad you&#8217;re interested in our update, but we&#8217;re not yet ready to provide more details about additional improvements. However, there are several more that we&#8217;ll be shipping in the next couple months.</p>
<p class="question">Fair enough. So let&#8217;s talk about related topics.</p>
<p class="question">Currently I can set crawler directives for file types identified by their extensions in my robots.txt&#8217;s msnbot section. Will you fully support wildcards (* and $ for all URI components, that is path and query string) in robots.txt in the foreseeable future?</p>
<p class="answer">This is one of several additional improvements that we are looking at today, however it has not been released in the current version of MSNBot. In this update we were squarely focused on reducing the burden of MSNBot on your site.</p>
<p class="question">What can or should a Webmaster do when you seem to crawl a site way too fast, or not fast enough? Do you plan to provide a tool to reduce the server load, or to speed up your crawling for particular sites?</p>
<p class="answer">We currently support the &#8220;<a href="http://search.live.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm&#038;FORM=WFDD#D">crawl-delay</a>&#8221; option in the robots.txt file for webmasters that would like to slow down our crawling. We do not currently support an option to increase crawling frequency, but that is also a feature we are considering.</p>
<p class="question">Will msnbot/1.1 extract URLs from client sided scripts for discovery crawling? If so, will such links pass reputation?</p>
<p class="answer">Currently we do not extract URLs from client-side scripts.</p>
<p class="question">Google&#8217;s latest change to their infrastructure made nofollow&#8217;ed links completely worthless, because they no longer use those in their discovery crawling. Did you change your handling of links with a &#8220;nofollow&#8221; value in the REL attribute with this upgrade too?</p>
<p class="answer">No, changes to how we process nofollow links were not part of this update.</p>
<p class="question">Nate, many thanks for your time and your interesting answers! </p>
<p><b>Related posts:</b></p>
<ul>
<li><a href="http://blogs.msdn.com/webmaster/archive/2008/02/12/announcing-crawling-improvements-for-live-search.aspx">Official announcement</a> - by <a href="http://nathanbuggia.com/">Nathan Buggia</a>, Live Search Webmaster Center Blog</li>
<li><a href="http://searchengineland.com/080212-160910.php">MSNbot 1.1: Live Search Implements A More Efficient Crawl</a> - by <a href="http://vanessafoxnude.com/">Vanessa Fox</a>, Search Engine Land</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MSN spam to continue says the Live Search Blog</title>
		<link>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/</link>
		<comments>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/#comments</comments>
		<pubDate>Wed, 05 Dec 2007 08:58:46 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Spoofing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Cloaking]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/</guid>
		<description><![CDATA[It seems MSN/LiveSearch has tweaked their rogue bots and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and news.
Since August 2007 MSN has been running a bogus bot that follows their crawler, faking a human visitor coming from a search results page. This spambot [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/msn-live-search-clueless-webspam-detection.png" width="250" height="352" style="margin-left:4px;" align="right" alt="MSN Live Search clueless webspam detection" title="MSN Live Search is totally clueless when it comes to spam detection"  />It seems MSN/LiveSearch has tweaked their <a href="http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/">rogue bots</a> and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and <a href="http://blogs.msdn.com/webmaster/archive/2007/12/04/live-search-and-cloaking-detection.aspx">news</a>.</p>
<p>Since August 2007 MSN has been running a bogus bot that follows their crawler, faking a human visitor coming from a search results page. This spambot downloads everything from a page, that is images and other objects, external CSS/JS files, and ad blocks, rendering even contextual advertising from Google and Yahoo. It fakes MSN SERP referrers, diluting the search term stats with generic and unrelated keywords. Webmasters running non-adult sites wondered why a database tutorial suddenly ranks for [oral sex] and why MSN sends visitors searching for [<acronym title="Mothers I Like (to) Fuck">MILF</acronym> pix] to a teenager&#8217;s diary. Webmasters assumed that MSN is after deceitful cloaking, and laughed out loud because their webspam detection method was that primitive and easy to fool.</p>
<p>Now MSN admits <a href="http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/">all their sins</a> &#8211;except the launch of a porn affiliate program&#8211; and posted a <a href="http://blogs.msdn.com/webmaster/archive/2007/12/04/live-search-and-cloaking-detection.aspx">vague excuse on their Webmaster Blog</a> telling the world that they discovered the evil cloakers and their index is somewhat spam free now. <a href="http://www.seo-scoop.com/2007/12/04/msnlive-ponies-up-about-the-referrer-spam/">Donna has chatted with the MSN spam team about their spambot</a> and reports that blocking its IP addresses is a bad idea, even for sites that don&#8217;t cloak. <a href="http://www.vanessafoxnude.com/">Vanessa Fox</a> summarized MSN&#8217;s poor man&#8217;s cloaking detection at <a href="http://searchengineland.com/071204-150233.php">Search Engine Land</a>:</p>
<blockquote><p>And one has to wonder how effective methods like this really are. Those savvy enough to cloak may be able to cloak for this new cloaker detection bot as well.</p>
</blockquote>
<p>They say that they no longer spam sites that don&#8217;t cloak, but reverse this statement telling Donna</p>
<blockquote><p>we need to be able to identify the legitimate and illegitimate content</p>
</blockquote>
<p>and Vanessa </p>
<blockquote><p>sites that are cloaking may continue to see some amount of traffic from this bot. This tool crawls sites throughout the web &#8212; both those that cloak and those that don&#8217;t &#8212; but those not found to be cloaking won&#8217;t continue to see traffic.</p>
</blockquote>
<p>Here is an excerpt from yesterday&#8217;s referrer log of a site that does not cloak, and never did: <code><br />
http://search.live.com/results.aspx?q=webmaster&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=smart&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=search&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=progress&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=google&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=google&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=domain&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=database&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=content&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=business&#038;mrt=en-us&#038;FORM=LIVSOP</code><br />
Why can&#8217;t the MSN dudes tell the truth, not even when they apologize?</p>
<p>Another lie is &#8220;we obey robots.txt&#8221;. Of course the spambot doesn&#8217;t request it to bypass bot traps, but according to MSN it uses a copy served to the LiveSearch crawler &#8220;msnbot&#8221;:</p>
<blockquote><p>Yes, this robot does follow the robots.txt file. The reason you don’t see it download it, is that we use a fresh copy from our index. The tool does respect the robots.txt the same way that MSNBot does with a caveat; the tool behaves like a browser and some files that a crawler would ignore will be viewed just like real user would.</p>
</blockquote>
<p>In reality, it doesn&#8217;t help to block CSS/JS files or images in robots.txt, because MSN&#8217;s spambot will download them anyway. The long winded statement above translates to &#8220;We promise to obey robots.txt, but if it fits our needs we&#8217;ll ignore it&#8221;. </p>
<p>Well, MSN is not the only search engine running <a href="http://www.webshoppehosting.com/weblog/?p=17">stealthy bots</a> to detect cloaking, but they aren&#8217;t clever enough to do it in a less abusive and detectable way. </p>
<p>Their insane spambot led all cloaking specialists out there to their not that obvious spam detection methods. They may have caught a few cloaking sites, but considering the short life cycle of Webspam on throwaway domains they shot themselves in both feet. What they really have achieved is that the cloaking scripts are MSN spam detection immune now. </p>
<p>Was it really necessary to annoy and defraud the whole Webmaster community and to burn huge amounts of bandwidth just to catch a few cloakers who launched new scripts on new throwaway domains hours after the first appearance of the MSN spam bot?</p>
<p>Can cosmetic changes with regard to their useless spam activities restore MSN&#8217;s lost reputation? I doubt it. They&#8217;ve admitted their miserable failure five months too late. Instead of dumping the spambot, they announce that they&#8217;ll spam away for the foreseeable future. How silly is that? I thought Microsoft is somewhat profit orientated, why do they burn their and our money with such amateurish projects?</p>
<p>Besides all this crap MSN has good news too. Microsoft Live Search told Search Engine Roundtable that <a href="http://www.seroundtable.com/archives/015534.html">they&#8217;ll spam our sites with keywords related to our content</a> from now on, at least they&#8217;ll try it. And they have a <a href="http://forums.microsoft.com/webmaster/ShowForum.aspx?ForumID=1984&#038;SiteID=79">forum</a> and a <a href="https://feedback.live.com/default.aspx?productkey=livesearchwebmastercenter&#038;mkt=en-us">contact form</a> to gather complaints. Crap on, so much bureaucratic effort to administer their ridiculous spam fighting funeral. They&#8217;d better build a search engine that actually sends human traffic.</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Advantages of a smart robots.txt file</title>
		<link>http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/</link>
		<comments>http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/#comments</comments>
		<pubDate>Mon, 26 Nov 2007 15:16:17 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/</guid>
		<description><![CDATA[A loyal reader of my pamphlets asked me:
I foresee many new capabilities with robots.txt in the future due to this [Google&#8217;s robots.txt experiments]. However, how the hell can a webmaster hide their robots.txt from the public while serving it up to bots without doing anything shady?

That&#8217;s a great question. On this blog I&#8217;ve a static [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/write-a-smart-robots-txt.png" width="200" height="200" align="right" style="margin-left:4px;" alt="Write a smart robots.txt" title="How to write a smart robots.txt"  />A <a href="http://www.jordankasteler.com/utah-seo-pro-blog/">loyal reader</a> of my pamphlets <a href="http://sebastians-pamphlets.com/about-noindex-crawler-directives-in-robots-txt/#comment-737">asked me</a>:</p>
<blockquote><p>I foresee many <a href="http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/">new capabilities with robots.txt</a> in the future due to <a href="http://sebastians-pamphlets.com/about-noindex-crawler-directives-in-robots-txt/">this</a> [Google&#8217;s robots.txt experiments]. However, how the hell can a webmaster hide their robots.txt from the public while serving it up to bots without doing anything shady?</p>
</blockquote>
<p>That&#8217;s a great question. On this blog I&#8217;ve a static robots.txt, so I&#8217;ve set up a dynamic example using code snippets from other sites: <a href="http://sebastians-pamphlets.com/repstuff/robots.txt">this robots.txt</a> is what a user sees, and <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Googlebot">here</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Googlebot-Mobile">is</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Googlebot-Image">what</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Mediapartners-Google">various</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Adsbot-Google">crawlers</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Slurp">get</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=msnbot">on</a> <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=Teoma">request</a> of my <a href="http://sebastians-pamphlets.com/repstuff/robots.txt?crawlerName=*">robots.txt example</a>. Of course crawlers don&#8217;t request a robots.txt file with a query string identifying themselves (/robots.txt?crawlerName=*) like in the preview links above, so it seems you&#8217;ll need a pretty smart robots.txt file.</p>
<p>Before I tell you how to smarten a robots.txt file, let&#8217;s define the anatomy of a somewhat intelligent robots.txt script:</p>
<ul>
<li>It exists. It&#8217;s not empty. <a href="http://sebastians-pamphlets.com/why-proper-error-handling-is-important/">I&#8217;m not kidding</a>.</li>
<li>A smart robots.txt detects and verifies crawlers to serve customized REP statements to each spider. Customized code means a section for the actual search engine, and general crawler directives. Example:<code><br />
User-agent: Googlebot-Image<br />
Disallow: /<br />
Allow: /cuties/*.jpg$<br />
Allow: /hunks/*.gif$<br />
Allow: /sitemap*.xml$<br />
Sitemap: http://example.com/sitemap-images.xml<br />
&nbsp;<br />
User-agent: *<br />
Disallow: /cgi-bin/</code><br />
This avoids confusion, because complex static robots.txt files with a section for all crawlers out there (plus a general section for other Web robots) are error-prone, and might exceed the maximum file size some bots can handle. If you fuck up a single statement in a huge set of instructions, this may kill the process parsing your robots.txt, which results in no crawling at all, or possibly crawling of forbidden areas. Checking the syntax per engine with a lean robots.txt is way easier (supported robots.txt syntax: <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40364">Google</a>, <a href="http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html">Yahoo</a>, <a href="http://about.ask.com/en/docs/about/webmasters.shtml#6">Ask</a> and <a href="http://search.msn.com.sg/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm">MSN/LiveSearch</a>; don&#8217;t use wildcards with MSN because they don&#8217;t really support them: at MSN, wildcards are valid to <a href="http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm#B">match filetypes only</a>).</li>
<li>A smart robots.txt <a href="http://sebastians-pamphlets.com/repstuff/robots-txt-request-log.htm?view=crawler">reports all crawler requests</a>. This helps with tracking when you change something. Please note that there&#8217;s a lag between the most recent request of robots.txt and the moment a search engine starts to obey it, because all engines cache your robots.txt.</li>
<li>A smart robots.txt helps identify unknown Web robots, at least those which bother requesting it (<a href="http://incredibill.blogspot.com/">ask Bill</a> how to fondle rogue bots). From a <a href="http://sebastians-pamphlets.com/repstuff/robots-txt-request-log.htm?view=suspect">log of suspect requests of your robots.txt</a> you can decide whether particular crawlers need special instructions or not.</li>
<li>A smart robots.txt helps maintain your crawler IP list.</li>
</ul>
<p><b>Here is my step by step &#8220;how to create a smart robots.txt&#8221; guide.</b> As always: if you suffer from <a href="http://sebastians-pamphlets.com/links/categories/?cat=iis">IIS/ASP</a> go search for <a href="http://www.nationalnet.com/hosting-services.php" rel="dofollow i-am-a-happy-camper">reliable hosting</a> (*ix/Apache).</p>
<p>In order to make robots.txt a script, tell your server to parse .txt files for PHP. (If you serve .txt files other than robots.txt, please note that you must add <code>&lt;?php ?&gt;</code> as the first line to all .txt files on your server!) Add this line to your root&#8217;s .htaccess file: <code><br />
AddType application/x-httpd-php .txt</code></p>
<p>Next grab the PHP code for crawler detection from <a href="http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#grab-php-code-check-crawler">this post</a>. In addition to the functions checkCrawlerUA() and checkCrawlerIP() you need a function delivering the right user agent name, so please welcome getCrawlerName() in your PHP portfolio: </p>
<p><b><a onclick="showContent('php-code-return-crawler-name'); return false;">View</a>|<a onclick="hideContent('php-code-return-crawler-name'); return false;">hide</a> PHP code.</b> (If you&#8217;ve disabled JavaScript you can&#8217;t grab the PHP source code!)<br />
<code id="php-code-return-crawler-name" style="display:none;"><b><br />
if (!function_exists(&quot;getCrawlerName&quot;)) {<br />
function getCrawlerName () {<br />
    GLOBAL $userAgent;<br />
    $crawlerName = &quot;&quot;;<br />
    if (stristr($userAgent,&quot;Googlebot&quot;)) $crawlerName = &quot;Googlebot&quot;;<br />
    if (stristr($userAgent,&quot;Googlebot-Mobile&quot;)) $crawlerName = &quot;Googlebot-Mobile&quot;;<br />
    if (stristr($userAgent,&quot;Googlebot-Image&quot;)) $crawlerName = &quot;Googlebot-Image&quot;;<br />
    if (stristr($userAgent,&quot;Mediapartners-Google&quot;)) $crawlerName = &quot;Mediapartners-Google&quot;;<br />
    if (stristr($userAgent,&quot;Adsbot-Google&quot;)) $crawlerName = &quot;AdsBot-Google&quot;;<br />
    if (stristr($userAgent,&quot;Slurp&quot;)) $crawlerName = &quot;Slurp&quot;;<br />
    if (stristr($userAgent,&quot;Ask&quot;) &#038;&#038;<br />
        stristr($userAgent,&quot;Teoma&quot;)) $crawlerName = &quot;Teoma&quot;;<br />
    if (stristr($userAgent,&quot;MSNbot&quot;)) $crawlerName = &quot;msnbot&quot;;<br />
    // Add other crawlers here:<br />
    // if (stristr($userAgent,&quot;somebot&quot;)) $crawlerName = &quot;somebot&quot;;<br />
    // Unknown crawler:<br />
    if (empty($crawlerName)) $crawlerName = &quot;*&quot;;<br />
    return $crawlerName;<br />
} // end function getCrawlerName<br />
}</b></code><br />
(If your instructions for Googlebot, Googlebot-Mobile and Googlebot-Image are identical, you can put them in one single &#8220;Googlebot&#8221; section.)
</p>
<p>And here is the PHP script &#8220;/robots.txt&#8221;. Include the general stuff like functions, shared (global) variables and whatnot. <code><b><br />
&lt;?php<br />
@require($_SERVER[&quot;DOCUMENT_ROOT&quot;] .&quot;/code/generalstuff.php&quot;);</b></code></p>
<p>Probably your Web server&#8217;s default settings aren&#8217;t suitable to send out plain text files, hence instruct it properly. <code><b><br />
@header(&quot;Content-Type: text/plain&quot;);<br />
@header(&quot;Pragma: no-cache&quot;);<br />
@header(&quot;Expires: 0&quot;);</b></code><br />
If a search engine runs wild requesting your robots.txt too often, comment out the &#8220;no-cache&#8221; and &#8220;expires&#8221; headers.</p>
<p>Next check whether the requestor is a verifiable search engine crawler. checkCrawlerIP() does a reverse DNS lookup on the requestor&#8217;s IP address, then a forward lookup of the returned host name to confirm it. <code><b><br />
$isSpider = checkCrawlerIP($requestUri);</b></code></p>
<p>Depending on $isSpider log the request either in a <a href="http://sebastians-pamphlets.com/repstuff/robots-txt-request-log.htm?view=crawler">crawler log</a> or an <a href="http://sebastians-pamphlets.com/repstuff/robots-txt-request-log.htm?view=suspect">access log gathering suspect requests of robots.txt</a>. You can store both in a database table, or in a flat file if you operate a tiny site. (Write the logging function yourself.) <code><b><br />
$standardStatement = &quot;User-agent: *\nDisallow: /cgi-bin/\n\n&quot;;<br />
print $standardStatement;<br />
if ($isSpider) {<br />
    $lOk = writeRequestLog(&quot;crawler&quot;);<br />
    $crawlerName = getCrawlerName();<br />
}<br />
else {<br />
    $lOk = writeRequestLog(&quot;suspect&quot;);<br />
    exit;<br />
}</b></code><br />
If the requestor is not a search engine crawler you can verify, send a <a href="http://sebastians-pamphlets.com/repstuff/robots.txt">standard statement</a> to the user agent and quit. Otherwise call getCrawlerName() to name the section for the requesting crawler.</p>
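<p>Since the logging function is left to the reader, here is a minimal sketch of what writeRequestLog() could look like for a tiny site using flat files. The log directory, the file names and the pipe-delimited record layout are assumptions made for illustration, not the code running on this blog, and file_put_contents() requires PHP 5 or later.</p>

```php
<?php
// Hypothetical sketch of writeRequestLog(). The default directory, the file
// names and the pipe-delimited record layout are assumptions.
function writeRequestLog($type, $logDir = null) {
    if ($logDir === null) {
        // Placeholder default; use a directory outside your document root.
        $logDir = sys_get_temp_dir();
    }
    // Only two logs exist; everything that is not "crawler" counts as suspect.
    $type = ($type === "crawler") ? "crawler" : "suspect";
    $fields = array(
        date("Y-m-d H:i:s"),
        isset($_SERVER["REMOTE_ADDR"]) ? $_SERVER["REMOTE_ADDR"] : "-",
        isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "-",
        isset($_SERVER["REQUEST_URI"]) ? $_SERVER["REQUEST_URI"] : "-",
    );
    // Replace delimiters and line breaks so one request stays one record.
    foreach ($fields as $i => $field) {
        $fields[$i] = str_replace(array("|", "\r", "\n"), " ", $field);
    }
    $line = implode("|", $fields) . "\n";
    $file = $logDir . "/robots-txt-" . $type . ".log";
    // FILE_APPEND adds a record, LOCK_EX guards against concurrent writes.
    return file_put_contents($file, $line, FILE_APPEND | LOCK_EX) !== false;
}
```

<p>Called as writeRequestLog(&quot;crawler&quot;) or writeRequestLog(&quot;suspect&quot;), it fits the script above unchanged. On a busy site a database table is probably the better choice.</p>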
<p>Now you can output individual crawler directives for each search engine, or rather for each of their specialized crawlers. <code><b><br />
$prnUserAgent = &quot;User-agent:&quot;;<br />
$prnContent   = &quot;&quot;;<br />
if (&quot;$crawlerName&quot; == &quot;Googlebot-Image&quot;) {<br />
    $prnContent .= &quot;$prnUserAgent $crawlerName\n&quot;;<br />
    $prnContent .= &quot;Disallow: /\n&quot;;<br />
    $prnContent .= &quot;Allow: /cuties/*.jpg$\n&quot;;<br />
    $prnContent .= &quot;Allow: /hunks/*.gif$\n&quot;;<br />
    $prnContent .= &quot;Allow: /sitemap*.xml$\n&quot;;<br />
    $prnContent .= &quot;Sitemap: http://example.com/sitemap-images.xml\n\n&quot;;<br />
}<br />
if (&quot;$crawlerName&quot; == &quot;Mediapartners-Google&quot;) {<br />
    $prnContent .= &quot;$prnUserAgent $crawlerName\nDisallow:\n\n&quot;;<br />
}<br />
&#8230;<br />
print $prnContent;<br />
?&gt;</b></code></p>
<p>If the user agent is Googlebot-Image, the code above outputs this robots.txt: <code><b style="margin-left:15px;"><br />
User-agent: *<br />
Disallow: /cgi-bin/<br />
&nbsp;<br />
User-agent: Googlebot-Image<br />
Disallow: /<br />
Allow: /cuties/*.jpg$<br />
Allow: /hunks/*.gif$<br />
Allow: /sitemap*.xml$<br />
Sitemap: http://example.com/sitemap-images.xml<br />
</b></code><br />
(Please note that crawler sections must be delimited by an empty line, and that if there&#8217;s a section for a particular crawler, this spider will ignore the general directives. Please consider reading more pamphlets discussing <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a> and <a href="http://sebastians-pamphlets.com/links/categories/?definitions=TRUE">dull stuff like that</a>.)</p>
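<p>The chain of if-statements gets unwieldy once you maintain sections for many crawlers. As a sketch only, the same output can be produced from a lookup table. The function name buildRobotsTxt() and the table below are illustrative assumptions; the directives are the ones from the example above.</p>

```php
<?php
// Sketch: build the robots.txt body from a lookup table instead of a chain
// of if-statements. Function name and table layout are assumptions.
function buildRobotsTxt($crawlerName, $sections) {
    // The general section goes first, as in the script above.
    $out = "User-agent: *\nDisallow: /cgi-bin/\n\n";
    if (isset($sections[$crawlerName])) {
        $out .= "User-agent: " . $crawlerName . "\n";
        foreach ($sections[$crawlerName] as $directive) {
            $out .= $directive . "\n";
        }
        $out .= "\n"; // an empty line delimits crawler sections
    }
    return $out;
}

// Single quotes keep the trailing "$" of the wildcard patterns literal.
$sections = array(
    "Googlebot-Image" => array(
        'Disallow: /',
        'Allow: /cuties/*.jpg$',
        'Allow: /hunks/*.gif$',
        'Allow: /sitemap*.xml$',
        'Sitemap: http://example.com/sitemap-images.xml',
    ),
    "Mediapartners-Google" => array('Disallow:'),
);
```

<p>In the script you would then write print buildRobotsTxt($crawlerName, $sections); after the crawler has been verified. A crawler name without an entry in the table, including &quot;*&quot;, gets the general section only.</p>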
<p>That&#8217;s it. Adapt. Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Act out your sophisticated affiliate link paranoia</title>
		<link>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/</link>
		<comments>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#comments</comments>
		<pubDate>Tue, 13 Nov 2007 07:09:30 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[E-Commerce]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/</guid>
		<description><![CDATA[My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google&#8217;s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:
Links purchased for advertising should be designated [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/paranoid-affiliate-link.png" width="250" height="231" border="0" align="right" style="margin-left:4px;" alt="GOOD: paranoid affiliate link" title="Paranoid on affiliate links" />My recent posts on <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">managing affiliate links</a> and <a href="http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/">nofollow cloaking</a> <a href="http://sebastians-pamphlets.com/text-link-broker-woes-smart-paid-links-sniffers-fromgoogle/">paid links</a> led to so many reactions from my readers that I thought explaining possible protection levels could make sense. <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66736">Google&#8217;s request to condomize affiliate links</a> is a bit, well, thin when it comes to technical tips and tricks:<br />
<blockquote>Links purchased for advertising should be designated as such. This can be done in several ways, such as:<br />
    * Adding a rel=&#8221;nofollow&#8221; attribute to the &lt;a&gt; tag<br />
    * Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file</p></blockquote>
<p> Also, Google doesn&#8217;t define <a href="http://sebastians-pamphlets.com/links/categories/?cat=paid-links">paid links</a> that clearly, so try this <a href="http://www.stonetemple.com/blog/?p=196">paid link definition</a> instead before you read on. <b>Here is my linking guide for the paranoid affiliate marketer.</b></p>
<p><a href="http://www.google.com/support/webmasters/bin/answer.py?answer=76465">Google recommends hiding any content provided by affiliate programs from their crawlers</a>. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant&#8217;s data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program&#8217;s sales tools page count as duplicate content (snippet) in its worst occurrence.</p>
<p>Pasting code copied from a merchant&#8217;s site into a page&#8217;s or template&#8217;s HTML is not exactly a smart way to place ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you&#8217;re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts. </p>
<p>There are tons of such scripts out there, so I won&#8217;t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so if you run a WordPress blog you can adapt them with ease. </p>
<h3>Cover your ass with a linking policy</h3>
<p>Googlers as well as hired guns do review Web sites for violations of Google&#8217;s guidelines, and competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) <a href="http://sebastians-pamphlets.com/links/full-disclosure/">full disclosure of your linking attitude</a> can help to pass a human review by search engine staff. By the way, having a <a href="http://sebastians-pamphlets.com/about/policies/#commenting">policy for dofollowed blog comments</a> is also a good idea.</p>
<p>Since crawler directives like <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">link condoms</a> are for search engines (only), and those pay attention to your source code and hints addressing search engines like <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, you should leave a note <a href="http://sebastians-pamphlets.com/robots.txt" rel="nofollow nocontent">there</a> too; look into the source of this page for an example. <a onclick="showContent('sample-code-disclosure'); this.style.display = 'none'; return false;">View sample HTML comment.</a> <b id="sample-code-disclosure" style="display:none;">Sample HTML comment: <code>&lt;&#33;--</code>This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.<code>--&gt;</code></b> </p>
<h3>Block crawlers from your propaganda scripts</h3>
<p>Put all your stuff related to advertising (scripts, images, movies&#8230;) in a subdirectory and disallow search engine crawling in your <a href="http://www.smart-it-consulting.com/article.htm?node=140&#038;page=46">/robots.txt</a> file: <code><br />
User-agent: *<br />
Disallow: /propaganda/ </code><br />
Of course you&#8217;ll use an innocuous name like &#8220;gnisitrevda&#8221; for this folder, which lacks a default document and can&#8217;t get browsed because you&#8217;ve a <code><br />
Options -Indexes </code><br />
statement in your .htaccess file. (Watch out, Google knows what &#8220;gnisitrevda&#8221; means, so be creative or cryptic.)</p>
<p>Crawlers sent out by major search engines do respect robots.txt, hence regular spiders won&#8217;t fetch anything from this folder. As long as you don&#8217;t cheat too much, you&#8217;re not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn&#8217;t protect you from search engine staff surfing your site, but I won&#8217;t tell you what you&#8217;d better hide from Matt&#8217;s gang.</p>
<h3>Detect search engine crawlers</h3>
<p>Basically there are three common methods to detect requests by search engine crawlers.
<ol>
<li>Testing the user agent name (HTTP_USER_AGENT) for strings like &#8220;Googlebot&#8221;, &#8220;Slurp&#8221;, &#8220;MSNbot&#8221; or so which identify crawlers. That&#8217;s easy to spoof, for example <a href="http://sebastians-pamphlets.com/referrer-spoofing-with-prefbar-341/">PrefBar for Firefox</a> lets you choose from a list of user agents.</li>
<li>Checking the user agent name, and only when it indicates a crawler, verifying the requestor&#8217;s IP address with a reverse lookup, or against a cache of verified crawler IP addresses and host names.</li>
<li>Maintaining a list of all search engine crawler IP addresses known to man,  checking the requestor&#8217;s IP (REMOTE_ADDR) against this list. (That alone isn&#8217;t bullet-proof, but I&#8217;m not going to write a tutorial on industrial-strength <strike>cloaking</strike> IP delivery, I leave that to the real <a href="http://fantomaster.com/fantomNews">experts</a>.)</li>
</ol>
<p>For our purposes we use methods 1 and 2. When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won&#8217;t activate your account without testing your links. When crawlers try to follow affiliate links, on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have an impact on the merchants&#8217; search engine rankings, and that&#8217;s evil in Google&#8217;s eyes.  </p>
<p>We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name, if the user agent name identifies a major search engine&#8217;s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor&#8217;s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does primitive caching in a flat file, so that once a crawler has been verified on its very first content request, it can be detected from this cache to avoid pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn&#8217;t match a crawler name. </p>
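<p>One detail of the host name check deserves a note: the host name returned by the reverse lookup should end with the expected crawler domain, not merely contain it, otherwise a host like crawl.googlebot.com.example.net would slip through the first test. A minimal sketch of such a suffix test follows; hostEndsWith() is a hypothetical helper name, not one of the two functions described above.</p>

```php
<?php
// hostEndsWith() is a hypothetical helper for illustration.
// It returns true only if $host ends with $suffix, e.g. ".googlebot.com".
function hostEndsWith($host, $suffix) {
    // DNS names are case-insensitive; drop a trailing dot of a FQDN.
    $host   = strtolower(rtrim($host, "."));
    $suffix = strtolower($suffix);
    // The host must be longer than the suffix, so the bare domain fails.
    if (strlen($host) <= strlen($suffix)) {
        return false;
    }
    return substr($host, -strlen($suffix)) === $suffix;
}
```

<p>With this helper, a host name like crawl-66-249-66-1.googlebot.com passes for the suffix .googlebot.com, while crawl.googlebot.com.example.net does not. The forward lookup that confirms the IP address is still required afterwards.</p>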
<p><b id="grab-php-code-check-crawler"><a onclick="showContent('php-code-check-crawler'); return false;">View</a>|<a onclick="hideContent('php-code-check-crawler'); return false;">hide</a> PHP code.</b> (If you&#8217;ve disabled JavaScript you can&#8217;t grab the PHP source code!)<br />
<code id="php-code-check-crawler" style="display:none;"><b><br />
// file system path to crawler IP log, scripts etc.,<br />
// without trailing slash:<br />
$includePath   = $_SERVER[&quot;DOCUMENT_ROOT&quot;] . &quot;/propaganda&quot;;<br />
// edit &quot;propaganda&quot; and CHMOD 777 the directory !<br />
// file names:<br />
$crawlerIps  = $includePath .&quot;/crawler-ip-addresses.txt&quot;;<br />
// misc. stuff:<br />
$timestamp     = date(&quot;Y-m-d H:i:s&quot;);<br />
$ipAddy        = $_SERVER[&quot;REMOTE_ADDR&quot;];<br />
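$referrer      = isset($_SERVER[&quot;HTTP_REFERER&quot;]) ? $_SERVER[&quot;HTTP_REFERER&quot;] : &quot;&quot;;<br />
$userAgent     = isset($_SERVER[&quot;HTTP_USER_AGENT&quot;]) ? $_SERVER[&quot;HTTP_USER_AGENT&quot;] : &quot;&quot;;<br />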
$requestUri    = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$queryString   = $_SERVER[&quot;QUERY_STRING&quot;];<br />
$isCrawler     = FALSE;<br />
$crawlerServer = &quot;&quot;;<br />
$delimiter     = &quot;|&quot;;<br />
$idString      = &quot;&quot;;<br />
if (empty($includePath)) {<br />
   $includePath = $_SERVER[&quot;DOCUMENT_ROOT&quot;] . &quot;/propaganda&quot;; // CHMOD 777<br />
}<br />
// Write a file to disk<br />
if (!function_exists(&quot;writeLocalFile&quot;)) {<br />
function writeLocalFile ($file, $content) {<br />
   if (!is_writable($file)) {<br />
      $lok = @chmod ( $file, 0777 );<br />
   }<br />
   // file_put_contents() not avail in PHP 4.3x<br />
   $fp = @fopen(&quot;$file&quot;,&quot;w+&quot;);<br />
   if ($fp) {<br />
       $lOk = @fwrite($fp, $content, strlen($content));<br />
       @fclose($fp);<br />
       // make sure file may get overwritten or removed later on<br />
       $lok = @chmod ( $file, 0777 );<br />
       return TRUE;<br />
   } // endif $fp<br />
   return FALSE;<br />
} // end function writeLocalFile<br />
}<br />
if (!function_exists(&quot;checkCrawlerUA&quot;)) {<br />
function checkCrawlerUA () {<br />
    GLOBAL $userAgent;<br />
    GLOBAL $crawlerServer;<br />
    $crawlerServer = &quot;&quot;;<br />
    $crawlers  = array(&quot;Googlebot&quot;,&quot;Mediapartners&quot;,&quot;Slurp&quot;,&quot;MSNbot&quot;,&quot;Ask&quot;,&quot;Teoma&quot;);<br />
    foreach ($crawlers as $crawler) {<br />
        if (stristr($userAgent,$crawler)) {<br />
            if (stristr($crawler,&quot;Googlebot&quot;) ||<br />
                stristr($crawler,&quot;Mediapartners&quot;)) {<br />
                $crawlerServer = &quot;.googlebot.com&quot;;<br />
            } // Google<br />
            if (stristr($crawler,&quot;Slurp&quot;)) {<br />
                $crawlerServer = &quot;.crawl.yahoo.net&quot;;<br />
            } // Yahoo<br />
            if (stristr($crawler,&quot;MSNbot&quot;)) {<br />
                $crawlerServer = &quot;.search.live.com&quot;;<br />
            } // MSN/Live<br />
            if (stristr($crawler,&quot;Ask&quot;) ||<br />
                stristr($crawler,&quot;Teoma&quot;)) {<br />
                $crawlerServer = &quot;.ask.com&quot;;<br />
            } // Ask<br />
        }<br />
    } // foreach crawlers<br />
    if (!empty($crawlerServer)) return TRUE;<br />
    return FALSE;<br />
} // end function checkCrawlerUA<br />
}<br />
if (!function_exists(&quot;checkCrawlerIP&quot;)) {<br />
function checkCrawlerIP ($idString) {<br />
    GLOBAL $ipAddy;<br />
    GLOBAL $crawlerIps;<br />
    GLOBAL $delimiter;<br />
    GLOBAL $timestamp;<br />
    GLOBAL $userAgent;<br />
    GLOBAL $crawlerServer;<br />
    $isCrawler = checkCrawlerUA();<br />
    if ($isCrawler === FALSE)  return FALSE;<br />
    if (empty($crawlerServer)) return FALSE;<br />
//<br />
// DEBUG: $crawlerServer = &quot;.national-net.com&quot;;<br />
// Use your ISPs host name for testing with a spoofed user agent name<br />
//<br />
    $crawlerIpsContent = @file_get_contents($crawlerIps);<br />
    if (!empty($crawlerIpsContent)) {<br />
        if (stristr($crawlerIpsContent, &quot;\n$ipAddy$delimiter&quot;)) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    $crawlerHost = @gethostbyaddr($ipAddy);<br />
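    // the host name must END with the expected crawler domain<br />
    if (strcasecmp(substr($crawlerHost, -strlen($crawlerServer)), $crawlerServer) != 0) {<br />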
        return FALSE;<br />
    }<br />
    if (&quot;$crawlerHost&quot; == &quot;$ipAddy&quot;) {<br />
        return FALSE;<br />
    }<br />
    $ipAddyRev = @gethostbyname($crawlerHost);<br />
    if (&quot;$ipAddyRev&quot; != &quot;$ipAddy&quot;) {<br />
        return FALSE;<br />
    }<br />
    $crawlerIpsContent .= &quot;\n&quot; .$ipAddy .$delimiter<br />
                          .$timestamp   .$delimiter<br />
                          .$crawlerHost .$delimiter<br />
                          .$idString    .$delimiter<br />
                          .$userAgent   .$delimiter;<br />
    $lOk = writeLocalFile ($crawlerIps, $crawlerIpsContent);<br />
    return TRUE;<br />
} // end function checkCrawlerIP<br />
}<br />
</b></code><br />
Grab and implement the PHP source, then you can code statements like <code><br />
$isSpider = checkCrawlerUA ();<br />
...<br />
if ($isSpider) {<br />
    $relAttribute = &quot; rel=\&quot;nofollow\&quot; &quot;;<br />
}<br />
...<br />
$affLink = &quot;&lt;a href=\&quot;$affUrl\&quot; $relAttribute&gt;call for action&lt;/a&gt;&quot;;<br />
</code><br />
or <code><br />
$isSpider = checkCrawlerIP ($sponsorUrl);<br />
...<br />
if ($isSpider) {<br />
    // don't redirect to the sponsor, return a 403 or 410 instead<br />
}</code><br />
More on that later.</p>
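<p>To make the second snippet a bit more concrete, here is a rough sketch of the decision a redirect script could take. The helper names affiliateResponse() and sendResponse() are made up for this example, the choice of 403 for crawlers and a 302 redirect for everybody else is an assumption, and http_response_code() requires PHP 5.4 or later.</p>

```php
<?php
// Sketch of the decision inside a redirect script for affiliate links.
// Both function names are hypothetical; $isSpider would come from
// checkCrawlerIP($sponsorUrl).
function affiliateResponse($isSpider, $sponsorUrl) {
    if ($isSpider) {
        // Verified crawler: refuse instead of passing it on to the sponsor.
        return array("status" => 403, "headers" => array());
    }
    // Human visitor or spoofed user agent: temporary redirect to the sponsor.
    return array(
        "status"  => 302,
        "headers" => array("Location: " . $sponsorUrl),
    );
}

// Sends the prepared response; call this before any other output.
function sendResponse($response) {
    foreach ($response["headers"] as $header) {
        header($header);
    }
    http_response_code($response["status"]);
}
```

<p>In the redirect script this boils down to sendResponse(affiliateResponse($isSpider, $sponsorUrl)); followed by exit. Swap 403 for 410 if you prefer to tell crawlers that the URL is gone.</p>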
<h3>Don&#8217;t deliver your advertising to search engine crawlers</h3>
<p>It&#8217;s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense&#8217;s script calls. Whether you go that far or not depends on the degree of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can&#8217;t guarantee indexing, because Google doesn&#8217;t index shitloads of templated pages which hide duplicate content as well as ads from crawling, without carrying a single piece of somewhat compelling content.</p>
<p>Here is how you could output a totally uncrawlable banner ad: <code><br />
...<br />
$isSpider = checkCrawlerIP ($_SERVER[&quot;PHP_SELF&quot;]);<br />
...<br />
print &quot;&lt;div class=\&quot;css-class-sidebar robots-nocontent\&quot;&gt;&quot;;<br />
// output RSS buttons or so<br />
if (!$isSpider) {<br />
    print &quot;&lt;script type=\&quot;text/javascript\&quot; src=\&quot;http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&#038;adServed=banner\&quot;&gt;&lt;/script&gt;&quot;;<br />
    ...<br />
}<br />
...<br />
print &quot;&lt;/div&gt;\n&quot;;<br />
...</code><br />
Lets look at the code above. First we detect crawlers &#8220;without doubt&#8221; (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-&#8217;.crawl.yahoo.net&#8217; host but another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I&#8217;ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler-UA).</p>
<p>Currently the <em>robots-nocontent</em>&nbsp; class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn&#8217;t conflict with class names used with your CSS, because each X/HTML element can have an unlimited list of space delimited class names. Like Google&#8217;s section targeting that&#8217;s a <a href="http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/">crappy crawler directive</a>, though. However, it doesn&#8217;t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), &#8220;buy now&#8221;/&#8221;view basket&#8221; links or references to TOS pages and alike, templated text like terms of delivery (but not the street address provided for local search) &#8230; and of course ads.</p>
<p>Ads aren&#8217;t output when a crawler requests a page. Of course that&#8217;s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren&#8217;t relevant for indexing that&#8217;s not <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66355">deceitful cloaking</a> in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster&#8217;s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers. </p>
<p>Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren&#8217;t penalized. Since Web robots can&#8217;t pull out the plastic, there&#8217;s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Example from Google</a>: &#8220;Allow search bots to crawl your sites without session IDs or arguments that track their path through the site&#8221; is a call for <a href="http://www.smart-it-consulting.com/article.htm?node=148&#038;page=103">search engine friendly URL cloaking</a>.) </p>
<h3>Is hiding ads from crawlers &#8220;safe with Google&#8221; or not?</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/uncloaked-affiliate-link.png" width="200" height="188" border="0" align="right" style="margin-left:4px;" alt="BAD: uncloaked affiliate link" title="Uncloaked affiliate link" />Cloaking ads away is a double edged sword from a search engine&#8217;s perspective. Way too strictly interpreted that&#8217;s against the cloaking rule which states &#8220;don&#8217;t show crawlers other content than humans&#8221;, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google&#8217;s section targeting or Yahoo&#8217;s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don&#8217;t count them as on-the-page contents.</p>
<p>Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don&#8217;t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify <acronym title="Made For AdSense/Advertising">MFA</acronym> pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings. </p>
<p>Since in theory the contents of JavaScript ads aren&#8217;t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can&#8217;t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can&#8217;t argue seriously that hiding such tiny signals &#8211;which they use for the sole purposes of possible downranks&#8211; is against their guidelines.</p>
<p>For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn&#8217;t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.</p>
<p><b>Dear search engines, if it does not count, then you cannot demand to keep it crawlable.</b> Well, a few super mega white hat <acronym title="Dougie ...">trolls</acronym> might disagree, and depending on the implementation on individual sites maybe hiding ads isn&#8217;t totally riskless in any case, so decide yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don&#8217;t try to hide the fact that I run ads on this blog.</p>
<p>Usually I don&#8217;t argue with fair vs. unfair, because we talk about <strike>war</strike> business here, which means that anything goes. However, Google does everything to talk the whole Internet into <strike>obfuscating</strike> disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> (Of course that&#8217;s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won&#8217;t care about bad press. So please don&#8217;t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)</p>
<h3>Outputting ads with JavaScript, preferably in iFrames</h3>
<p>Delivering adverts with JavaScript does not mean that one can&#8217;t use server sided scripting to adjust them dynamically. With content management systems it&#8217;s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn&#8217;t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval&#8217;ing the PHP code).</p>
<p>Let&#8217;s see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:<br />
<script type="text/javascript" src="http://sebastians-pamphlets.com/ads/output.js.php?adName=seobook&#038;adServed=banner"></script><br />
And here is the JS call of the PHP script: <code><br />
&lt;script type=&quot;text/javascript&quot; src=&quot;http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&#038;adServed=banner&quot;&gt;&lt;/script&gt;</code></p>
<p>The PHP script <code>/propaganda/output.js.php</code> evaluates the query string to pull the requested ad&#8217;s components. In case it&#8217;s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor&#8217;s location and whatnot, but that&#8217;s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad&#8217;s links. Once the HTML code is ready, it outputs a JavaScript statement: <code><br />
document.write('&lt;a href=&quot;http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&#038;adServed=banner&quot; title=&quot;DOWNLOAD THE BOOK ON SEO!&quot;&gt;&lt;img src=&quot;http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif&quot; width=&quot;468&quot; height=&quot;60&quot; border=&quot;0&quot; alt=&quot;The only current book on SEO&quot; title=&quot;The only current book on SEO&quot;  /&gt;&lt;/a&gt;'); </code> which the browser executes within the <code>script</code> tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag. </p>
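<p>To make that more tangible, here is a minimal sketch of how such an output script could assemble its JavaScript statement. It is not the code running on this blog: the function name <code>buildAdScript</code>, the layout of the <code>$ad</code> array and the example.com URLs are made up for illustration. The sketch lets <code>json_encode()</code> produce the JavaScript string literal, so quotes in the HTML code get escaped automatically.</p>

```php
<?php
// Hypothetical sketch: build the document.write() statement for one ad.
// The $ad keys (href, img, width, height, alt) are assumptions for this example.
function buildAdScript(array $ad, bool $isSpider): string {
    // Crawlers get the link condom, human visitors get a clean link.
    $rel  = $isSpider ? ' rel="nofollow"' : '';
    $html = '<a href="' . htmlspecialchars($ad['href']) . '"' . $rel . '>'
          . '<img src="' . htmlspecialchars($ad['img']) . '"'
          . ' width="' . (int) $ad['width'] . '"'
          . ' height="' . (int) $ad['height'] . '"'
          . ' alt="' . htmlspecialchars($ad['alt']) . '" /></a>';
    // json_encode() turns the HTML into an escaped JavaScript string literal.
    return 'document.write(' . json_encode($html) . ');';
}

// Made-up ad record, only there to run the example.
$ad = [
    'href'   => 'http://example.com/router.php?adName=demo&adServed=banner',
    'img'    => 'http://example.com/banners/468-60.gif',
    'width'  => 468,
    'height' => 60,
    'alt'    => 'Demo banner',
];

// A real output script would send a JavaScript content type header first.
echo buildAdScript($ad, false), "\n";
```

<p>With <code>$isSpider</code> set to TRUE the same call adds the nofollow value to the link, everything else stays the same.</p>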
<p>Matt Cutts <a href="http://www.stonetemple.com/articles/interview-matt-cutts.shtml">said</a> that <a href="http://www.mattcutts.com/blog/bot-obedience-herding-googlebot/#comment-45561">JavaScript links don&#8217;t prevent Googlebot from crawling</a>, but that <a href="http://www.seomoz.org/blog/the-paid-links-debate-rages-on-ses-san-jose-2007">those links</a> <a href="http://www.mattcutts.com/blog/how-to-report-paid-links/#comment-101482">don&#8217;t count for rankings</a> (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can&#8217;t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it&#8217;s fetchable by crawlers, so I wouldn&#8217;t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That&#8217;s why I use rel-nofollow &#8211;on crawler requests&#8211; even in JS ads.</p>
<p>Change your user agent name to Googlebot or so, install <a href="http://www.mattcutts.com/blog/seeing-nofollow-links/">Matt&#8217;s show nofollow hack</a> or something similar, and you&#8217;ll see that the affiliate-URL gets nofollow&#8217;ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can&#8217;t offend my readers with nofollow&#8217;ed links (although I don&#8217;t care about crawler spoofing, actually that&#8217;s a good procedure to let advertisers check out my linking attitude).</p>
<p>We&#8217;ll look at the affiliate URL from the code above later on; first let&#8217;s discuss other ways to make ads more search engine friendly. Search engines don&#8217;t count pages displayed in iFrames as on-page contents, especially not when the iFrame&#8217;s content is hosted on another domain. Here is an example straight from the horse&#8217;s mouth: <code><br />
&lt;iframe name=&quot;google_ads_frame&quot; src=&quot;http://pagead2.googlesyndication.com/pagead/ads? very-long-and-ugly-query-string&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; vspace=&quot;0&quot; hspace=&quot;0&quot; allowtransparency=&quot;true&quot; frameborder=&quot;0&quot; height=&quot;90&quot; scrolling=&quot;no&quot; width=&quot;728&quot;&gt;&lt;/iframe&gt;</code> In a noframes tag we could put a static ad for surfers using browsers which don&#8217;t support frames/iFrames. </p>
<p>If for some reason you don&#8217;t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to anybody, and just browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael&#8217;s <a href="http://www.bad-neighborhood.com/htmlhashing.htm">Full Text/HTML Obfuscator Tool</a> (hat tip to <a href="http://www.seo-scoop.com/2007/09/13/new-tool-to-hide-stuff/">Donna</a>).</p>
<h3>Always redirect to affiliate URLs</h3>
<p>There&#8217;s absolutely no point in using ugly affiliate URLs on your pages. Actually, that&#8217;s the last thing you want to do for various reasons.
<ul>
<li>For example, affiliate URLs as well as source codes can change, and you don&#8217;t want to edit tons of pages if that happens.</li>
<li>When an affiliate program doesn&#8217;t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you&#8217;d replace outdated ads completely with one mouse click or so.</li>
<li>Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own &#8211;often confusing&#8211; layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don&#8217;t track your outgoing traffic, some sponsors will cheat and you can&#8217;t prove your gut feelings.</li>
<li>Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don&#8217;t smell like affiliate URLs.</li>
<li><b>&#8230;</b></li>
</ul>
<p>When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really <b>manage your affiliate links</b>. There are many more good reasons for utilizing ad-servers, for example smart search engines which might think that your advertising is overwhelming.</p>
<p>Affiliate links provide great footprints. Unique URL parts or <b>query string variable names</b> gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The <b>values</b> identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index or delivering minus points to the query engine. Also, that doesn&#8217;t mean that affiliate URLs are the one and only trackable footprint Google relies on. But that&#8217;s one trackable footprint you can avoid to some degree. </p>
<p>If the redirect-script&#8217;s location is on the same server (in fact it&#8217;s not thanks to symlinks) and not named &#8220;adserver&#8221; or so, chances are that a heuristic check won&#8217;t identify the link&#8217;s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren&#8217;t crawlable with parameters, or similar &#8220;legit&#8221; stuff. You can&#8217;t fool smart algos forever, but if you&#8217;ve a good reason to hide ads every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine&#8217;s point of view, and users might agree on this).</p>
<p>Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because it doesn&#8217;t benefit you (<a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">usually</a>). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant&#8217;s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect, if they by accident found an affiliate link without link condom.</p>
<h3>The behavior of an adserver URL masking an affiliate link</h3>
<p>Let&#8217;s look at the redirect-script&#8217;s URL from my code example above:<br />
<a href="http://sebastians-pamphlets.com/ads/router.php?adName=seobook&#038;adServed=banner">/propaganda/router.php?adName=seobook&#038;adServed=banner</a><br />
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert. </p>
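<p>That bookkeeping step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name <code>resolveClick</code>, the <code>$ads</code> lookup table, the record layout and the example.com URLs. It only builds the record; where you store it (database, log file) is up to you.</p>

```php
<?php
// Hypothetical sketch of the router's bookkeeping step.
// The $ads table, its keys and the record layout are assumptions.
function resolveClick(array $ads, array $query, string $referrer, int $now): ?array {
    $adName   = isset($query['adName']) ? (string) $query['adName'] : '';
    $adServed = isset($query['adServed']) ? (string) $query['adServed'] : '';
    if (!isset($ads[$adName])) {
        return null; // unknown ad, the caller decides how to respond
    }
    return [
        'destination' => $ads[$adName], // sponsor URL to redirect to
        'adName'      => $adName,       // identifies the affiliate link
        'adServed'    => $adServed,     // sort/type/variation of the clicked ad
        'page'        => $referrer,     // URL of the page carrying the advert
        'timestamp'   => gmdate('Y-m-d H:i:s', $now),
    ];
}

// Made-up ad table and request data, only there to run the example.
$ads   = ['demo' => 'http://example.com/landing.html'];
$click = resolveClick(
    $ads,
    ['adName' => 'demo', 'adServed' => 'banner'],
    'http://example.com/some-post/',
    time()
);
print_r($click);
```

<p>In a live script the query array would be <code>$_GET</code> and the referrer would come from the request headers, which can be missing or faked, so treat the stored page URL as a hint, not as proof.</p>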
<p>Now that we&#8217;ve covered the statistical requirements, router.php calls the checkCrawlerIP() function, which sets $isSpider to TRUE only when both the user agent and the host name found by a reverse DNS lookup of the requestor&#8217;s IP address identify a search engine crawler, and a forward DNS lookup of that host name resolves back to the requestor&#8217;s IP address.</p>
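<p>Here is a minimal sketch of such a double check. It is not my checkCrawlerIP() function: the name <code>isVerifiedCrawler</code>, the table of user agent tokens and host name suffixes, and the injectable lookup callbacks are assumptions for illustration. The callbacks default to PHP&#8217;s <code>gethostbyaddr()</code> and <code>gethostbynamel()</code>, and exist only so the example runs without network access.</p>

```php
<?php
// Hypothetical sketch of a crawler check, not the author's checkCrawlerIP().
// The user agent token => host suffix table is an assumption for illustration.
function isVerifiedCrawler(
    string $userAgent,
    string $ip,
    ?callable $reverseLookup = null,
    ?callable $forwardLookup = null
): bool {
    $reverseLookup = $reverseLookup ?? 'gethostbyaddr';
    $forwardLookup = $forwardLookup ?? 'gethostbynamel';
    $crawlers = [
        'Googlebot' => '.googlebot.com',
        'Slurp'     => '.crawl.yahoo.net',
    ];
    foreach ($crawlers as $uaToken => $hostSuffix) {
        if (stripos($userAgent, $uaToken) === false) {
            continue; // the user agent doesn't claim to be this crawler
        }
        // Step 1: reverse lookup, IP address to host name.
        $host = $reverseLookup($ip);
        if (!is_string($host) || $host === $ip) {
            return false; // lookup failed
        }
        // Step 2: the host name must end with the engine's crawler domain.
        if (substr(strtolower($host), -strlen($hostSuffix)) !== $hostSuffix) {
            return false;
        }
        // Step 3: forward lookup, the host name must resolve back to the same IP.
        $ips = $forwardLookup($host);
        return is_array($ips) && in_array($ip, $ips, true);
    }
    return false;
}

// Fake resolvers with made-up data, so the example needs no DNS.
$fakeReverse = function (string $ip) { return 'crawl-192-0-2-10.googlebot.com'; };
$fakeForward = function (string $host) { return ['192.0.2.10']; };
var_dump(isVerifiedCrawler('Mozilla/5.0 (compatible; Googlebot/2.1)', '192.0.2.10', $fakeReverse, $fakeForward));
```

<p>DNS lookups are slow, so on a busy site you&#8217;d probably cache the verdict per IP address instead of resolving on every request.</p>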
<p>If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor&#8217;s landing page: <code><br />
$sponsorUrl      = &quot;http://www.seobook.com/262.html&quot;;<br />
$requestProtocol = $_SERVER[&quot;SERVER_PROTOCOL&quot;];<br />
$protocolArr     = explode(&quot;/&quot;,$requestProtocol);<br />
$protocolName    = trim($protocolArr[0]);<br />
$protocolVersion = trim($protocolArr[1]);<br />
if (stristr($protocolName,&quot;HTTP&quot;)<br />
    &#038;&#038; strtolower($protocolVersion) > &quot;1.0&quot; ) {<br />
    $httpStatusCode = 307;<br />
}<br />
else {<br />
    $httpStatusCode = 302;<br />
}<br />
$httpStatusLine = &quot;$requestProtocol $httpStatusCode Temporary Redirect&quot;;<br />
@header($httpStatusLine, TRUE, $httpStatusCode);<br />
@header(&quot;Location: $sponsorUrl&quot;);<br />
exit;</code><br />
A 307 redirect avoids caching issues, because user agents must not cache a 307 response unless it explicitly allows that with Cache-Control or Expires headers, and we don&#8217;t send any. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. User agents might cache the locations provided by 302 redirects, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can&#8217;t use the 301 response code, because 301 redirects are permanent and cacheable by default. (<a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/">More information on HTTP redirects</a>.)</p>
<p>If the requestor is a major search engine&#8217;s crawler, we perform the most brutal bounce back known to man: <code><br />
if ($isSpider) {<br />
    @header(&quot;HTTP/1.1 403 Sorry Crawlers Not Allowed&quot;, TRUE, 403);<br />
    @header(&quot;X-Robots-Tag: nofollow,noindex,noarchive&quot;);<br />
    exit;<br />
}</code><br />
The 403 response code translates to &#8220;kiss my ass and get the fuck outta here&#8221;. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn&#8217;t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine &#8220;forget this URL, don&#8217;t request it again&#8221;. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code prevent you from URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.</p>
<p>If you don&#8217;t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they otherwise would count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don&#8217;t need to live in fear of penalties.</p>
<h3 id="recap-hide-afflinks">Recap</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/prospering-affiliate-link.png" width="200" height="200" border="0" align="right" style="margin-left:4px;" alt="NICE: prospering affiliate link" title="Prospering affiliate link" />Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.</p>
<p>Don&#8217;t deliver different (editorial) contents to users and crawlers, but also don&#8217;t serve ads to crawlers. They just don&#8217;t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards able to understand Ajax, respectively authorized to fill out and submit Web forms.</p>
<p>Your ads look plain ugly with dotted borders in firebrick, hence don&#8217;t apply rel=&quot;nofollow&quot; to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.</p>
<p>No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers. </p>
<p>Do not allow search engine crawlers to follow your affiliate links, paid links, nor other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google&#8217;s job is not exactly funny.</p>
<p>I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you&#8217;re not paranoid, that does not mean Google will not chase you down. You really don&#8217;t need to be a so called black hat to displease Google. Not knowing or not understanding <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Google&#8217;s 12 commandments</a> doesn&#8217;t prevent you from being spanked for sins you&#8217;ve never heard of. If you&#8217;re keen on Google&#8217;s nicely targeted traffic, better play by Google&#8217;s rules, at least on crawler requests.</p>
<p>Feel free to contribute your tips and tricks in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/feed/</wfw:commentRss>
		</item>
		<item>
		<title>A pragmatic defence against Google&#8217;s anti paid links campaign</title>
		<link>http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/</link>
		<comments>http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/#comments</comments>
		<pubDate>Fri, 26 Oct 2007 14:39:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/</guid>
		<description><![CDATA[Google&#8217;s recent shot across the bows of a gazillion sites handling paid links, advertising, or internal cross links not compliant to Google&#8217;s imagination of a natural link is a call for action. Google&#8217;s message is clear: &#8220;condomize your commercial links or suffer&#8221; (from deducted toolbar PageRank, links without the ability to pass real PageRank and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://sebastians-pamphlets.com/google-pagerank-deductions-october-2007/">Google&#8217;s recent shot across the bows</a> of a gazillion sites handling <a href="http://sebastians-pamphlets.com/links/categories/?cat=paid-links">paid links</a>, <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">advertising</a>, or <a href="http://sebastians-pamphlets.com/links/categories/?cat=risky-linkage">internal cross links</a> not compliant to <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66736">Google&#8217;s imagination of a natural link</a> is a call for action. Google&#8217;s message is clear: &#8220;<a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">condomize</a> your commercial links or suffer&#8221; (from deducted toolbar PageRank, links without the ability to pass real PageRank and relevancy signals, or perhaps even penalties).</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/paid-links-evil-versus-good.png" width="250" height="116" align="right" style="margin-left:4px;" alt="Paid links: good versus evil" title="Paid links: Google versus Web" />Of course that&#8217;s somewhat evil, because applying <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">nofollow values</a> to all sorts of links is not exactly a natural thing to do; visitors don&#8217;t care about invisible link attributes and sometimes they&#8217;re even pissed when they get redirected to a URL not displayed in their status bar. Also, this requirement forces Webmasters to invest enormous efforts in code maintenance for the sole purpose of satisfying search engines. The argument &#8220;if Google doesn&#8217;t like these links, then they can discount them in their system, without bothering us&#8221; has its merits, but unfortunately that&#8217;s not the way Google&#8217;s cookie crumbles for various reasons. Hence let&#8217;s develop a pragmatic procedure to handle those links.</p>
<h3>The problem</h3>
<p>Google thinks that uncondomized paid links as well as commercial links to sponsors or affiliated entities aren&#8217;t natural, because the terms &#8220;sponsor|pay for review|advertising|my other site|sign-up|&#8230;&#8221; and &#8220;editorial vote&#8221; are not compatible in the sense of <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Google&#8217;s guidelines</a>. This view at the Web&#8217;s linkage is pretty black vs. white.</p>
<p>Either you link out because a sponsor bought ads, or you don&#8217;t sell ads and link out for free because you honestly think your visitors will like a page. Links to sponsors without condom are black, links to sites you like and which you don&#8217;t label &#8220;sponsor&#8221; are white. </p>
<p>There&#8217;s nothing in between; gray areas like links to hand picked sponsors on a page with a gazillion links count as black. Google doesn&#8217;t care whether or not your clean links actually pass a reasonable amount of PageRank to link destinations which buy ad space too, the sole possibility that those links <em>could</em>&nbsp; influence search results is enough to qualify you as sort of a link seller. </p>
<p>The same goes for paid reviews on blogs and whatnot, see for example <a href="http://andybeard.eu/2007/10/penalty-confirmed-but-i-dont-sell-pagerank.html">Andy&#8217;s problem</a> with his honest reviews which Google classifies as paid links, and of course all sorts of traffic deals, <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">affiliate links</a>, banner ads and stuff like that. </p>
<p>You don&#8217;t even need to label a clean link as advert or sponsored. If the link destination matches a domain in Google&#8217;s database of on-line advertisers, link buyers, e-commerce sites / merchants etcetera, or Google figures out that you link too much to affiliated sites or other sites you own or control, then your toolbar PageRank is toast and most probably your outgoing links will be penalized. Possibly these penalties have an impact on your internal links too, which results in less PageRank landing on subsidiary pages. Less PageRank gathered by your landing pages means less crawling, less ranking, fewer SERP referrers, less revenue.</p>
<h3>The solution</h3>
<p>You&#8217;re absolutely right when you say that such search engine nitpicking should not force you to throw nofollow crap on your links like confetti. From your and my point of view condomizing links is wrong, but sometimes it&#8217;s better to pragmatically comply with such policies in order to stay in the game.  </p>
<p>Although uncrawlable redirect scripts have advantages in some cases, the simplest procedure to condomize a link is the <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">rel-nofollow</a> <a href="http://sebastians-pamphlets.com/links/categories/?cat=microformats">microformat</a>. Here is an example of a googlified affiliate link:<code><br />
&lt;a href="http://sponsor.com/?affID=1" rel="nofollow"&gt;Sponsor&lt;/a&gt;</code></p>
<h3>Why serve your visitors search engine crawler directives?</h3>
<p>Complying with Google&#8217;s laws does not mean that you must deliver <a href="http://sebastians-pamphlets.com/links/categories/?cat=crawler-directives">crawler directives</a> like rel=&quot;nofollow&quot; to your visitors. Since Google is concerned about search engine rankings influenced by uncondomized links with commercial intent, serving crawler directives to crawlers and clean links to users is perfectly in line with Google&#8217;s goals. Actually, initiatives like the <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tag</a> make clear that hiding crawler directives from users is fine with Google. To underline that, here is a quote from <a href="http://www.mattcutts.com/blog/hidden-links/">Matt Cutts</a>:<br />
<blockquote>[&#8230;] If you want to sell a link, <b>you should at least provide machine-readable disclosure</b> for paid links by making your link in a way that doesn’t affect search engines. [&#8230;]</p>
<p>The other best practice I’d advise is to provide human readable disclosure that a link/review/article is paid. You could put a badge on your site to disclose that some links, posts, or reviews are paid, but including the disclosure on a per-post level would better. Even something as simple as &#8220;This is a paid review&#8221; fulfills the human-readable aspect of disclosing a paid article. [&#8230;]</p>
<p><b>Google’s quality guidelines are more concerned with the machine-readable aspect of disclosing paid links/posts</b> [&#8230;]</p>
<p>To make sure that you’re in good shape, go with both human-readable disclosure and machine-readable disclosure, using any of the methods [uncrawlable redirects, rel-nofollow] I mentioned above.<br />
[emphasis mine]</p></blockquote>
<p>Since Google devalues paid links anyway, search engine friendly cloaking of rel-nofollow for Googlebot is a non-issue with advertisers, as long as this fact is disclosed. I bet most link buyers look at the magic green pixels anyway, but that&#8217;s their problem.</p>
<h3>How to cloak rel-nofollow for search engine crawlers</h3>
<p>I&#8217;ll discuss a PHP/Apache example, but this method is adaptable to other server sided scripting languages like ASP or so with ease. If you&#8217;ve a static site and PHP is available on your (*ix) host, you need to tell Apache that you&#8217;re using PHP in .html (.htm) files. Put this statement in your root&#8217;s .htaccess file: <code><br />
AddType application/x-httpd-php .html .htm</code></p>
<p>Next create a plain text file, insert the code below, and upload it as &#8220;funct_nofollow.php&#8221; or so to your server&#8217;s root directory (or a subdirectory, but then you need to change some code below). <code><br />
&lt;?php<br />
function makeRelAttribute ($linkClass) {<br />
    $numargs = func_num_args();<br />
    // optional 2nd input parameter: $relValue<br />
    $relValue = &quot;&quot;;<br />
    if ($numargs &gt;= 2) {<br />
        $relValue = func_get_arg(1) .&quot; &quot;;<br />
    }<br />
    // browsers and robots may omit the referrer, so fall back to an empty string<br />
    $referrer                   = isset($_SERVER[&quot;HTTP_REFERER&quot;]) ? $_SERVER[&quot;HTTP_REFERER&quot;] : &quot;&quot;;<br />
    $refUrl                     = parse_url($referrer);<br />
    $refHost                    = isset($refUrl[&quot;host&quot;]) ? $refUrl[&quot;host&quot;] : &quot;&quot;;<br />
    $isSerpReferrer             = FALSE;<br />
    if (stristr($refHost, &quot;google.&quot;) ||<br />
        stristr($refHost, &quot;yahoo.&quot;))<br />
        $isSerpReferrer         = TRUE;<br />
    $userAgent                  = isset($_SERVER[&quot;HTTP_USER_AGENT&quot;]) ? $_SERVER[&quot;HTTP_USER_AGENT&quot;] : &quot;&quot;;<br />
    $isCrawler                  = FALSE;<br />
    if (stristr($userAgent, &quot;Googlebot&quot;) ||<br />
        stristr($userAgent, &quot;Slurp&quot;))<br />
        $isCrawler              = TRUE;<br />
    if ($isCrawler  <b>/*</b>|| $isSerpReferrer<b>*/</b> ) {<br />
        if (&quot;$linkClass&quot; == &quot;ad&quot;)   $relValue .= &quot;advertising nofollow&quot;;<br />
        if (&quot;$linkClass&quot; == &quot;paid&quot;) $relValue .= &quot;sponsored nofollow&quot;;<br />
        if (&quot;$linkClass&quot; == &quot;own&quot;)  $relValue .= &quot;affiliated nofollow&quot;;<br />
        if (&quot;$linkClass&quot; == &quot;vote&quot;) $relValue .= &quot;editorial dofollow&quot;;<br />
    }<br />
    // nothing to output when neither a crawler directive nor a REL value applies<br />
    if (trim($relValue) == &quot;&quot;)<br />
        return &quot;&quot;;<br />
    return &quot; rel=\&quot;&quot; .trim($relValue) .&quot;\&quot; &quot;;<br />
} // end function makeRelAttribute<br />
?&gt;   </code></p>
<p>Next put the code below in a PHP file you&#8217;ve included in all scripts, for example header.php. If you&#8217;ve static pages, then insert the code at the very top. <code><br />
&lt;?php<br />
@include($_SERVER[&quot;DOCUMENT_ROOT&quot;] .&quot;/funct_nofollow.php&quot;);<br />
?&gt;   </code><br />
Do not paste the function <code>makeRelAttribute</code> itself! If you spread code this way you&#8217;ve to edit tons of files when you need to change the functionality later on.</p>
<p>Now you can use the function <code>makeRelAttribute($linkClass,$relValue)</code> within the scripts or HTML pages. The function has an input parameter $linkClass and knows the (self-explanatory) values &#8220;ad&#8221;, &#8220;paid&#8221;, &#8220;own&#8221; and &#8220;vote&#8221;. The second (optional) input parameter is a value for the <a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=90#a-rel">A element&#8217;s REL attribute</a> itself. If you provide it, the crawler directives get appended to it when a spider is detected; otherwise <code>makeRelAttribute</code> creates a REL attribute with this value alone. Examples below. You can add more user agents, or serve rel-nofollow to visitors coming from SERPs by enabling the <code>|| $isSerpReferrer</code> condition (remove the bold <code>/*</code>&amp;<code>*/</code>).</p>
<p>When you code a hyperlink, just add the function to the A tag. Here is a PHP example: <code><br />
print &quot;&lt;a href=\&quot;http://google.com/\&quot;&quot; .makeRelAttribute(&quot;ad&quot;) .&quot;&gt;Google&lt;/a&gt;&quot;; </code><br />
will output<br />
<code>&lt;a href=&quot;http://google.com/&quot; rel=&quot;advertising nofollow&quot; &gt;Google&lt;/a&gt;</code><br />
when the user agent is Googlebot, and<br />
<code>&lt;a href=&quot;http://google.com/&quot;&gt;Google&lt;/a&gt;</code><br />
to a browser.</p>
<p>If you can&#8217;t write nice PHP code, for example because you&#8217;ve to follow crappy guidelines and worst practices with a WordPress blog, then you can mix <span style="color:blue;">HTML</span> and <span style="color:green;">PHP</span> tags: <code><br />
<span style="color:blue;">&lt;a href=&quot;http://search.yahoo.com/&quot;</span><span style="color:green;">&lt;?php print makeRelAttribute(&quot;paid&quot;); ?&gt;</span><span style="color:blue;">&gt;Yahoo&lt;/a&gt;</span>   </code></p>
<p>Please note that this method is not safe with search engines or unfriendly competitors when you want to cloak for other purposes. Also, the link condoms are served to crawlers only. That means search engine staff reviewing your site with a non-crawler user agent name won&#8217;t spot the nofollow&#8217;ed links unless they check the engine&#8217;s cached page copy. An HTML comment in HEAD like &#8220;This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.&#8221; as well as a similar comment line in robots.txt would certainly help to pass reviews by humans.</p>
<h3>A Google-friendly way to handle paid links, affiliate links, and cross linking</h3>
<p>Load this page with different user agents and referrers. You can do this for example with a Firefox extension like <a href="http://sebastians-pamphlets.com/referrer-spoofing-with-prefbar-341/">PrefBar</a>. For testing purposes you can use these user agent names: <code><br />
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)<br />
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) </code><br />
and these SERP referrer URLs: <code><br />
http://google.com/search?q=viagra<br />
http://search.yahoo.com/search?p=viagra&#038;ei=utf-8&#038;iscqry=&#038;fr=sfp </code><br />
Just enter these values in PrefBar&#8217;s user agent and referrer spoofing options (click &#8220;Customize&#8221; on the toolbar, select &#8220;User Agent&#8221; / &#8220;Referrerspoof&#8221;, click &#8220;Edit&#8221;, add a new item, label it, then insert the strings above). Here is the code above in action:</p>
<table style="margin-bottom:30px;">
<tr>
<td valign="top"><b>Referrer URL:</b></td>
<td valign="top"></td>
</tr>
<tr>
<td valign="top"><b>User Agent Name:</b></td>
<td valign="top">CCBot/1.0 (+http://www.commoncrawl.org/bot.html)</td>
</tr>
<tr>
<td valign="top"><b>Ad</b> makeRelAttribute(&#8221;ad&#8221;): </td>
<td valign="top"><a href="http://google.com/">Google</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>Paid</b> makeRelAttribute(&#8221;paid&#8221;): </td>
<td valign="top"><a href="http://search.yahoo.com/"  >Yahoo</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>Own</b> makeRelAttribute(&#8221;own&#8221;): </td>
<td valign="top"><a href="http://sebastians-pamphlets.com/"  >Sebastian&#8217;s Pamphlets</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>Vote</b> makeRelAttribute(&#8221;vote&#8221;): </td>
<td valign="top"><a href="http://link-condom.com/"  >The Link Condom</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>External</b> makeRelAttribute(&quot;&quot;, &quot;external&quot;): </td>
<td valign="top"><a href="http://w3.org/"  rel="external"  >W3C</a> <code> rel="external" </code></td>
</tr>
<tr>
<td valign="top"><b>Without parameters</b> makeRelAttribute(&quot;&quot;): </td>
<td valign="top"><a href="http://sphinn.com/"  >Sphinn</a> <code></code></td>
</tr>
</table>
<p>When you change your browser&#8217;s user agent to a crawler name, or fake a SERP referrer, the REL value will appear in the right column.</p>
<p>When you&#8217;ve developed a better solution, or when you&#8217;ve a nofollow-cloaking tutorial for other programming languages or platforms, please let me know in the comments. Thanks in advance!</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise</title>
		<link>http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/</link>
		<comments>http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#comments</comments>
		<pubDate>Tue, 09 Oct 2007 14:57:53 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/</guid>
		<description><![CDATA[We find redirects on every Web site out there. They&#8217;re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, but seldom perfect from an SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictory, misleading and plain false advice on redirects. If you for example read &#8220;for [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/http-redirects.png" width="200" height="150" alt="HTTP Redirects" title="HTTP Redirects" style="margin-left:3px;" align="right"  />We find redirects on every Web site out there. They&#8217;re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, but seldom perfect from an SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictory, misleading and plain false advice on redirects. If you for example read &#8220;for SEO purposes you must make use of 301 redirects only&#8221; then better close the browser window/tab to protect yourself from crappy advice. A 302 or 307 redirect can be search engine friendly too.</p>
<p>With this post I do plan to bore you to death. So lean back, grab some popcorn, and stay tuned for a longish piece explaining the Interweb&#8217;s forwarding requests as dull as dust. Or, if you know everything about redirects, then please digg, sphinn and stumble this post before you surf away. Thanks.</p>
<ul id="redirect-jump-station" style="margin-bottom:25px;"><b>Jump Station</b></p>
<li class="toc-h3"><a href="#post-203">The anatomy of a server sided redirect</a></li>
<li class="toc-h4"><a href="#http-redirect-def">Redirects are defined in the HTTP protocol, not in search engine guidelines</a></li>
<li class="toc-h3"><a href="#whats-a-ss-redirect">What is a server sided redirect?</a></li>
<li class="toc-h4"><a href="#exec-ss-redirect">Execution of server sided redirects</a></li>
<li class="toc-h3"><a href="#http-redirect-header">What is an HTTP redirect header?</a></li>
<li class="toc-h4"><a href="#http-status-line">The redirect response code in a HTTP status line</a></li>
<li class="toc-h4"><a href="#http-header-location">The redirect header&#8217;s &#8220;location&#8221; directive</a></li>
<li class="toc-h3"><a href="#how-to-implement-ss-redirect">How to implement a server sided redirect?</a></li>
<li class="toc-h4"><a href="#redirect-server-config">Redirects in server configuration files</a></li>
<li class="toc-h4"><a href="#redirect-dir-files-htaccess">Redirecting directories and files with .htaccess</a></li>
<li class="toc-h4"><a href="#redirect-in-scripts">Redirects in server sided scripts</a></li>
<li class="toc-h3"><a href="#invisible-server-redirects">Redirects done by the Web server itself</a></li>
<li class="toc-h3"><a href="#redirect-or-not">Redirect or not? A few use cases&#8230;</a></li>
<li class="toc-h3"><a href="#choosing-a-redirect-response-code">Choosing the best redirect response code (301, 302, or 307)</a></li>
<li class="toc-h4"><a href="#301-moved-permanently">301 - Moved Permanently</a></li>
<li class="toc-h4"><a href="#moving-sites-301">Moving sites with 301 redirects</a></li>
<li class="toc-h4"><a href="#302-found-elsewhere">302 - Found [Elsewhere]</a></li>
<li class="toc-h4"><a href="#307-temporary-redirect">307 - Temporary Redirect</a></li>
<li class="toc-h3"><a href="#redirect-recap">Recap</a></li>
</ul>
<h4 id="http-redirect-def">Redirects are defined in the HTTP protocol, not in search engine guidelines</h4>
<p>For the moment please forget everything you&#8217;ve heard about redirects and their SEO implications, clear your mind, and follow me to the very basics defined in the HTTP protocol. Of course search engines interpret some redirects in a non-standard way, but understanding the norm as well as its use and abuse is necessary to deal with server sided redirects. I don&#8217;t bother with outdated HTTP 1.0 stuff, although some search engines still apply it every once in a while, hence I&#8217;ll discuss the 307 redirect introduced in <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">HTTP 1.1</a> too. For information on client sided redirects please refer to <a href="http://sebastians-pamphlets.com/google-and-yahoo-treat-undelayed-meta-refresh-as-301-redirect/">Meta Refresh - the poor man&#8217;s 301 redirect</a> or read my other <a href="http://sebastians-pamphlets.com/links/categories/?cat=redirects">pamphlets on redirects</a>, and stay away from <a href="http://sebastians-pamphlets.com/links/categories/?cat=javascript-redirects">JavaScript URL manipulations</a>.</p>
<h3 id="whats-a-ss-redirect">What is a server sided redirect?</h3>
<p>Think about an HTTP redirect as a forwarding request. Although redirects work slightly differently from snail mail forwarding requests, this analogy perfectly fits the <em>procedure</em>. Whilst with <a href="https://moversguide.usps.com/?referral=USPS">US Mail forwarding requests</a> a clerk or postman writes the new address on the envelope before it bounces in front of a no longer valid or temporarily abandoned letter-box or pigeon hole, on the Web the request&#8217;s location (that is the Web server responding to the <em>server name</em> part of the URL) provides the requestor with the new location (absolute URL). </p>
<p>A server sided redirect tells the user agent (browser, Web robot, &#8230;) that it has to perform another request for the URL given in the HTTP header&#8217;s &#8220;location&#8221; line in order to fetch the requested contents. The type of the redirect (301, 302 or 307) also instructs the user agent how to perform future requests of the Web resource. Because search engine crawlers/indexers try to emulate human traffic with their content requests, it&#8217;s important to choose the right redirect type both for humans and robots. That does not mean that a 301-redirect is always the best choice, and it certainly does not mean that you always must return the same HTTP response code to crawlers and browsers. More on that later.</p>
<h4 id="exec-ss-redirect">Execution of server sided redirects</h4>
<p>Server sided redirects are executed <b>before</b> your server delivers any content. In other words, your server ignores everything it <b>could</b> deliver (be it a static HTML file, a script output, an image or whatever) when it runs into a redirect condition. Some redirects are done by the server itself (see <a href="http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/#incomplete-uri">handling incomplete URIs</a>), and there are several places where you can set (conditional) redirect directives: Apache&#8217;s <a href="http://httpd.apache.org/docs/2.2/configuring.html">httpd.conf</a>, <a href="http://httpd.apache.org/docs/2.2/howto/htaccess.html">.htaccess</a>, or in application layers for example in <a href="http://www.php.net/manual/en/function.header.php">PHP scripts</a>. (If you suffer from IIS/ASP maladies, <a href="http://www.cumbrowski.com/CarstenC/seo_301redirect_aspsrc.asp">this post</a> is for you.) <b>Examples:</b></p>
<table cellpadding="1" cellspacing="0" border="1" bordercolor="gray">
<tr>
<th><b>Browser Request:</b></th>
<th><code><b>ww.site.com<br />/page.php?id=1</b></code></th>
<th><code><b>site.com<br />/page.php?id=1</b></code></th>
<th><code><b>www.site.com<br />/page.php?id=1</b></code></th>
<th><code><b>www.site.com<br />/page.php?id=2</b></code></th>
</tr>
<tr>
<td><b>Apache:</b></td>
<td>301 header:<br /><code>www.site.com<br />/page.php?id=1</code></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td><b>.htaccess:</b></td>
<td>&nbsp;</td>
<td>301 header:<br /><code>www.site.com<br />/page.php?id=1</code></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td><b>/page.php:</b></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td valign="top">301 header:<br /><code>www.site.com<br />/page.php?id=2</code></td>
<td valign="top">200 header:<br /><code>(Info like content length...)</code><br />
<hr />Content:<br />Article #2</td>
</tr>
</table>
<p>The 301 header may or may not be followed by a hyperlink pointing to the new location, solely added for user agents which can&#8217;t handle redirects. Besides that link, there&#8217;s no content sent to the client <b>after</b> the redirect header.</p>
<p>More important, you must not send a single byte to the client <b>before</b> the HTTP header. If you for example code <code>[space(s)|tab|new-line|HTML code]&lt;?php ...</code> in a script that shall perform a redirect or is supposed to return a 404 header (or any HTTP header different from the server&#8217;s default instructions), you&#8217;ll produce a runtime error. The redirection fails, leaving the visitor with an ugly page full of cryptic error messages but no link to the new location.</p>
<p>That means in each and every page or script which possibly has to deal with the HTTP header, put the logic testing those conditions at the very top. <strong>Always send the header status code and optional further information like a new location to the client before you process the contents.</strong> </p>
<p>After the last redirect header line terminate execution with the &#8220;L&#8221; parameter in .htaccess, PHP&#8217;s <code>exit;</code> statement, or whatever.</p>
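<p>To illustrate the rules above, here is a minimal sketch of a script that decides about a redirect before it outputs anything. The ID mapping, the variable names and the example.com URL are assumptions made up for this example, not code running on this site: <code><br />
&lt;?php<br />
// Hypothetical mapping of outdated article IDs to their replacements<br />
$movedIds = array(&quot;1&quot; =&gt; &quot;2&quot;);<br />
$id = (isset($_GET[&quot;id&quot;]) &amp;&amp; is_string($_GET[&quot;id&quot;])) ? $_GET[&quot;id&quot;] : &quot;&quot;;<br />
if (isset($movedIds[$id])) {<br />
    // Status line and location go first, nothing was sent to the client yet<br />
    header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    header(&quot;Location: http://www.example.com/page.php?id=&quot; .$movedIds[$id]);<br />
    exit; // terminate, so that no content follows the redirect header<br />
}<br />
// Only now start to process and output the contents<br />
print &quot;Article #&quot; .htmlspecialchars($id);<br />
?&gt;   </code><br />
Note that there is not a single character in front of the opening PHP tag, and that the script ends right after the last redirect header line.</p>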
<h3 id="http-redirect-header">What is an HTTP redirect header?</h3>
<p>An HTTP redirect, regardless of its type, basically consists of two lines in the HTTP header. In this example I&#8217;ve requested http://www.sebastians-pamphlets.com/about/, which is an invalid URI because my server name lacks the www-thingy, hence my canonicalization routine outputs this HTTP header:<code><br />
<b>HTTP/1.1 301 Moved Permanently</b><br />
<span style="color:gray;">Date: Mon, 01 Oct 2007 17:45:55 GMT<br />
Server: Apache/1.3.37 (Unix) PHP/4.4.4</span><br />
<b>Location: http://sebastians-pamphlets.com/about/</b><br />
<span style="color:gray;">Connection: close<br />
Transfer-Encoding: chunked<br />
Content-Type: text/html; charset=iso-8859-1</span></code></p>
<h4 id="http-status-line">The redirect response code in a HTTP status line</h4>
<p>The first line of the header defines the protocol version, the response code, and provides a human readable reason phrase. Here is a shortened and slightly modified excerpt quoted from the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6">HTTP/1.1 protocol definition</a>:<br />
<blockquote><b>Status-Line</b></p>
<p>The first line of a <em>Response message</em> is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by <acronym title="Space, blank, ASCII 0x20">SP</acronym> (space) characters. No <acronym title="Carriage Return, ASCII 0x0D">CR</acronym> or <acronym title="Line Feed, ASCII 0x0A">LF</acronym> is allowed except in the final <acronym title="New Line, CR followed by LF">CRLF</acronym> sequence.</p>
<p>Status-Line = HTTP-Version <i>SP</i> Status-Code <i>SP</i> Reason-Phrase <i>CRLF</i><br />
[e.g. &#8220;HTTP/1.1 301 Moved Permanently&#8221; + CRLF]</p>
<p><b>Status Code and Reason Phrase</b></p>
<p>The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. [&#8230;] The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.</p>
<p>The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. [&#8230;]:<br />
[&#8230;]<br />
- <b>3xx</b>: Redirection - Further action must be taken in order to complete the request<br />
[&#8230;]</p>
<p>The individual values of the numeric status codes defined for HTTP/1.1, and an example set of corresponding Reason-Phrases, are presented below. The reason phrases listed here are only recommendations &#8212; they MAY be replaced by local equivalents without affecting the protocol [that means you could translate and/or rephrase them].<br />
[&#8230;]<br />
<span style="color:gray;">300: Multiple Choices</span><br />
<b>301: Moved Permanently</b><br />
<b>302: Found [Elsewhere]</b><br />
<span style="color:gray;">303: See Other<br />
304: Not Modified<br />
305: Use Proxy</span><br />
<b>307: Temporary Redirect</b><br />
[&#8230;]</p></blockquote>
<p>In terms of SEO the understanding of 301/302-redirects is important. 307-redirects, introduced with HTTP/1.1, can still confuse some search engines, even major players like Google when Ms. Googlebot for some reason thinks she <em>must</em> do HTTP/1.0 requests, usually caused by weird or ancient server configurations (or possibly testing newly discovered sites under certain circumstances). You should not perform 307 redirects as response to most HTTP/1.0 requests, use 302/301 &#8211;whatever fits best&#8211; instead. More info on this issue below in the 302/307 sections.</p>
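<p>In a script, one way to follow this advice is to look at the protocol version of the request before choosing the response code for a temporary redirect. This is only a sketch: the helper name <code>temporaryRedirectCode</code> and the example.com URL are made up for illustration: <code><br />
&lt;?php<br />
// Hypothetical helper: answer HTTP/1.0 requests with 302, everything else with 307<br />
function temporaryRedirectCode ($serverProtocol) {<br />
    if (stristr($serverProtocol, &quot;HTTP/1.0&quot;))<br />
        return 302;<br />
    return 307;<br />
}<br />
// If the server doesn&#039;t tell the protocol version, assume the older one<br />
$protocol = isset($_SERVER[&quot;SERVER_PROTOCOL&quot;]) ? $_SERVER[&quot;SERVER_PROTOCOL&quot;] : &quot;HTTP/1.0&quot;;<br />
// The third parameter of header() forces the response code<br />
header(&quot;Location: http://www.example.com/temporary-home/&quot;, TRUE, temporaryRedirectCode($protocol));<br />
exit;<br />
?&gt;   </code></p>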
<p>Please note that the default response code of all redirects is 302. That means when you send an HTTP header with a location directive but without an explicit response code, your server will return a 302-Found status line. That&#8217;s kinda crappy, because in most cases you want to avoid the 302 code like the plague. Do no nay never rely on default response codes! <strong>Always prepare a server sided redirect with a status line telling an actual response code (301, 302 or 307)!</strong> In server sided scripts (PHP, Perl, ColdFusion, JSP/Java, ASP/VB-Script&#8230;) always send a complete status line, and in .htaccess or httpd.conf add a <code>[R=301|302|307<span style="color:gray;">,L</span>]</code> parameter to statements like <code>RewriteRule</code>: <code><br />
RewriteRule (.*) http://www.site.com/$1 [R=301,L]</code></p>
<h4 id="http-header-location">The redirect header&#8217;s &#8220;location&#8221; field</h4>
<p>The next element you need in every redirect header is the <b>location</b> directive. Here is the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.30">official syntax</a>:<br />
<blockquote>
<b>Location</b></p>
<p>The Location response-header field is used to redirect the recipient to a location other than the Request-URI for completion of the request or identification of a new resource. [&#8230;] For 3xx responses, the location SHOULD indicate the server&#8217;s preferred URI for automatic redirection to the resource. The field value consists of a single absolute URI.</p>
<p>Location = &#8220;Location&#8221; &#8220;:&#8221; absoluteURI [+ CRLF]</p>
<p>An example is:<br />
<code><br />
Location: http://sebastians-pamphlets.com/about/</code></p></blockquote>
<p><img src="http://sebastians-pamphlets.com/img/posts/redirect-to-an-absolute-url.png" width="200" height="150" alt="Redirect to absolute URLs only" title="A redirect's location is ALWAYS an absolute URL!" style="margin-left:3px;" align="right"  />Please note that the value of the location field must be an <b>absolute URL</b>, that is a fully qualified URL with scheme (http|https), server name (domain|subdomain), and path (directory/file name) plus the optional query string (&#8221;?&#8221; followed by variable/value pairs like <code>?id=1&amp;page=2...</code>), no longer than 2047 bytes (better 255 bytes because most scripts out there don&#8217;t process longer URLs <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.1">for historical reasons</a>). A relative URL like <code>../page.php</code> <em>might</em> work in (X)HTML (although you better plan a spectacular suicide than any use of relative URIs!), but <strong>you must not use relative URLs in HTTP response headers</strong>!</p>
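<p>A simple way to make sure that a script never sends a relative location is to build every redirect target from one fixed site root. The sketch below assumes that scheme and canonical server name are the same for the whole site; the constant <code>SITE_ROOT</code>, the function name <code>absoluteUrl</code> and example.com are hypothetical: <code><br />
&lt;?php<br />
// Assumption: one canonical scheme and server name for all URLs of the site<br />
define(&quot;SITE_ROOT&quot;, &quot;http://example.com&quot;);<br />
// Hypothetical helper: turns a path like &quot;/about/&quot; or &quot;about/&quot; into an absolute URL<br />
function absoluteUrl ($path) {<br />
    return SITE_ROOT .&quot;/&quot; .ltrim($path, &quot;/&quot;);<br />
}<br />
header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
header(&quot;Location: &quot; .absoluteUrl(&quot;/about/&quot;));<br />
exit;<br />
?&gt;   </code><br />
Hard coding the server name instead of reusing the host name from the request also keeps requests with a wrong or faked host name from ending up in the location field.</p>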
<h3 id="how-to-implement-ss-redirect">How to implement a server sided redirect?</h3>
<p>You can perform HTTP redirects with statements in your Web server&#8217;s configuration, and in server sided scripts, e.g. PHP or Perl. JavaScript is a client sided language and therefore lacks a mechanism to do HTTP redirects. That means all JS redirects count as a 302-Found response.</p>
<p>Bear in mind that when you redirect, you possibly leave tracks of outdated structures in your HTML code, not to speak of incoming links. You must change each and every internal link to the new location, as well as all external links you control or where you can ask for a URL update. If you leave any outdated links, visitors probably don&#8217;t spot it (although every redirect slows things down), but search engine spiders continue to follow them, which eventually ends in <a href="http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/">redirect chains</a>. Chained redirects often are the cause of deindexing pages, site areas or even complete sites by search engines, hence do no more than one redirect in a row and consider two redirects in a row risky. You don&#8217;t control offsite redirects; in some cases a search engine has already counted one or two redirects before it requests your redirecting URL (caused by redirecting traffic counters etcetera). <b>Always redirect to the final destination to avoid useless hops which kill your search engine traffic.</b> (<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40132">Google recommends</a> &#8220;that you use fewer than five redirects for each request&#8221;, but don&#8217;t try to max out such limits because other services might be less BS-tolerant.)</p>
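<p>If you want to check an outdated URL for useless hops, you can count the redirect status lines a request produces. The following is a rough sketch, not a finished tool: the function name <code>countRedirectHops</code> and the example.com URL are made up, and it assumes that PHP&#8217;s <code>get_headers()</code> is allowed to open URLs on your host (allow_url_fopen) and follows redirects, which is its default behavior: <code><br />
&lt;?php<br />
// Hypothetical helper: counts the 301/302/303/307 status lines in a list of header lines<br />
function countRedirectHops ($headerLines) {<br />
    $hops = 0;<br />
    foreach ($headerLines as $line) {<br />
        // status lines look like &quot;HTTP/1.1 301 Moved Permanently&quot;<br />
        if (preg_match(&quot;#^HTTP/\S+\s+30[1237]#&quot;, $line))<br />
            $hops++;<br />
    }<br />
    return $hops;<br />
}<br />
// get_headers() returns the header lines of all responses, or FALSE on failure<br />
$headerLines = @get_headers(&quot;http://example.com/old-stuff/contact.html&quot;);<br />
if ($headerLines &amp;&amp; countRedirectHops($headerLines) &gt; 1)<br />
    print &quot;Redirect chain detected, point the first redirect to the final destination.&quot;;<br />
?&gt;   </code></p>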
<p>Like conventional forwarding requests, redirects do expire. Even a permanent 301-redirect&#8217;s source URL will be requested by search engines every now and then because they can&#8217;t trust you. As long as there is one single link pointing to an outdated and redirecting URL out there, it&#8217;s not forgotten. It will stay alive in search engine indexes and address books of crawling engines even when the last link pointing to it was changed or removed. You can&#8217;t control that, and you can&#8217;t find all inbound links a search engine knows, despite their better reporting nowadays (neither <a href="https://siteexplorer.search.yahoo.com/">Yahoo&#8217;s site explorer</a> nor <a href="https://www.google.com/webmasters/tools/siteoverview">Google&#8217;s link stats</a> show you all links!). That means <b>you must maintain your redirects forever, and you must not remove (permanent) redirects</b>. Maintenance of redirects includes hosting abandoned domains, and updates of location directives whenever you change the final structure. <b>With each and every revamp that comes with URL changes check for incoming redirects and make sure that you eliminate unnecessary hops.</b></p>
<p>Often you&#8217;ve many choices where and how to implement a particular redirect. You can do it in scripts and even static HTML files, CMS software, or in the server configuration. There&#8217;s no such thing as a general best practice, just a few hints to bear in mind.
<ul>
<li><img src="http://sebastians-pamphlets.com/img/posts/redirects-are-dynamite-so-blast-carefully.png" width="150" height="164" alt="Redirects are dynamite, so blast carefully" title="SEO wise, the best redirect is no redirect at all!" style="margin-left:3px;" align="right"  /><b>Doubt</b>: Don&#8217;t believe Web designers and developers when they say that a particular task can&#8217;t be done without redirects. Do your own research, or ask an SEO expert. When you for example plan to make a static site dynamic by pulling the contents from a database with PHP scripts, you don&#8217;t need to change your file extensions from *.html to *.php. Apache can parse .html files for PHP, just enable that in your root&#8217;s .htaccess: <code><br />
AddType application/x-httpd-php .html <span style="color:gray;">.htm .shtml .txt .rss .xml .css</span></code><br />
Then generate tiny PHP scripts calling the CMS to replace the outdated .html files. That&#8217;s not perfect but way better than URL changes, provided your developers can manage the outdated links in the CMS&#8217; navigation. Another pretty popular abuse of redirects is click tracking. You don&#8217;t need a redirect script to count clicks in your database, <a href="http://sebastians-pamphlets.com/how-to-turn-click-tracking-into-miserable-failure/">make use of the onclick event instead</a>. </li>
<li><b>Transparency</b>: When the shit hits the fan and you need to track down a redirect with not more than the HTTP header&#8217;s information in your hands, you&#8217;ll begin to believe that performance and elegant coding are not everything. Reading and understanding a large httpd.conf file, several complex .htaccess files, and searching redirect routines in a conglomerate of a couple generations of scripts and include files is not exactly fun. You could add a custom field identifying the piece of redirecting code to the HTTP header. In .htaccess that would be achieved with <code><br />
Header add X-Redirect-Src &quot;/content/img/.htaccess&quot;</code><br />
and in PHP with <code><br />
header(&quot;X-Redirect-Src: /scripts/inc/header.php&quot;, TRUE);</code><br />
(Whether or not you should encode or at least obfuscate code locations in headers depends on your security requirements.) </li>
<li><b>Encapsulation</b>: When you must implement redirects in more than one script or include file, then encapsulate all redirects including all the logic (redirect conditions, determining new locations, &#8230;). You can do that in an include file with a meaningful file name for example. Also, instead of plastering the root&#8217;s .htaccess file with tons of directory/file specific redirect statements, you can gather all requests for redirect candidates and call a script which tests the REQUEST_URI to execute the suitable redirect. In .htaccess put something like:<code><br />
RewriteEngine On<br />
RewriteBase /old-stuff<br />
RewriteRule ^(.*)\.html$ do-redirects.php</code><br />
This code calls /old-stuff/do-redirects.php for each request of an .html file in /old-stuff/. The PHP script: <code><br />
$requestUri = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$location = &quot;&quot;; // stays empty unless one of the redirect conditions matches<br />
if (stristr($requestUri, &quot;/contact.html&quot;)) {<br />
    $location = &quot;http://example.com/new-stuff/contact.htm&quot;;<br />
}<br />
...<br />
if ($location) {<br />
    @header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    @header(&quot;X-Redirect-Src: /old-stuff/do-redirects.php&quot;, TRUE);<br />
    @header(&quot;Location: $location&quot;);<br />
    exit;<br />
}<br />
else {<br />
    [output the requested file or whatever]<br />
}</code><br />
(This is also an example of a redirect include file which you could insert at the top of a header.php include or so. In fact, you can include this script in some files <em>and</em> call it from .htaccess without modifications.) This method will not work with ASP on IIS because amateurish wannabe Web servers don&#8217;t provide the REQUEST_URI variable.</li>
<li><b>Documentation</b>: When you design or update an information architecture, your documentation should contain a redirect chapter. Also comment all redirects in the source code (your ingenious regular expressions might lack readability when someone else looks at your code). It&#8217;s a good idea to have a documentation file explaining all redirects on the Web server (you might work with other developers when you change your site&#8217;s underlying technology in a few years).</li>
<li><b>Maintenance</b>: Debugging legacy code is a nightmare. And yes, what you write today becomes legacy code in a few years. Thus keep it simple and stupid, implement redirects transparently rather than elegantly, and don&#8217;t forget that you must update your ancient redirects when you revamp a site area which is the target of redirects.</li>
<li><b>Performance</b>: Even when performance is an issue, you can&#8217;t do everything in httpd.conf. When you move a large site and change its URL structure, for example, the redirect logic becomes too complex in most cases. You can&#8217;t do database lookups and stuff like that in server configuration files. However, some redirects, such as server name canonicalization, should be performed there, because they&#8217;re simple and not likely to change. If you can&#8217;t change httpd.conf, .htaccess files are for you. They&#8217;re slower than cached config files but still faster than application scripts.</li>
</ul>
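<p>As a rough sketch of the documentation and performance points above, a commented host name canonicalization in httpd.conf could look like the following. The host names are placeholders, and the sketch assumes a name-based virtual host setup with mod_alias available on your server: <code><br />
# Canonicalization: send every request for the www host to the bare host name<br />
&lt;VirtualHost *:80&gt;<br />
ServerName www.example.com<br />
Redirect permanent / http://example.com/<br />
&lt;/VirtualHost&gt;</code><br />
Because the Redirect directive appends the remainder of the requested path to the target, a request for /page.html on the www host should end up at http://example.com/page.html with a 301 response.</p>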
<h4 id="redirect-server-config">Redirects in server configuration files</h4>
<p>Here is an example of a canonicalization redirect in the root&#8217;s .htaccess file: <code><br />
RewriteEngine On<br />
RewriteCond %{HTTP_HOST} !^sebastians-pamphlets\.com [NC]<br />
RewriteRule (.*) http://sebastians-pamphlets.com/$1 [R=301,L]</code>
<ol>
<li>The first line enables the rewrite engine of Apache&#8217;s mod_rewrite module. Make sure the module is available on your box before you copy, paste and modify the code above.</li>
<li>
<p>The second line checks the server name in the HTTP request header (received from a browser, robot, &#8230;). The &#8220;NC&#8221; parameter ensures that the test of the server name (which is, like the scheme part of the URI, not case sensitive by <a href="http://www.ietf.org/rfc/rfc2396.txt">definition</a>) is done as intended. Without this parameter a request for http://SEBASTIANS-PAMPHLETS.COM/ would result in an unnecessary redirect. The rewrite condition returns TRUE when the server name is <b>not</b> sebastians-pamphlets.com. There&#8217;