<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; Redirects</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Wed, 11 Aug 2010 18:57:05 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>OMFG - Google sends porn punters to my website &#8230;</title>
		<link>http://sebastians-pamphlets.com/make-risk-free-beer-money-from-porn-traffic/</link>
		<comments>http://sebastians-pamphlets.com/make-risk-free-beer-money-from-porn-traffic/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 18:11:48 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/make-risk-free-beer-money-from-porn-traffic/</guid>
		<description><![CDATA[
In today&#8217;s GWC doctor&#8217;s office, the webmaster of an innocent orphanage website asks Google&#8217;s Matt Cutts:
[My site] is showing up for searches on &#8216;girls in bathrooms&#8217; because they have an article about renovating the girls bathroom! What do you think of the idea if a negative keyword meta tag to block irrelevant searches? [sic!]
Well, we [...]]]></description>
			<content:encoded><![CDATA[
<p>In today&#8217;s GWC doctor&#8217;s office, the webmaster of an innocent orphanage website asks Google&#8217;s Matt Cutts:</p>
<blockquote><p>[My site] is showing up for searches on &#8216;girls in bathrooms&#8217; because they have an article about renovating the girls bathroom! What do you think of the idea if a negative keyword meta tag to block irrelevant searches? [sic!]</p></blockquote>
<p><b>Well, we don&#8217;t know what the friendly guy from Google recommends &#8230;</b></p>
<p><object style="display:inline; height: 100px; width: 133px;" onmouseover="this.height='344px'; this.width='425';" onmouseout="this.height='100px;'; this.width='133px;';">
<param name="movie" value="http://www.youtube.com/v/mvYZa3NZ1HE">
<param name="allowFullScreen" value="true">
<param name="allowScriptAccess" value="always"><embed src="http://www.youtube.com/v/mvYZa3NZ1HE" type="application/x-shockwave-flash" allowfullscreen="true" allowScriptAccess="always" ></object><img src="http://sebastians-pamphlets.com/img/posts/omfg-women.png" style="margin-left:50px; margin-bottom:25px;" /></p>
<p><b>&#8230; but my dear readers do know that my bullshit detector, faced with such a moronic idea, shouts out in agony:</b></p>
<h3>There&#8217;s no such thing as bad traffic, just weak monetizing!</h3>
<p>Ok, Ok, Ok &#8230; every now and then each and every webmaster out there suffers from misguided search engine ranking algos that send shitloads of totally unrelated search traffic. For example, when you search for [<a href="http://google.com/search?q=how+to+fuck+a+click&#038;safe=off">how to fuck a click</a>], you won&#8217;t expect that Google considers <a href="http://sebastians-pamphlets.com/how-to-turn-click-tracking-into-miserable-failure/">this geeky pamphlet</a> the very best search result. Of course Google should&#8217;ve detected your <a href="http://google.com/search?q=how+to+fuck+a+chick&#038;safe=off">NSFW-typo</a>. Shit happens. Deal with it.</p>
<p>On the other hand, search traffic is free, so there&#8217;s no valid reason to complain. Instead of asking Google for a minus-keyword REP directive, one should think of clever ways to monetize unrelated traffic without wasting bandwidth.</p>
<p>You want to monetize irrelevant traffic from searches for smut in a way that nobody can associate your site with porn. That&#8217;s doable. Here&#8217;s how it works:</p>
<h3>Make risk-free beer money from porn traffic with a non-adult site</h3>
<p>Copy those slimy phrases from your keyword stats and paste them into Google&#8217;s search box. Once you find an adult site that seems to match the smut surfer&#8217;s needs better than your site, click on the search result, and on the landing page search for a &#8220;webmasters&#8221; link that points to their affiliate program. Sign up and save your customized affiliate link.</p>
<p>Next add some PHP code to your scripts. Make absolutely sure it gets executed before you output any other content, even whitespace:</p>
<p><code>&lt;?php </code> &nbsp;<a onclick="showContent('code_getOffsiteUri');">Show all code</a></p>
<p id="code_getOffsiteUri" style="display:none;"><code>function getReferrer () {<br />
    return isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";<br />
}<br />
function getOffsiteUri() {<br />
    $searchQuery = stristr(getReferrer(), "q=");<br />
    $trash = stristr($searchQuery, "&#038;");<br />
    $searchQuery = str_replace($trash, "", $searchQuery);<br />
    $searchQuery = str_replace("+", " ", $searchQuery);<br />
    $searchQuery = str_replace("&#038;", " ", $searchQuery);<br />
    $searchQuery = str_replace("%20", " ", $searchQuery);<br />
    while (stristr($searchQuery, "  ")) {<br />
        $searchQuery = str_replace("  ", " ", $searchQuery);<br />
    }<br />
    // map irrelevant search queries to sponsor URIs<br />
    if (stristr($searchQuery, "teens in bathroom")) {<br />
        return "http://someteenpornsite.com/landingpage?affID=4711";<br />
    }<br />
}</code></p>
<p><code>$betterMatch = getOffsiteUri();<br />
if ($betterMatch) {<br />
   header("HTTP/1.1 307 Here's your smut", TRUE, 307);<br />
   header("Location: $betterMatch");<br />
   exit;<br />
}<br />
?&gt;</code> Refine the simplified code above. Use a database table to store the mappings &#8230;</p>
<p>Now a surfer coming from a SERP like <code><br />http://google.com/search?num=100&#038;q=nude+teens+in+bathroom&#038;safe=off</code> <br />will get redirected to <code><br />http://someteenpornsite.com/landingpage?affID=4711</code><br /> You&#8217;re using a 307 redirect because it&#8217;s not cached by a user agent, so that when you later on find a porn site that converts your traffic better, you can redirect visitors to another URI.</p>
<p>As you probably know, search engines don&#8217;t approve of duplicate content. Hence it wouldn&#8217;t be a bright idea to put x-rated stuff (all smut is duplicate content by design) onto your site to fulfil the misled searcher&#8217;s needs.</p>
<p>Of course you can use the technique outlined above to protect searchers from landing on your contact/privacy page, too, when in fact your signup page is their desired destination.</p>
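<p>For the harmless contact-page case, here&#8217;s a minimal sketch of the lookup in Python instead of PHP. It only decides where to send the visitor; sending the actual 307 is left to whatever serves your pages. The phrase table, the paths and both function names are made up for illustration, so treat it as a sketch under those assumptions, not as the code running on this site:</p>

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping of search phrases to the page the searcher really wants.
BETTER_MATCHES = {
    "sign up": "/signup/",
    "register": "/signup/",
}

# Hypothetical pages that should not be a searcher's landing page.
WRONG_LANDING_PAGES = ("/contact/", "/privacy/")


def search_query(referrer):
    """Return the normalized q parameter of a referrer URI, or an empty string."""
    params = parse_qs(urlsplit(referrer or "").query)
    terms = params.get("q", [""])[0]
    # parse_qs already decodes "+" and "%20"; collapse repeated whitespace.
    return " ".join(terms.lower().split())


def better_match(referrer, landing_path):
    """Return a redirect target, or None when the visitor should stay put."""
    if landing_path not in WRONG_LANDING_PAGES:
        return None
    query = search_query(referrer)
    for phrase, target in BETTER_MATCHES.items():
        if phrase in query:
            return target
    return None
```

<p>Letting <code>urllib.parse</code> pick the <code>q</code> parameter out of the query string avoids the manual string surgery, and a missing referrer simply yields no redirect.</p>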
<h3>Shiny whitehat disclaimer</h3>
<p>If you&#8217;re afraid of the possibility that the almighty Google might punish you for your well-meant attempt to fix its bugs, relax.</p>
<p>A search engine that misinterprets your content so badly has failed miserably. Your bugfix actually improves their search quality. Search engines can&#8217;t force you to report such flaws; they just kindly ask for voluntary feedback.</p>
<p>If search engines dislike smart websites that find related content on the Interwebs in case the search engine delivers shitty search results, they can act themselves. Instead of penalizing webmasters that react to flaws in their algos, they&#8217;re well advised to adjust their scoring. I mean, if they stop sending smut traffic to non-porn sites, their users don&#8217;t get redirected any longer. It&#8217;s that simple.</p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span><div class="topsy_widget_data topsy_theme_light-green" style="float: right;margin-left: 0.75em;"><!-- { "url": "http://sebastians-pamphlets.com/make-risk-free-beer-money-from-porn-traffic/", "style": "big", "title": "OMFG - Google sends porn punters to my website ..." } --></div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/make-risk-free-beer-money-from-porn-traffic/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google went belly-up: SERPs sneakily redirect to FPAs</title>
		<link>http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/</link>
		<comments>http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/#comments</comments>
		<pubDate>Wed, 12 May 2010 17:06:19 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Spam Report]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/</guid>
		<description><![CDATA[
I&#8217;m pissed. I do know I shouldn&#8217;t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.
What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google&#8217;s webmaster guidelines [...]]]></description>
			<content:encoded><![CDATA[
<p>I&#8217;m pissed. I do know I shouldn&#8217;t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.</p>
<p>What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Here you go</a>. Google&#8217;s webmaster guidelines explicitly forbid such black hat tactics: </p>
<p>&#8220;<strong>Don&#8217;t use cloaking or sneaky redirects.</strong>&#8221; Google just did the latter with its very own <a href="http://google.com/ie?q=buy+viagra+online">SERPs</a>. The search interface <a href="http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/#goog-ie-ui">google.com/ie</a>, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That&#8217;s a helpful redirect for some IE6 users who don&#8217;t suffer from an IT department stuck with this outdated browser, but it&#8217;s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, <abbr title="User Agent">UA</abbr> cloaking is the only way to heal this sneaky behavior.</p>
<p>&#8220;<strong>Don&#8217;t create pages with malicious behavior.</strong>&#8221; Google&#8217;s guilty, too. Instead of checking for the user&#8217;s browser, redirecting only IE6 requests from <a href="http://www.google.com/search?output=ie&#038;num=100&#038;hl=en&#038;safe=off&#038;q=google+discontinues+IE6+support">Google&#8217;s discontinued IE6 support</a> (IE6 toolbar &#8230;) to the IE8 advertisement, whilst all other user agents get their desired search box, respectively their SERPs, under a google.com/search?output=ie&amp;&#8230; URI, Google performs an unconditional redirect to a page that&#8217;s utterly useless and also totally unexpected for many searchers. I consider misleading redirects malicious.</p>
<p>&#8220;<strong>Avoid links to web spammers or &#8216;bad neighborhoods&#8217; on the web.</strong>&#8221; I consider the propaganda for IE that Google displays instead of the search results I&#8217;d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous workarounds. (Ok, ok, ok &#8230; Google&#8217;s lack of geekiness doesn&#8217;t exactly count as violation of their webmaster guidelines, but it sounds good, doesn&#8217;t it?)</p>
<p><a href="http://twitter.com/home?status=Hey+@MattCutts,+about+time+to+ban+google.com/ie?q=spam!+http%3A%2F%2Fsebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/" target="twitter" title="Tweet That!"><strong>Hey Matt Cutts, about time to ban google.com/ie!</strong> <img src="http://sebastians-pamphlets.com/img/twitter-icon.gif" width="10" height="10" style="border:none;" alt="Click to tweet that"  /></a></p>
<p id="goog-ie-ui"><a href="http://sebastians-pamphlets.com/rediscover-googles-free-ranking-checker/">Google&#8217;s very best search interface</a> is history. Here is what you got under<code><br />
<b>http://www.google.com/ie?num=100&#038;hl=en&#038;safe=off&#038;q=minimalistic</b></code>:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/google-awesome-ie-serp.png" width="448" height="503" style="text-align:center; display:block;" align="middle" alt="Google's famous minimalistic search UI" title="Google's famous minimalistic search UI" /></p>
<p>And here is where Google sneakily redirects you to when you load the SERP link above (even with Chrome!):<code><br />
<b>http://www.google.com/toolbar/ie8/sidebar.html</b></code>:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/google-fpa-ie8.png" width="268" height="569" style="border:dotted red 1px; text-align:center; display:block;" align="middle" alt="Google's sneaky IE8 propaganda" title="Google's sneaky IE8 propaganda" /></p>
<p id="goog-ie-spam-report">It&#8217;s sad that a browser vendor like Google (and yes, Google Chrome <b>is</b> my favorite browser) feels the need to mislead its users with propaganda for a competing browser that&#8217;s slower and doesn&#8217;t render everything as it should render it. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that&#8217;s a scandal. So, if you agree, please submit a spam report to Google:</p>
<p><a href="http://twitter.com/home?status=Hey+@MattCutts,+about+time+to+ban+google.com/ie! %23spam-report+http%3A%2F%2Fsebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/" target="twitter" title="Tweet Your Spam Report!"><strong>Hey Matt Cutts, about time to ban google.com/ie! #spam-report</strong> <img src="http://sebastians-pamphlets.com/img/twitter-icon.gif" width="10" height="10" style="border:none;" alt="Tweet Your Spam Report"  /></a></p>
<p>2010-05-17 I&#8217;ve updated this pamphlet because it didn&#8217;t explain the &#8220;sneakiness&#8221; clearly enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and only stubborn IE6 users ads for a somewhat better browser.</p>
<p>2010-05-18 <b>Q:</b> You&#8217;re pissed solely because your SERP scraping scripts broke. <b>A:</b> Glad you&#8217;ve asked. Yes, I&#8217;ve <a href="http://www.scroogle.org/cgi-bin/scraper.htm" rel="crap nofollow">scraped Google&#8217;s /ie search</a> too. Not because I&#8217;m a <a href="http://www.google-watch.org/" rel="crap nofollow">privacy nazi</a> like Daniel Brandt. I&#8217;ve just checked (my) rankings. However, when I spotted the redirects I didn&#8217;t even remember the location of the scripts that scraped this service, because I didn&#8217;t look at ranking reports for years. I&#8217;m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold &#8220;no&#8221;. I don&#8217;t give a fucking dead rat&#8217;s ass what ranking reports based on scraped SERPs could tell.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How brain-amputated developers created the social media plague</title>
		<link>http://sebastians-pamphlets.com/social-media-plague/</link>
		<comments>http://sebastians-pamphlets.com/social-media-plague/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 17:36:50 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Social Web]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[URI shortening]]></category>

		<category><![CDATA[Trolling]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Twitter]]></category>

		<category><![CDATA[robots.txt]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/social-media-plague/</guid>
		<description><![CDATA[
The bot playground commonly referred to as &#8220;social media&#8221; is responsible for shitloads of absurd cretinism.
For example Twitter, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate 27% valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand [...]]]></description>
			<content:encoded><![CDATA[
<p>The bot playground commonly referred to as &#8220;social media&#8221; is responsible for shitloads of absurd cretinism.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/sm-plague-bot-playground.png" width="195" height="246" align="right" style="margin-left:5px;" alt="Twitter Bot Playground" title=""  />For example <a href="https://twitter.com/SebastianX">Twitter</a>, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate <a href="http://holykaw.alltop.com/only-27-of-tweets-contain-value-says-new-stud">27%</a> valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand for cheap viagra).</p>
<p>Bazillions of other bots [type C] retweet bot [type B] generated crap and create lists of bots [type A, B, C]. In rare cases when a non-bot tries to participate in Twitter, the uber-bot [type T] prevents the whole bot network from negative impacts by serving a 503 error to the homunculus&#8217; browser.</p>
<p>This pamphlet is about the idiocy of a particular subclass of bots [type S] that sneakily work in the underground stealing money from content producers, and about their criminal (though brain-dead) creators. May they catch the swine flu, or at least pox or cholera, for the pest they&#8217;ve brought to us.</p>
<h3 id="sm-plague-the-pest">The Twitter pest that costs you hard earned money</h3>
<p>WTF am I ranting about? The technically savvy reader, familiar with my attitude, has already figured out that I&#8217;ve read way too many raw logs. For the sake of a common denominator, I encourage you to perform a tiny real-world experiment:</p>
<ul>
<li>Publish a great and linkworthy piece of content.</li>
<li>Tweet its URI (not shortened - message incl. URI &le; 139 characters!) with a compelling call for action.</li>
<li>Watch your server logs.</li>
<li>Puke. Vomit increases with every retweet.</li>
</ul>
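<p>If you&#8217;d rather count than eyeball the logs, a rough sketch like the following tallies requests for the tweeted path per user agent and HTTP method. It assumes your server writes the common &#8220;combined&#8221; log format; the function name <code>hits_per_agent</code> is mine, and lines that don&#8217;t match the pattern are silently skipped:</p>

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "METHOD path protocol" status size "referrer" "user agent"
# Groups: 1 = ip, 2 = method, 3 = path, 4 = user agent
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def hits_per_agent(lines, path):
    """Count requests for one path, keyed by (user agent, HTTP method)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(3) == path:
            counts[(match.group(4), match.group(2))] += 1
    return counts
```

<p>Feed it an open log file and the path of your freshly tweeted article, then look at <code>counts.most_common(20)</code>. Anything with a double-digit count minutes after a single tweet is a candidate for the lists further down.</p>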
<p>So what happens on your server? A greedy horde of bots pounces on every tweet containing a link, requesting its content. That&#8217;s because on Twitter all URIs are suspected to be shortened (<a href="http://tag.us.com/uri-shorteners-suck-ass.htm#twitter-crap">learn <i>why</i> Twitter makes you eat shit</a>). This uncalled-for &#8211;IOW abusive&#8211; bot traffic burns your resources, and (with a cheap hosting plan) it can keep your followers from reading your awesome article and prevent them from clicking on your carefully selected ads.</p>
<p>Those crappy bots not only cost you money because they keep your server busy and increase your bandwidth bill, they actively decrease your advertising revenue because your visitors hit the back button when your page isn&#8217;t responsive due to the heavy bot traffic. Even if you&#8217;ve great hosting, you probably don&#8217;t want to burn money, not even pennies, right? </p>
<h3 id="sm-plague-mo">Bogus Twitter apps and their modus operandi</h3>
<p>If only every Twitter&amp;Crap-mashup would look up each URI once, that wouldn&#8217;t be such a mess. Actually, some of these crappy bots request your stuff 10+ times per tweet, and again for each and every retweet. That means the more popular your content becomes, the more bot traffic it attracts.</p>
<p>Most of these bots don&#8217;t obey <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, which means you can&#8217;t even block them applying Web standards (<a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-rogue-bot">learn how to block rogue bots</a>). <a href="http://labs.topsy.com/butterfly/">Topsy</a>, for example, does respect the content producer, so morons using &#8220;Python-urllib/1.17&#8221; or &#8220;AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)&#8221; could obey the <a href="http://www.robotstxt.org/">Robots Exclusion Protocol</a> (REP), too. Their developers are just too fucking lazy to understand such protocols that every respected service on the Web (search engines&#8230;) obeys.</p>
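<p>To show how little effort obeying the REP takes, here&#8217;s a sketch using the robots.txt parser that ships with Python&#8217;s standard library. The helper name <code>allowed</code> is my own, and the sketch assumes the bot has already downloaded the site&#8217;s robots.txt once and kept its text around:</p>

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, uri):
    """Return True if robots_txt permits user_agent to request uri."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, uri)
```

<p>A bot that calls something like this before every request, and caches robots.txt per host, is already better behaved than most of the crap listed in this pamphlet.</p>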
<p>Some of these bots even provide an HTTP_REFERER to lure you into viewing the website operated by their shithead of a developer when you&#8217;re viewing your referrer stats. Others fake Web browsers in their user agent string, just in case you&#8217;re not smart enough to smell shit that really stinks (IOW browser-like requests that don&#8217;t fetch images, CSS files, and so on).</p>
<p>One of the worst offenders is outing itself as &#8220;ThingFetcher&#8221; in the user agent string. It&#8217;s hosted by Rackspace, which is a hosting service that obviously doesn&#8217;t care much about its reputation. Otherwise these guys would have reacted to my <a href="http://friendfeed.com/sebastianx/81c0792e/rackspace-you-host-abusive-bot-thingfetcher">various</a> <a href="http://friendfeed.com/sebastianx/80a3165c/owner-of-thingfetcher-should-stand-up-now-i-m" title="A Rackspace employee did read this tweet">complaints</a> WRT &#8220;ThingFetcher&#8221;. By the way, <a href="http://scobleizer.com/">Robert Scoble</a> represents Rackspace, you could <a href="http://twitter.com/scobelizer">drop him a line</a> if ThingFetcher annoys you, too.</p>
<p>ThingFetcher sometimes requests a (shortened) URI 30 times per second, from different IPs. It can get worse when a URI gets retweeted often. This malicious piece of code doesn&#8217;t obey robots.txt, and doesn&#8217;t cache results. Also, it&#8217;s too dumb to follow chained redirects, by the way. It doesn&#8217;t even publish its results anywhere, at least I couldn&#8217;t find the fancy URIs I&#8217;ve fed it in Google&#8217;s search index.</p>
<p>In ThingFetcher&#8217;s defense, its developer might say that it performs only <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.4" title="The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response.">HEAD requests</a>. Well, it&#8217;s true that HEAD requests provoke only an HTTP response header. But: the script invoked gets completely processed, just the output is trashed.</p>
<p>That means the Web server has to deal with the same load as with a <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.3" title="The GET method means retrieve whatever information (...) is identified by the Request-URI.">GET request</a>; it just deletes the content portion (the completely formatted HTML page) when responding, after counting its size to send the <code>Content-Length</code> response header. Do you really believe that I don&#8217;t care about machine time? For each of your <del>utterly useless</del> <ins>bogus</ins> requests I could have my server deliver ads to a human visitor, who pulls the plastic if I&#8217;m upselling the right way (I do, usually).</p>
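<p>A toy model of that behavior, in case the point needs hammering home. This is a deliberately simplified sketch, not how any particular Web server is implemented; <code>respond</code> and <code>render_page</code> are hypothetical names, with <code>render_page</code> standing in for the script that builds the page:</p>

```python
def respond(method, render_page):
    """Build (headers, body) the way a naive dynamic page handler does."""
    # The expensive part runs for HEAD as well as for GET.
    page = render_page()
    headers = {
        "Content-Type": "text/html",
        "Content-Length": str(len(page)),
    }
    # For HEAD the finished page is only measured, then thrown away.
    body = page if method == "GET" else b""
    return headers, body
```

<p>Both methods trigger exactly one call to <code>render_page</code> and produce identical headers. The only thing a HEAD request saves is the bytes on the wire.</p>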
<p>Unfortunately, ThingFetcher is not the only bot that does a lookup for each URI embedded in a tweet, per tweet processed. Probably the overall number of URIs that appear only once is bigger than the number of URIs that appear quite often while a retweet campaign lasts. That means that doing HTTP requests is cheaper for the bot&#8217;s owner, but on the other hand that&#8217;s way more expensive for the content producer, and the URI shortening services involved as well.</p>
<p style="margin-left:15px;" id="sm-plague-upd-thingfetcher"><strong>ThingFetcher update:</strong> The owners of ThingFetcher are now aware of the problem, and will try to fix it asap (<a href="http://friendfeed.com/sebastianx/5e00fc9e/shellen-cw-let-move-this-discussion-to">more information</a>). Now that I know who&#8217;s operating the Twitter app owning ThingFetcher, <del>I take back the insults above</del> <ins>I&#8217;ve removed some insults from above, because they&#8217;d no longer address an anonymous developer, but bright folks who&#8217;ve just failed once</ins>. Too sad that <a href="http://brizzly.com/">Brizzly</a> didn&#8217;t reply earlier to my <a href="http://friendfeed.com/search?q=ThingFetcher">attempts</a> to identify ThingFetcher&#8217;s owner.</p>
<p>As a content producer I don&#8217;t care about the costs of any Twitter application that processes Tweets to deliver anything to its users. I care about my costs, and I can perfectly live without such a crappy service. Liberally, I can allow one single access per (shortened) URI to figure out its final destination, but I can&#8217;t tolerate such thoughtless abuse of my resources.</p>
<p>Every Twitter related &#8220;service&#8221; that does multiple requests per (shortened) URI embedded in a tweet is guilty of theft and pilferage. Actually, that&#8217;s an understatement, because these raids cost publishers an enormous sum across the Web. </p>
<p>These fancy apps shall maintain a database table storing the destination of each redirect (chain) accessible by its short URI. Or leave the Web, respectively pay the publishers. And by the way, Twitter should finally end URI shortening. Not only does it break the Internet, it&#8217;s way too expensive for all of us.</p>
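<p>Such a table is no rocket science. Here&#8217;s a minimal sketch with SQLite, assuming the app already has some function that follows a short URI to its final destination. That function is passed in as <code>resolve</code>; the class, table and column names are invented for this example:</p>

```python
import sqlite3


class DestinationCache:
    """Remember the final destination of each short URI after one lookup."""

    def __init__(self, resolve):
        # resolve: callable(short_uri) returning the final URI; the only
        # place where a real HTTP request would happen.
        self.resolve = resolve
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE destinations ("
            "short_uri TEXT PRIMARY KEY, final_uri TEXT NOT NULL)"
        )

    def final_uri(self, short_uri):
        row = self.db.execute(
            "SELECT final_uri FROM destinations WHERE short_uri = ?",
            (short_uri,),
        ).fetchone()
        if row:
            return row[0]  # served from the table, publisher not bothered
        final = self.resolve(short_uri)
        self.db.execute(
            "INSERT INTO destinations (short_uri, final_uri) VALUES (?, ?)",
            (short_uri, final),
        )
        self.db.commit()
        return final
```

<p>With this in place a URI that gets retweeted a thousand times costs the publisher one request instead of a thousand. A real service would use a persistent database and expire old rows now and then, which the sketch leaves out.</p>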
<h3 id="sm-plague-offenders">A few more bots that need a revamp, or at least minor tweaks</h3>
<p>I&#8217;ve added this section to express that besides my prominent <a href="http://sebastians-pamphlets.com/social-media-plague/#sm-plague-upd-thingfetcher">example</a> above, there&#8217;s more than one Twitter related app running not exactly squeaky clean bots. That&#8217;s not a &#8220;worst offenders&#8221; list, it&#8217;s not complete (I don&#8217;t want to reprint <a href="http://twitter.com/downloads">Twitter&#8217;s yellow pages</a>), and bots are listed in no particular order (compiled from requests following the link in a <a href="http://twitter.com/SebastianX/status/7784746783">test tweet</a>, evaluating only a snapshot of less than 5 minutes, backed by historized logs.)</p>
<p><small><a href="http://sebastians-pamphlets.com/social-media-plague/#sm-plague-howto">Skip examples</a></small></p>
<p id="sm-plague-tweetmeme"><a href="http://tweetmeme.com/about" rel="nofollow">Tweetmeme</a>&#8217;s <b>TweetmemeBot</b> coming from eagle.favsys.net doesn&#8217;t fetch robots.txt. On their site they don&#8217;t explain why they don&#8217;t respect the robots exclusion protocol (REP). Apart from that it behaves.</p>
<p  id="sm-plague-oneriot"><a href="http://www.oneriot.com/company/about" rel=nofollow>OneRiot&#8217;s</a> bot <b>OneRiot/1.0</b> totally proves that this real time search engine has chosen a great name for itself. Performing 5+ GET as well as HEAD requests per link in a tweet (sometimes more) certainly counts as rioting. Requests for content come from different IPs, the host name pattern is <code>flx1-ppp*.lvdi.net</code>, e.g. flx1-ppp47.lvdi.net. From the same IPs comes another bot:  <b>Me.dium/1.0</b>, me.dium.com redirects to oneriot.com.  OneRiot doesn&#8217;t respect the REP.</p>
<p id="sm-plague-bing"><b>Microsoft/Bing</b> runs abusive bots following links in tweets, too. They fake browsers in the user agent, make use of IPs that don&#8217;t obviously point to Microsoft (no host name, e.g. 65.52.19.122, 70.37.70.228 &#8230;), send multiple GET requests per processed tweet, and don&#8217;t respect the REP. If you need more information, I&#8217;ve ranted about <a href="http://sebastians-pamphlets.com/links/categories/?cat=msn">deceptive M$-bots</a> before. Just a remark in case you&#8217;re going to block abusive MSN bot traffic:</p>
<p>MSN/Bing reps ask you not to block their spam bots when you&#8217;d like to stay included in their search index (that goes for real time search, too), but who really wants that? Their search index is tiny &#8211;compared to other search engines like Yahoo and Google&#8211;, their discovery crawling <a href="http://blogs.perl.org/users/cpan_testers/2010/01/msnbot-must-die.html">sucks</a> &#8211;to get indexed you need to submit your URIs at their <a href="http://tag.us.com/_submit2bing">webmaster forum</a>&#8211;, and in most niches you can count your yearly Bing SERP referrers using not even all fingers of your right hand. If your stats show more than that, check your raw logs. You&#8217;ll soon figure out that MSN/Bing spam bots fake SERP traffic in the HTTP_REFERER (guess where their &#8220;impressive&#8221; market share comes from).</p>
<p id="sm-plague-friendfeed"><a href="http://friendfeed.com/">FriendFeed</a>&#8217;s bot <b>FriendFeedBot/0.1</b> is well explained, and behaves. Its <a href="http://friendfeed.com/about/bot">bot page</a> even lists all its IPs, and provides you with an email addy for complaints (I never had a reason to use it). The FriendFeedBot made it on this list just because of its lack of REP support.</p>
<p id="sm-plague-postrank"><a href="http://postrank.com/" rel="nofollow">PostRank</a>&#8217;s bot <b>PostRank/2.0</b> comes from Amazon IPs. It doesn&#8217;t respect the REP, and does more than one request per URI found in one single tweet.</p>
<p id="sm-plague-markmonitor"><a href="http://markmonitor.com/" rel="nofollow">MarkMonitor</a> operates a bot faking browser requests, coming from *.embarqhsd.net (va-71-53-201-211.dhcp.embarqhsd.net, va-67-233-115-66.dhcp.embarqhsd.net, &#8230;). Multiple requests per URI, no REP support.</p>
<p id="sm-plague-cuil"><a href="http://www.cuil.com/" rel="nofollow">Cuil</a>&#8217;s bot provides an empty user agent name when following links in tweets, but fetches robots.txt like Cuil&#8217;s official crawler <b>Twiceler</b>. I didn&#8217;t bother to test whether this Twitter bot can be blocked following <a href="http://www.cuil.com/info/webmaster_info/">Cuil&#8217;s instructions for webmasters</a> or not. It got included in this list for the suppressed user agent.</p>
<p id="sm-plague-twingly"><a href="http://www.twingly.com/" rel="nofollow">Twingly</a>&#8217;s bot  <b>Twingly Recon</b> coming from *.serverhotell.net doesn&#8217;t respect the REP, doesn&#8217;t name its owner, but does only a few HEAD requests. </p>
<p id="sm-plague-anon-bots">Many bots mimicking browsers come from Amazon, Rackspace, and other cloudy environments, so you can&#8217;t get hold of their owners without submitting a report-abuse form. You can identify such bots by sorting your access logs by IP addy. Those &#8220;browsers&#8221; which don&#8217;t request your images, CSS files, and so on, are most certainly bots. Of course, a human visitor having cached your images and CSS matches this pattern, too. So block only IPs that solely request your HTML output over a longer period of time (problematic with bots using DSL providers, AOL, &#8230;).</p>
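<p>Here&#8217;s a rough, untested sketch of that log analysis. It assumes Apache&#8217;s combined log format; the function name <code>htmlOnlyIps()</code>, the list of asset extensions, and the <code>$minRequests</code> threshold are made up for illustration, so tweak them to fit your logs:</p>
<p><code>function htmlOnlyIps($logLines, $minRequests = 20) {<br />
    $pages  = array(); /* page requests per IP, assets excluded */<br />
    $assets = array(); /* IPs that fetched at least one image, style sheet, or script */<br />
    foreach ($logLines as $line) {<br />
        /* combined log format: IP ident user [time] "METHOD /path HTTP/x.y" ... */<br />
        if (!preg_match('/^(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+)/', $line, $m)) {<br />
            continue;<br />
        }<br />
        $ip   = $m[1];<br />
        $path = (string) parse_url($m[2], PHP_URL_PATH);<br />
        if (preg_match('/\.(gif|jpe?g|png|ico|css|js)$/i', $path)) {<br />
            $assets[$ip] = TRUE;<br />
        } else {<br />
            $pages[$ip] = isset($pages[$ip]) ? $pages[$ip] + 1 : 1;<br />
        }<br />
    }<br />
    $suspects = array();<br />
    foreach ($pages as $ip => $count) {<br />
        if ($count >= $minRequests &amp;&amp; !isset($assets[$ip])) {<br />
            $suspects[] = $ip; /* many pages, not a single asset */<br />
        }<br />
    }<br />
    return $suspects;<br />
}<br />
</code></p>
<p>Feed it something like <code>htmlOnlyIps(file("/path/to/access.log"))</code> (hypothetical path), and review the returned IPs by hand before you block anything.</p>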
<p>Blocking requests (with IPs belonging to consumer ISPs, or from Amazon and other dynamic hosting environments) with a user agent name like &#8220;LWP::Simple/5.808&#8221;, &#8220;PycURL/7.18.2&#8221;, &#8220;my6sense/1.0&#8221;, &#8220;Firefox&#8221; (just these 7 characters), &#8220;Java/1.6.0_16&#8221; or &#8220;libwww-perl/5.816&#8221; is sound advice. By the way, these requests sum up to an amount that would lead a &#8220;worst offenders&#8221; listing.</p>
<p id="sm-plague-edu">Then there are students doing research. I&#8217;m not sure I want to waste my resources on requests from Moscow&#8217;s &#8220;Institute for System Programming RAS&#8221;, which fakes unnecessary loads of human traffic (from efrate.ispras.ru, narva.ispras.ru, dvina.ispras.ru &#8230;), for example.</p>
<p id="sm-plague-nasty-bots">When you analyze bot traffic following a tweet with many retweets, you&#8217;ll gather a way longer list of misbehaving bots. That&#8217;s because you&#8217;ll catch more 3rd party Twitter UIs when many Twitter users view their timeline. Not all Twitter apps route their short URI evaluation through their servers, so you might miss out on abusive requests coming from real users via client-side scripts.</p>
<p id="sm-plague-bot-devs">Developers might argue that such requests &#8220;on behalf of the user&#8221; are neither abusive, nor count as bot traffic. I assure you, that&#8217;s crap, regardless of a particular Twitter app&#8217;s architecture, when you count more than one evaluation request per (shortened) URI. For example Googlebot acts on behalf of search engine users too, but it doesn&#8217;t overload your server. It fetches each URI embedded in tweets only once. And yes, it processes all tweets out there.</p>
<h3 id="sm-plague-howto">How to do it the right way</h3>
<p>Here is what a site owner can expect from a Twitter app&#8217;s Web robot:</p>
<h4 id="sm-plague-howto-ua">A meaningful user agent</h4>
<p>A Web robot must provide a user agent name that fulfills at least these requirements:</p>
<ul>
<li>A unique string that identifies the bot. The unique part of this string must not change when the version changes (&#8220;somebot/1.0&#8221;, &#8220;somebot/2.0&#8221;, &#8230;).</li>
<li>A URI pointing to a page that explains what the bot is all about, names the owner, and tells how it can be blocked in robots.txt (like <a href="http://labs.topsy.com/butterfly/">this</a> or <a href="http://www.alexa.com/help/webmasters">that</a>).</li>
<li>A hint on the rendering engine used, for example &#8220;Mozilla/5.0 (compatible; &#8230;&#8221;.</li>
</ul>
<h4 id="sm-plague-howto-revip">A method to verify the bot</h4>
<p>All IP addresses used by a bot should resolve to server names having a unique pattern. For example Googlebot comes only from servers named <code>"crawl" + "-" + replace($IP, ".", "-") + ".googlebot.com"</code>, e.g. &#8220;crawl-66-249-71-135.googlebot.com&#8221;. All major search engines follow this standard that enables crawler detection not solely relying on the easily spoofable user agent name. </p>
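<p>A minimal sketch of such a verification (reverse lookup, host name check, confirming forward lookup). The function name <code>isVerifiedCrawler()</code> and its parameters are my invention for this example; the two resolver arguments default to PHP&#8217;s own DNS functions and exist only so you can plug in stubs when testing:</p>
<p><code>function isVerifiedCrawler($ip, $hostSuffix, $reverse = "gethostbyaddr", $forward = "gethostbyname") {<br />
    $host = call_user_func($reverse, $ip);<br />
    if ($host === FALSE || $host === $ip) {<br />
        return FALSE; /* no reverse DNS entry */<br />
    }<br />
    $host = strtolower($host);<br />
    if (substr($host, -strlen($hostSuffix)) !== $hostSuffix) {<br />
        return FALSE; /* host name doesn't END with the crawler's domain */<br />
    }<br />
    /* the forward lookup must lead back to the requesting IP */<br />
    return call_user_func($forward, $host) === $ip;<br />
}<br />
</code></p>
<p>Call it like <code>isVerifiedCrawler($_SERVER["REMOTE_ADDR"], ".googlebot.com")</code>. Bear in mind that <code>gethostbyname()</code> returns just one IPv4 address, so this sketch can reject a legit crawler whose host name resolves to several IPs.</p>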
<h4 id="sm-plague-howto-robots-txt">Obeying the robots.txt standard</h4>
<p>Webmasters must be able to steer a bot with crawler directives in <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a> like &#8220;Disallow:&#8221;. A Web robot should fetch a site&#8217;s /robots.txt file before it launches a request for content, when it doesn&#8217;t have a cached version from the same day.</p>
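<p>To illustrate the bare minimum a bot has to do with a fetched robots.txt, here&#8217;s a simplified sketch. It only understands &#8220;User-agent:&#8221; and &#8220;Disallow:&#8221; lines with plain path prefixes (no &#8220;Allow:&#8221;, no wildcards), compares the bot&#8217;s name literally, and leaves the once-a-day caching to you. The function name <code>isDisallowed()</code> is hypothetical:</p>
<p><code>function isDisallowed($robotsTxt, $botName, $path) {<br />
    $rules   = array("*" => array()); /* Disallow prefixes per user agent */<br />
    $agents  = array();<br />
    $inRules = FALSE;<br />
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {<br />
        $line = trim(preg_replace('/#.*$/', '', $line)); /* strip comments */<br />
        if (!preg_match('/^(user-agent|disallow)\s*:\s*(.*)$/i', $line, $m)) {<br />
            continue;<br />
        }<br />
        if (strtolower($m[1]) == "user-agent") {<br />
            if ($inRules) { $agents = array(); $inRules = FALSE; } /* a new group starts */<br />
            $agents[] = strtolower($m[2]);<br />
        } else {<br />
            $inRules = TRUE;<br />
            foreach ($agents as $agent) {<br />
                if (!isset($rules[$agent])) { $rules[$agent] = array(); }<br />
                if ($m[2] !== "") { $rules[$agent][] = $m[2]; } /* empty Disallow: allows everything */<br />
            }<br />
        }<br />
    }<br />
    $bot   = strtolower($botName);<br />
    $group = isset($rules[$bot]) ? $rules[$bot] : $rules["*"];<br />
    foreach ($group as $prefix) {<br />
        if (strpos($path, $prefix) === 0) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    return FALSE;<br />
}<br />
</code></p>
<p>A bot named &#8220;somebot&#8221; would check <code>isDisallowed($robotsTxt, "somebot", "/some/page/")</code> before it requests the page, and stay away when that returns TRUE.</p>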
<h4 id="sm-plague-howto-indexer-directives">Obeying REP indexer directives</h4>
<p>Indexer directives like &#8220;nofollow&#8221;, &#8220;noindex&#8221; et cetera must be obeyed. That goes for HEAD requests just chasing for a 301/302/307 redirect response code and a &#8220;location&#8221; header, too.</p>
<p>Indexer directives can be served in the HTTP response header with an <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tag</a>, and/or in META elements like the <a href="http://sebastians-pamphlets.com/links/categories/?cat=robots-meta-tags">robots meta tag</a>, as well as in LINK elements like <a href="http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">rel=canonical</a> and its <a href="http://sebastians-pamphlets.com/x-canonical-uri-http-header/">corresponding headers</a>.</p>
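<p>For the HTTP header variant, a bot could evaluate the response headers roughly like this before it does anything with the response, even after a HEAD request. It&#8217;s a sketch only: the function name <code>robotsDirectives()</code> is made up, and robots meta tags and LINK elements in the markup aren&#8217;t covered:</p>
<p><code>function robotsDirectives($headerLines, $botName = "") {<br />
    $directives = array();<br />
    foreach ($headerLines as $line) {<br />
        if (!preg_match('/^x-robots-tag\s*:\s*(.+)$/i', trim($line), $m)) {<br />
            continue;<br />
        }<br />
        $value = strtolower($m[1]);<br />
        /* "somebot: noindex" addresses one bot only, skip rules meant for other bots */<br />
        if (preg_match('/^([a-z0-9_-]+)\s*:\s*(.+)$/', $value, $scoped)<br />
                &amp;&amp; $scoped[1] != "unavailable_after") {<br />
            if ($scoped[1] != strtolower($botName)) {<br />
                continue;<br />
            }<br />
            $value = $scoped[2];<br />
        }<br />
        foreach (explode(",", $value) as $directive) {<br />
            $directives[trim($directive)] = TRUE;<br />
        }<br />
    }<br />
    return $directives;<br />
}<br />
</code></p>
<p>PHP&#8217;s <code>get_headers($uri)</code> delivers the header lines as an array. If <code>isset($directives["nofollow"])</code> or <code>isset($directives["noindex"])</code> is true, the bot has to back off.</p>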
<h4 id="sm-plague-howto-behave">Responsible behavior</h4>
<p>As outlined above, requesting the same resources over and over doesn&#8217;t count as responsible behavior. Fetching or &#8220;HEAD&#8217;ing&#8221; a resource no more than once a day should suffice for every Twitter app&#8217;s needs.</p>
<h4 id="sm-plague-howto-copyright">Respecting copyrights</h4>
<p>Reprinting a page&#8217;s content, or just large quotes, doesn&#8217;t count as fair use. It&#8217;s Ok to grab the page title and a summary from a META element like &#8220;description&#8221; (or up to 250 characters from an article&#8217;s first paragraph) to craft links, for example - but not more! Also, showing images or embedding videos from the crawled page violates copyrights.</p>
<h3 id="sm-plague-conclusion">Conclusion, and call for action</h3>
<p>If you suffer from rogue Twitter bot traffic, use the medium those bots live in to make their sins public knowledge. Identify the bogus bot&#8217;s owners and tweet the crap out of them. Look up their hosting services, find the report-abuse form, and submit your complaints. Most of these apps make use of the Twitter API, and there are <a href="http://twitter.zendesk.com/forums/26257/entries/15789">many spam report forms</a> you can creatively use to ruin their reputation at Twitter. If you&#8217;ve an account at such a bogus Twitter app, then cancel it and encourage your friends to follow suit.</p>
<p><strong>Don&#8217;t let the assclowns of the Twitter universe get away with theft!</strong></p>
<p>I&#8217;d like to hear about particular offenders you&#8217;re dealing with, and your defense tactics as well, in the comments. Don&#8217;t be shy. Go rant away. Thanks in advance!</p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/social-media-plague/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How to cleverly integrate your own URI shortener</title>
		<link>http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/</link>
		<comments>http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 14:20:25 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[URI shortening]]></category>

		<category><![CDATA[Usability]]></category>

		<category><![CDATA[404grabber]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Social Web]]></category>

		<category><![CDATA[Site-Search]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/</guid>
		<description><![CDATA[
This pamphlet is somewhat geeky. Don&#8217;t necessarily understand it as a part of my ongoing jihad holy war on URI shorteners.
Assuming you&#8217;re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don&#8217;t want to make use of unholy crap, so you need to roll [...]]]></description>
			<content:encoded><![CDATA[
<p>This pamphlet is somewhat geeky. Don&#8217;t necessarily understand it as a part of my ongoing <a href="http://sebastians-pamphlets.com/put-an-end-to-uri-shortening/"><strike>jihad</strike></a> holy war on URI shorteners.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/clever-404-handler-with-sURI.png"  width="279" height="1008" align="left" style="margin-right:5px;" alt="Clever implementation of an URL shortener" title="How to implement an URL shortener cleverly" />Assuming you&#8217;re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are <a href="http://sebastians-pamphlets.com/links/categories/?cat=s-url">downright evil</a>. You don&#8217;t want to make use of unholy crap, so you need to roll your own. Here&#8217;s how you can (could) integrate a URI shortener into your site&#8217;s architecture.</p>
<p>Please note that my design suggestions aren&#8217;t black and white. Your site&#8217;s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they&#8217;re applicable.</p>
<p>At first sight, <a href="http://www.google.com/search?hl=en&#038;safe=off&#038;num=100&#038;q=free+URL+shortener+script">searching for a free URI shortener script</a> to implement it on a dedicated domain looks like a pretty simple solution. It&#8217;s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that&#8217;s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.</p>
<p>So if you&#8217;re not running <code>cheap-student-loans-with-debt-consolidation-on-each-payday-is-a-must-have-for-sexual-heroes-desperate-for-a-viagra-overdose-and-extreme-penis-length-enhancement.info</code> and your domain&#8217;s name without the &#8220;www&#8221; prefix plus a few characters gives URIs of 20 (30) characters or less, you don&#8217;t need a short domain name to host your shortened URIs.</p>
<p id="twitter-suri-obsolete">As a side note, when you&#8217;re shortening your URIs for <a href="http://tag.us.com/uri-shorteners-suck-ass.htm#twitter-crap">Twitter</a> you should know that <strong>shortened URIs aren&#8217;t mandatory</strong> any more. If your message doesn&#8217;t exceed 139 characters, you don&#8217;t need to shorten embedded URIs.</p>
<p>By integrating a URI shortener into your site architecture you gain the ability to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.</p>
<p>In the following I&#8217;ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you&#8217;ll have to change it to fit your needs. For example when you operate a WordPress blog, you could code nearly everything below in your 404 page (consider <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-dynamic-stuff">alternatives</a>). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.</p>
<h3 id="sUri-impl-toc">Table of contents</h3>
<p>To provide enough context to get the advantages of a fully integrated URI shortener, vs. the stand-alone variant, I&#8217;ll bore you with a ton of dull and totally unrelated stuff:</p>
<ul>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-rogue-bot">Block rogue bots</a></li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-srv-name">Server name canonicalization</a></li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-static-stuff">Deliver static stuff (images &#8230;)</a></li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-dynamic-stuff">Execute script (dynamic URI)</a></li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-short-uri">Resolve shortened URI</a> (<a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-components">Anatomy of a URI shortener</a>)</li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-redirect">Redirect to destination (invalid request)</a></li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-guess-uri">Guess destination (invalid request)</a></li>
<li><a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-error-page">Serve a useful error page</a></li>
</ul>
<h3 id="sUri-code-intro">Introduction</h3>
<p>There&#8217;s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we&#8217;re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:</p>
<p><code>RewriteEngine On</code><br />
<code>RewriteCond  %{SERVER_PORT} ^80$</code><br />
<code>RewriteRule . /requestHandler.php [L]</code></p>
<p>Please note that the code above kinda disables the Web server&#8217;s error handling. If <code>/requestHandler.php</code> exists in the root directory, all <a href="http://sebastians-pamphlets.com/why-proper-error-handling-is-important/">ErrorDocument directives</a> (except some 5xx) et cetera will be ignored. You need to <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-error-page">take care of errors</a> yourself.</p>
<p><b>/requestHandler.php</b> (Warning: untested and simplified code snippets below)<br /><code> /* Initialization */<br />
$serverName          = strtolower($_SERVER["SERVER_NAME"]);<br />
$canonicalServerName = "sebastians-pamphlets.com";<br />
$scheme              = "http://";<br />
$rootUri             = $scheme .$canonicalServerName; /* if used w/o path add a </code><a href="http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/">slash</a><code> */<br />
$rootPath            = $_SERVER["DOCUMENT_ROOT"];<br />
$includePath         = $rootPath ."/src"; /* Customize that, maybe you've to manipulate the file system path to your Web server's root */<br />
$requestIp           = $_SERVER["REMOTE_ADDR"];<br />
$reverseIp           = NULL;<br />
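$requestReferrer     = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : ""; /* not every request carries a referrer */<br />
$requestUserAgent    = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : ""; /* some bots suppress the user agent */<br />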
$isRogueBot          = FALSE;<br />
$isCrawler           = NULL;<br />
$requestUri          = $_SERVER["REQUEST_URI"];<br />
$absoluteUri         = $scheme .$canonicalServerName .$requestUri;<br />
$uriParts            = parse_url($absoluteUri);<br />
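$requestScript       = isset($uriParts["path"]) ? $uriParts["path"] : "/"; /* path of the requested URI; PHP_SELF would name the request handler itself */<br />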
$httpResponseCode    = NULL;<br />
</code></p>
<h3 id="sUri-impl-rogue-bot">Block rogue bots</h3>
<p>You don&#8217;t want to waste resources by serving your valuable content to useless bots. Here are a few ideas how to block rogue (crappy, not behaving, &#8230;) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: <a href="http://twitter.com/IncrediBill">IncrediBill</a>.</p>
<p>While handling bots, you should detect search engine crawlers, too:</p>
<p><code>/* lookup your </code><a href="http://fantomaster.com/fasvsspy01.html">crawler IP database</a><code> to populate $isCrawler; then, if the IP wasn't identified as search engine crawler: */<br />
if ($isCrawler !== TRUE) {<br />
    $crawlerName         = NULL;<br />
    $crawlerHost         = NULL;<br />
    $crawlerServer       = NULL;<br />
    if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = ".crawl.baidu.com";}<br />
    ...<br />
    if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ".googlebot.com"; }<br />
    if ($crawlerName != NULL) {<br />
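        $reverseIp = @gethostbyaddr($requestIp);<br />
        /* the host name must END with the crawler's domain; a substring match would accept "crawl.googlebot.com.example.net" */<br />
        if (strtolower(substr($reverseIp, -strlen($crawlerServer))) != $crawlerServer) {<br />
            $isCrawler = FALSE;<br />
        }<br />
        if ("$reverseIp" == "$requestIp") {<br />
            $isCrawler = FALSE;<br />
        }<br />
        if ($isCrawler !== FALSE) {<br />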
            $chkIpAddyRev = @gethostbyname($reverseIp);<br />
            if ("$chkIpAddyRev" == "$requestIp") {<br />
                $isCrawler   = TRUE;<br />
                $crawlerHost = $reverseIp;<br />
                // store the newly discovered crawler IP<br />
            }<br />
        }<br />
    }<br />
}<br />
</code></p>
<p>If Baidu doesn&#8217;t send you any traffic, it makes sense to block its crawler. This piece of crap doesn&#8217;t behave anyway.<code><br />if ($isCrawler &amp;&amp; "$crawlerName" == "Baiduspider") {<br />
$isRogueBot = TRUE;<br />
}<br />
</code></p>
<p>Another SE candidate is <a href="http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/">Bing&#8217;s spam bot</a> that tries to manipulate stats on search engine usage. If you don&#8217;t approve of such scams, block incoming requests from the IP address range 65.52.0.0 to 65.55.255.255 (131.107.0.0 to 131.107.255.255 &#8230;) when the referrer is a Bing SERP. With this method you might occasionally block searching Microsoft employees who aren&#8217;t aware of their company&#8217;s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.</p>
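<p>A sketch of that check, using only the two IP ranges mentioned above. The function name <code>isBingRefererSpam()</code> is made up, and I treat every referrer on bing.com as a SERP, which is a simplification:</p>
<p><code>function isBingRefererSpam($ip, $referrer) {<br />
    $ranges = array(<br />
        array("65.52.0.0", "65.55.255.255"),<br />
        array("131.107.0.0", "131.107.255.255"),<br />
    );<br />
    $host = strtolower((string) parse_url($referrer, PHP_URL_HOST));<br />
    if (!preg_match('/(^|\.)bing\.com$/', $host)) {<br />
        return FALSE; /* referrer is not a Bing page */<br />
    }<br />
    $long = ip2long($ip);<br />
    if ($long === FALSE) {<br />
        return FALSE; /* not an IPv4 address */<br />
    }<br />
    foreach ($ranges as $range) {<br />
        if ($long >= ip2long($range[0]) &amp;&amp; $long &lt;= ip2long($range[1])) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    return FALSE;<br />
}<br />
</code></p>
<p>In the request handler that boils down to <code>if (isBingRefererSpam($requestIp, $requestReferrer)) {$isRogueBot = TRUE;}</code>, or to a branch that serves the explanatory page instead.</p>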
<p>Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you&#8217;re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:<code><br />if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}</code><br />If you operate a porn site you should refine this approach.</p>
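<p>The <code>notSafeAtWork()</code> function isn&#8217;t spelled out above. A naive version could look like the sketch below; the default keyword list is just a placeholder, and plain substring matching will produce false positives, so maintain your own list with care:</p>
<p><code>function notSafeAtWork($referrer, $keywords = array("porn", "viagra", "debt-consolidation")) {<br />
    $haystack = strtolower(urldecode($referrer));<br />
    foreach ($keywords as $keyword) {<br />
        if ($keyword !== "" &amp;&amp; strpos($haystack, strtolower($keyword)) !== FALSE) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    return FALSE;<br />
}<br />
</code></p>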
<p>As for blocking requests by IP addy I&#8217;d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a <code>@gethostbyaddr($requestIp)</code> DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.<code><br />if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}</code></p>
<p>You won&#8217;t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [<a href="http://www.google.com/search?hl=en&#038;safe=off&#038;num=100&#038;q=block+user+agent">block user agent</a>], then pick what fits your needs best from roughly two million search results. You should maintain a database table for ugly user agents, too. Or code<code><br />if (<span style="color:silver;">isBlacklistedUa($requestUserAgent) ||</span><br /> stristr($requestUserAgent,"ThingFetcher")) {$isRogueBot = TRUE;}</code><br />By the way, the owner of ThingFetcher really should stand up now. I&#8217;ve sent a complaint to Rackspace and I&#8217;ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn&#8217;t bother to check for robots.txt.</p>
<p>Finally, serve rogue bots what they deserve:<code><br />if ($isRogueBot === TRUE) {<br />
header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);<br />
exit;<br />
}</code></p>
<p>If you&#8217;re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a <code>file_get_contents($requestReferrer);</code> and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to http://example.com/go-fuck-yourself, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I&#8217;d go for the 403-GFY response.</p>
<h3 id="sUri-impl-srv-name">Server name canonicalization</h3>
<p>Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under <b>one</b> server name:<code><br />if ("$serverName" != "$canonicalServerName") {<br />
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);<br />
    header("Location: $absoluteUri");<br />
    header("X-Canonical-URI: $absoluteUri"); // </code><a href="http://sebastians-pamphlets.com/x-canonical-uri-http-header/">experimental</a><code><br />
    header("Link: &lt;$absoluteUri&gt;; rel=canonical"); // experimental<br />
    exit;<br />
}<br />
</code></p>
<p>Subdomains are so 1999, also 2010 is the year of non-&#8217;.www&#8217; URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite &#8230; the code from this pamphlet as you like. However, you must not change the <code>$canonicalServerName = "sebastians-pamphlets.com";</code> statement. I&#8217;ll appreciate the traffic. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>When the server name is Ok, you should add some basic URI canonicalization routines here. For example add trailing slashes &#8211;if necessary&#8211;, and remove clutter from query strings.</p>
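<p>As for trailing slashes, here&#8217;s a sketch that relies on the <code>$uriParts</code> and <code>$rootUri</code> variables from the initialization. It assumes that every path without a file extension is directory-like on your site. That&#8217;s my assumption, not a rule: short URIs don&#8217;t end with a slash, so they&#8217;d have to be resolved before this check fires. The helper name <code>addTrailingSlash()</code> is made up:</p>
<p><code>function addTrailingSlash($path) {<br />
    $pos         = strrpos($path, "/");<br />
    $lastSegment = ($pos === FALSE) ? $path : (string) substr($path, $pos + 1);<br />
    if ($lastSegment !== "" &amp;&amp; strpos($lastSegment, ".") === FALSE) {<br />
        return $path ."/"; /* "/about" becomes "/about/" */<br />
    }<br />
    return $path; /* "/about/" and "/img/logo.png" stay as they are */<br />
}<br />
$canonicalPath = addTrailingSlash($uriParts["path"]);<br />
if ($canonicalPath != $uriParts["path"]) {<br />
    $canonicalUri = $rootUri .$canonicalPath;<br />
    if (!empty($uriParts["query"])) {<br />
        $canonicalUri .= "?" .$uriParts["query"];<br />
    }<br />
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);<br />
    header("Location: $canonicalUri");<br />
    exit;<br />
}<br />
</code></p>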
<p id="remove-google-utm-clutter-on-arrival">Sometimes even smart developers do evil things with your URIs. For example <a href="http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/">Yahoo truncates the trailing slash</a>. And <a href="http://sebastians-pamphlets.com/troubles-made-by-utm-variables-from-google-analytics/">Google badly messes up your URIs for click tracking purposes</a>. Here&#8217;s how you can &#8216;heal&#8217; the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> ):<code><br />$cleanParams  = array();<br />
$clutterFound = FALSE;<br />
foreach ($_GET as $var => $crap) {<br />
    if (stripos($var, "utm_") === 0) {<br />
        $clutterFound = TRUE; /* tracking variable, drop it */<br />
    } else {<br />
        $cleanParams[$var] = $crap;<br />
    }<br />
}<br />
if ($clutterFound) {<br />
    $qs = http_build_query($cleanParams); /* re-encodes what's left of the query string */<br />
    $canonicalUri = $scheme .$canonicalServerName .$uriParts["path"];<br />
    if (!empty($qs)) {<br />
        $canonicalUri .= "?" .$qs;<br />
    }<br />
    /* no fragment handling: user agents never send the "#..." part to the server */<br />
    header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);<br />
    header("Location: $canonicalUri");<br />
    exit;<br />
}<br />
</code></p>
<p>By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.</p>
<p>Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (<a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-screwed-uris">see below</a>), but some are self-made.</p>
<p>Therefore I&#8217;d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes doesn&#8217;t need a log entry.</p>
<hr color="silver" size="1" width="150" align="left" style="margin-left:20px;margin-top:50px;" />
<p style="margin-left:20px;">Before you&#8217;re going to serve your content, do a lookup in your <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-short-uri">shortUri</a> table. If the requested URI is a shortened URI pointing to your own stuff, don&#8217;t perform a redirect but serve the content under the shortened URI.</p>
<h3 id="sUri-impl-static-stuff">Deliver static stuff (images &#8230;)</h3>
<p>Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we&#8217;ve bypassed this functionality, do it yourself:<code><br />$staticFile = realpath($rootPath .urldecode($uriParts["path"]));<br />
if (empty($uriParts["query"]) &amp;&amp; $staticFile !== FALSE &amp;&amp; is_file($staticFile)<br />
    &amp;&amp; strpos($staticFile, realpath($rootPath) .DIRECTORY_SEPARATOR) === 0 /* no "../" escapes */<br />
    &amp;&amp; !preg_match('/\.(php|inc|htaccess|htpasswd)$/i', $staticFile)) { /* never dump script sources or server config */<br />
    header("Content-type: " .getContentType($staticFile), TRUE);<br />
    readfile($staticFile);<br />
    exit;<br />
}<br />
/* getContentType($filename) returns a </code><a href="http://www.iana.org/assignments/media-types/">MIME media type</a><code> like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */<br />
</code></p>
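<p>A bare-bones <code>getContentType()</code> could be a simple lookup by file extension. The list below is deliberately short and only an example, so extend it with whatever your site serves:</p>
<p><code>function getContentType($filename) {<br />
    $types = array(<br />
        "jpg"  => "image/jpeg",<br />
        "jpeg" => "image/jpeg",<br />
        "gif"  => "image/gif",<br />
        "png"  => "image/png",<br />
        "pdf"  => "application/pdf",<br />
        "txt"  => "text/plain",<br />
        "css"  => "text/css",<br />
        "html" => "text/html",<br />
    );<br />
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));<br />
    /* fall back to a generic type, never return an empty string */<br />
    return isset($types[$extension]) ? $types[$extension] : "application/octet-stream";<br />
}<br />
</code></p>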
<p>If your dynamic stuff mimics static files for some reason, and those files do exist, make sure you don&#8217;t handle them here.</p>
<p>Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your <a href="http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/">smart robots.txt</a> to maintain your crawler-IP database and more.</p>
<h3 id="sUri-impl-dynamic-stuff">Execute script (dynamic URI)</h3>
<p>Say you&#8217;ve a WP blog in /blog/, then you can invoke WordPress with <code><br />if (substr($requestUri, 0, 6) == "/blog/") {<br />
require("$rootPath/blog/index.php");<br />
exit;<br />
}<br />
</code></p>
<p>(Perhaps the WP configuration needs a tweak to make this work.) There&#8217;s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.</p>
<p>Fortunately, when WordPress calls the <a href="http://sebastians-pamphlets.com/404/">404 page</a> (wp-content/themes/yourtheme/404.php), it hasn&#8217;t sent any output or headers yet. That means you can include the procedures discussed below in WP&#8217;s 404.php:<code><br />$httpResponseCode = "404";<br />
$errSrc = "WordPress";<br />
$errMsg = "The blog couldn't make sense out of this request.";<br />
require("$includePath/err.php");<br />
exit;<br />
</code></p>
<p>Like in my WordPress example, you&#8217;ll find a way to call your scripts so that they don&#8217;t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.</p>
<h3 id="sUri-impl-short-uri">Resolve shortened URI</h3>
<p>If you&#8217;re shortening your very own URIs, then you should look up a matching $requestUri in the shortUri table before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.</p>
<h4 id="sUri-impl-components">Excursus: URI shortener components</h4>
<p>Using the hints below you should be able to code your own URI shortener. You don&#8217;t need all the bells and whistles (like stats) overloading most scripts available on the Web.</p>
<ul>
<li id="sUri-impl-tbl-shorturi"><p><b>A database table</b> with at least these attributes:</p>
<ul>
<li>shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)</li>
<li>shortUri.suriUri, text, indexed, stores the original URI</li>
<li>shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)</li>
</ul>
<p>Storing page titles and content (snippets) makes sense, but isn&#8217;t mandatory. For outputs like &#8220;recently shortened URIs&#8221; you need a timestamp attribute.</p></li>
<li id="sUri-impl-create-method"><p><b>A method to create a shortened URI</b>.<br />
Make that an independent script callable from a Web form&#8217;s server procedure, via Ajax, SOAP, etc.</p>
<p>Without a given shortcut, use the primary key to create one. <code>base_convert(intval($suriId), 10, 36);</code> converts an integer into a short string. If you can&#8217;t do that in a database insert/create trigger procedure, retrieve the primary key&#8217;s value with <code>LAST_INSERT_ID()</code> or so and perform an update.</p>
<p>URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your <i>create short URI</i> method should return a previously created shortcut then.</p>
<p>If you&#8217;re storing titles and such stuff grabbed from the destination page, don&#8217;t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.</p>
<p>With the shortcut returned, build the short URI on-the-fly <code>$shortUri = getBaseUri() ."/" .$suriShortcut;</code> (so you can use your URI shortener across all your sites).</p></li>
<li id="sUri-impl-retrieve-method"><p><b>A method to retrieve the original URI</b>.<br />
Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.</p>
<p>Bear in mind that shortened URIs spread via social media do get abused. A shortcut like &#8216;xxyyzz&#8217; can appear as &#8216;xxyyz..&#8217;, &#8216;xxy&#8217;, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-error-page">error page</a> with <a href="http://tag.us.com/_sur...yadayadayada" title="Example from an experimental site, not exactly perfect">suggestions</a>.</p></li>
<li id="sUri-impl-suri-maint"><b>A Web form to create and edit shortened URIs</b>.<br />
Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
</li>
<li id="sUri-impl-api">If you want to use your URI shortener with a Twitter client, then build an <a href="http://code.google.com/p/bitly-api/wiki/ApiDocumentation">API</a>.
</li>
<li id="sUri-impl-click-stats">If you need particular stats for your short URIs pointing to foreign sites that your analytics package can&#8217;t deliver, then store those click data separately.<br />
<code>// end excursus</code>
</li>
</ul>
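<p>The create and retrieve steps from the excursus can be sketched roughly as follows. This is a minimal sketch, not the code running on this site: the function names <code>shortcutFromId()</code>, <code>shortcutFromRequestUri()</code> and <code>likePrefixPattern()</code> are made up for illustration, and the broader query assumes your database supports a <code>LIKE</code> prefix match with backslash as the escape character.</p>

```php
<?php
// Hypothetical helpers, names invented for this sketch.

// Derive a shortcut from the record's integer primary key (base 10 to base 36).
function shortcutFromId($suriId) {
    return base_convert((string) intval($suriId), 10, 36);
}

// Reduce a REQUEST_URI to a bare shortcut candidate: drop query string and
// fragment, the leading slash, and trailing dots/slashes left by truncation.
function shortcutFromRequestUri($requestUri) {
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    $candidate = ltrim($path, "/");
    return rtrim($candidate, "./");
}

// Build a prefix pattern for the broader query, escaping LIKE wildcards.
// Assumption: the pattern is bound as a parameter, never concatenated into SQL.
function likePrefixPattern($candidate) {
    return addcslashes($candidate, "%_\\") . "%";
}
```

<p>If the exact lookup fails, run the broader query with the prefix pattern and use the result only when exactly one record matches.</p>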
<p>If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.<code><br />$suriUri = resolveShortUri($requestUri);<br />
if ($suriUri === FALSE) {<br />
    $httpResponseCode = "404";<br />
    $errSrc = "sUri";<br />
    $errMsg = "Invalid short URI. Shortcut resolves to more than one result.";<br />
    require("$includePath/err.php");<br />
    exit;<br />
}<br />
if (!empty($suriUri)) {<br />
    if (!stristr($suriUri, $canonicalServerName)) {<br />
        header("HTTP/1.1 301 Here you go", TRUE, 301);<br />
        header("Location: $suriUri");<br />
        exit;<br />
    }<br />
}<br />
</code></p>
<p>Otherwise ($suriUri is yours) deliver your content without redirecting.</p>
<h3 id="sUri-impl-redirect">Redirect to destination (invalid request)</h3>
<p id="sUri-impl-screwed-uris">From reading your raw logs (404 stats don&#8217;t cover 302-Found crap) you&#8217;ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn&#8217;t make sense to show visitors following such a link your 404 page.</p>
<p>Most screwed URIs are unique in a way that they still &#8216;address&#8217; one particular resource on your server. You should <a href="http://sebastians-pamphlets.com/getting-the-most-out-of-googles-404-stats/">maintain a mapping table</a> for all identified screwed URIs, pointing to the canonical URI. When you can identify a resource from a lookup in this mapping table, then do a 301 redirect to the canonical URI.</p>
<p>When you feature a &#8220;product of the week&#8221;, &#8220;hottest blog post&#8221;, &#8220;today&#8217;s joke&#8221; or so, then bookmarkers will love it when its URI doesn&#8217;t change. For such transient URIs do a 307 redirect to the currently featured page. Don&#8217;t fear non-existing &#8216;duplicate content penalties&#8217;. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you&#8217;ll still get the SERP traffic you deserve.</p>
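<p>A lookup along these lines could decide between the 301 and the 307. It is only a sketch under assumptions: the two arrays stand in for database tables, and the paths and the function name <code>lookupRedirect()</code> are invented for illustration.</p>

```php
<?php
// Hypothetical data; in practice both tables would live in your database.
$screwedUriMap = array(
    "/produkts/widget.htm" => "/products/widget",
);
$transientUriMap = array(
    "/product-of-the-week" => "/products/widget",
);

// Returns array(statusCode, destination), or NULL when no mapping exists.
function lookupRedirect($path, array $screwedUriMap, array $transientUriMap) {
    if (isset($screwedUriMap[$path])) {
        // Permanent: the inbound link is simply broken.
        return array(301, $screwedUriMap[$path]);
    }
    if (isset($transientUriMap[$path])) {
        // Temporary: the destination changes with the featured page.
        return array(307, $transientUriMap[$path]);
    }
    return NULL;
}
```

<p>With a non-NULL result, send the redirect with <code>header("Location: " . $result[1], TRUE, $result[0]);</code> and exit. Otherwise continue with the next step of your request handler.</p>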
<h3 id="sUri-impl-guess-uri">Guess destination (invalid request)</h3>
<p>For many <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-screwed-uris">screwed URIs</a> you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.</p>
<p>Once you&#8217;ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your <i>screwed URIs mapping table</i> (see above).</p>
<p>When you can&#8217;t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.</p>
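<p>Extracting the search phrase from the referrer might look like the sketch below. It assumes the search engine passes the query in a <code>q</code> parameter, which holds for many engines but not for all, and <code>searchQueryFromReferrer()</code> is a made-up name.</p>

```php
<?php
// Hypothetical helper: pull the search phrase out of a search engine referrer.
// Assumption: the engine passes the query in a 'q' parameter.
function searchQueryFromReferrer($referrer) {
    $queryString = parse_url($referrer, PHP_URL_QUERY);
    if (!is_string($queryString)) {
        return "";
    }
    parse_str($queryString, $params);
    if (!isset($params["q"]) || !is_string($params["q"])) {
        return "";
    }
    // Strip operators such as site:example.com or inurl:foo, then tidy whitespace.
    $query = preg_replace('/\b(?:site|inurl):\S+/i', " ", $params["q"]);
    return trim(preg_replace('/\s+/', " ", $query));
}
```

<p>Pass a non-empty result to your site search. An empty string means there is nothing usable in the referrer, so fall through to the error page.</p>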
<h3 id="sUri-impl-error-page">Serve a useful error page</h3>
<p>Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example don&#8217;t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occurred, but your error page&#8217;s prominent message should consist of hints on how the visitor can reach the content they expected.</p>
<p>A central error page gets invoked from various scripts. Unfortunately, err.php can&#8217;t be sure that none of these scripts has outputted something to the user. With a previous output of just one single byte you can&#8217;t send an HTTP response header. Hence prefix the header() statement with a &#8216;@&#8217; to suppress PHP error messages, and catch and log errors.</p>
<p>Before you output your wonderful error page, send a 404 header:<code><br />if (!isset($httpResponseCode)) {<br />
    $httpResponseCode = "404";<br />
}<br />
if (empty($httpResponseCode)) {<br />
    $httpResponseCode = "501"; // log for debugging<br />
}<br />
@header("HTTP/1.1 $httpResponseCode Shit happens", TRUE, intval($httpResponseCode));<br />
logHeaderErr(error_get_last());<br />
</code></p>
<p>In rare cases you better send a 410-Gone header, for example when Matt&#8217;s team has discovered a shitload of questionable pages and you&#8217;ve filed a reconsideration request.</p>
<p>In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.</p>
<p>Use possible input ($errSrc, $errMsg, &#8230;) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [<a href="http://www.google.com/search?hl=en&#038;num=100&#038;q=best+404+page+ever">* 404 page *</a>] might inspire you (<a href="http://yoast.com/404-error-pages-wordpress/">WordPress users click here</a>). </p>
<hr align="center" color="silver" size="1" width="150" style="margin-top:70px;" />
<p>All errors are mine. In other words, be careful when you grab my untested code examples. It&#8217;s all dumped from memory without further thoughts and didn&#8217;t face a syntax checker.</p>
<p>I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I&#8217;d be happy to discuss your thoughts in the comments. Thanks for your time.</p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span><div class="topsy_widget_data topsy_theme_light-green" style="float: right;margin-left: 0.75em;"><!-- { "url": "http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/", "style": "big", "title": "How to cleverly integrate your own URI shortener" } --></div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The anatomy of a deceptive Tweet spamming Google Real-Time Search</title>
		<link>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/</link>
		<comments>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 10:12:44 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Twitter]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/</guid>
		<description><![CDATA[
Minutes after the launch of Google&#8217;s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.
I&#8217;m sure Google&#8217;s WebSpam team will pull the plug sooner or later, but as of today Google&#8217;s real time search results are extremely vulnerable to questionable content.
The somewhat [...]]]></description>
			<content:encoded><![CDATA[
<p><img  src="http://sebastians-pamphlets.com/img/posts/spamming-google-real-time-search.png" width="250" height="345" align="right" style="margin-left:5px;" alt="Google real time search spammed and abused" title=""  />Minutes after the <a href="http://googleblog.blogspot.com/2009/12/relevance-meets-real-time-web.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">launch</a> of Google&#8217;s <a href="http://searchengineland.com/search-real-time-madness-31668">famous</a> Real Time Search, the Internet marketing community <a href="http://sphinn.com/story/135685">began</a> to <a href="http://outspokenmedia.com/seo/google-real-time-spam/">spam</a> the <a href="http://www.google.com/search?hl=en&#038;safe=off&#038;esrch=RTSearch&#038;tbo=1&#038;num=100&#038;q=spam&#038;tbs=rltm:1">scrolling SERPs</a>. Google gave birth to a <a href="http://www.seo-theory.com/2009/12/07/google-launches-a-new-spam-industry/">new spam industry</a>.</p>
<p>I&#8217;m sure Google&#8217;s <a href="http://friendfeed.com/dannysullivan/d973e438/real-time-spam-google-says-been-fighting-so-long">WebSpam</a> team will pull the plug sooner or later, but as of today Google&#8217;s real time search results are extremely vulnerable to questionable content.</p>
<p>The somewhat shady approach to make creative use of real time search I&#8217;m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I&#8217;d dump the whole real-time thingy until the spam defense lines are rock solid.</p>
<p id="rtss-recipe"><strong>Here&#8217;s the recipe from Dr Evil&#8217;s WebSpam-Cook-Book:</strong></p>
<h3 id="rtss-ingredients">Ingredients</h3>
<ul>
<li>1 <a href="http://www.google.com/trends?q=spam+google">popular topic</a> that pulls lots of searches, but not so many that the results scroll down too fast.</li>
<li>1 <a href="http://www.google.com/products?q=spam+google&#038;hl=en&#038;aq=f">landing page</a> that makes the punter pull out the plastic in no time.</li>
<li>1 <a href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&#038;answer=93713">trusted authority page</a> totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.</li>
<li>1 <a href="http://goo.gl/">short</a> domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.</li>
</ul>
<h3 id="rtss-preparation">Preparation</h3>
<p>Create a new text file and name it <code>hot-topic.php</code> or so. Then code:<code><br />
&lt;?php<br />
$landingPageUri = "http://affiliate-program.com/?your-aff-id";<br />
$trustedPageUri = "http://google.com/something.py";<br />
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {<br />
   header("HTTP/1.1 307 Here you go today", TRUE, 307);<br />
   header("Location: $trustedPageUri");<br />
}<br />
else {<br />
   header("HTTP/1.1 301 Happy shopping", TRUE, 301);<br />
   header("Location: $landingPageUri");<br />
}<br />
exit;<br />
?&gt;</code></p>
<p>Provided you&#8217;re a savvy spammer, your crawler detection routine will be a little more <a href="http://fantomaster.com/fasvsspy01.html">complex</a>.</p>
<p>Save the file and upload it, then test the URI <code>http://youspamaw.ay/hot-topic.php</code> in your browser.</p>
<h3 id="rtss-serving">Serving</h3>
<ul>
<li>Log in to Twitter and submit lots of nicely crafted, not overly keyword-stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don&#8217;t swear, and sail around phrases like &#8216;buy cheap viagra&#8217; with synonyms like &#8216;brighten up your girl friend&#8217;s romantic moments&#8217;.</li>
<li>On their SERPs, Google will display the text from the trusted page&#8217;s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.</li>
<li>Just for entertainment, closely monitor Google&#8217;s real time SERPs, and your real-time sales stats as well.</li>
<li>Be happy and get rich by end of the week.</li>
</ul>
<p>Google removes links to untrusted destinations, that&#8217;s why you need to abuse authority pages. As long as you don&#8217;t launch f-bombs, Google&#8217;s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.</p>
<p>Hey <a href="http://twitter.com/GoogleWebspam">Google</a>, for the sake of our children, take that as a spam report!</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How to borrow relevance from authority pages with 307 redirects</title>
		<link>http://sebastians-pamphlets.com/how-to-steal-relevancy-with-307-redirects-no-longer/</link>
		<comments>http://sebastians-pamphlets.com/how-to-steal-relevancy-with-307-redirects-no-longer/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 18:13:28 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-steal-relevancy-with-307-redirects-no-longer/</guid>
		<description><![CDATA[
Every once in a while I switch to Dr Evil mode. That&#8217;s a &#8220;do more evil&#8221; type of pamphlet. Don&#8217;t bother reading the disclaimer, just spam away &#8230;
Why the heck should you invest valuable time into crafting out compelling content, when there&#8217;s a shortcut?
There are so many awesome Web pages out there, just pick some [...]]]></description>
			<content:encoded><![CDATA[
<p>Every once in a while I switch to Dr Evil mode. That&#8217;s a &#8220;do more evil&#8221; type of pamphlet. Don&#8217;t bother reading the disclaimer, just spam away &#8230;</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/steal-relecanvy-with-307-redirects.png" width="200" height="174" style="margin-left:5px;" align="right" alt="Content theft with 307 redirects" title="" />Why the heck should you invest valuable time into crafting out <a href="http://sebastians-pamphlets.com/dont-underestimate-the-truth-in-se-quality-guidelines/">compelling content</a>, when there&#8217;s a shortcut?</p>
<p>There are so many awesome Web pages out there, just pick some and steal their content. You say &#8220;duplicate content issues&#8221;, I say &#8220;don&#8217;t worry&#8221;. You say &#8220;copyright violation&#8221;, I say &#8220;be happy&#8221;. Below I explain the setup.</p>
<p>This somewhat shady <acronym title="Internet Marketing">IM</acronym> technique is for you when you&#8217;re shy of automated content generation.</p>
<p>Register a new (short!) domain and create a tiny site with a few pages of totally unique and somewhat interesting content. Write opinion pieces, academic papers or whatnot, just don&#8217;t use content generators or anything that cannot pass a human bullshit detector. No advertising. No questionable links. Instead, link out to authority pages. No SEO stuff like nofollow&#8217;ed links to imprints or so.</p>
<p>Launch with a few links from clean pages. Every now and then drop a deep link in relevant discussions on forums or social media sites. Let the search engines become familiar with your site. That&#8217;ll attract even a few natural inbound links, at least if your content is linkworthy.</p>
<p>Use <a href="http://google.com/webmasters/">Google&#8217;s Webmaster Console</a> (GWC) to monitor your progress. Once all URIs from your sitemap are indexed and show in [site:yourwebspam.com] searches, begin to expand your site&#8217;s menu and change outgoing links to authority pages embedded in your content.</p>
<p>Create short URIs (LE 20 characters!) that point to authority pages. Serve search engine crawlers a <a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#307-temporary-redirect">307</a>, and human surfers a <a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#301-moved-permanently">301</a> redirect. Build deep links to those URIs, for example in tweets. Once you&#8217;ve gathered 1,000+ inbounds, you&#8217;ll receive SERP traffic. By the way, don&#8217;t buy the sandbox myths.</p>
<p>Watch the <i>keywords</i> page in your GWC account. It gets populated with keywords that appear only in content of pages you&#8217;ve hijacked with redirects. Watch your [site:yourwebspam.com] SERPs. Usually the top 10 keywords listed in the GWC report will originate from pages listed on the first [site:yourwebspam.com] SERPs, provided you&#8217;ve hijacked awesome content.</p>
<p>Add (new) keywords from pages that appear both in redirect destinations listed within the first 20 [site:yourwebspam.com] search results, as well as in the first 20 listed keywords, to articles you actually serve on your domain.</p>
<p>Detect SERP referrers (human surfers who&#8217;ve clicked your URIs on search result pages) and redirect those to sales pitches. That goes for content pages as well as for redirecting URIs (mimicking <a href="http://sebastians-pamphlets.com/links/categories/?cat=s-url">shortened URIs</a>). Laugh all the way to the bank.</p>
<p>Search engines rarely will discover your scam. Of course shit happens, though. Once the domain is burned, just block crawlers, redirect everything else to your sponsors, and let the domain expire.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/307-content-theft-is-history.png" width="150" height="119" style="margin-left:5px;" align="right" alt="History: Content theft with 307 redirects" title="" /><b>Disclaimer:</b> Google has put an end to most 307 spam tactics. That&#8217;s why I&#8217;m publishing all this crap. Because watching decreasing traffic to spammy sites is frustrating. Deceptive 307&#8217;ing URIs won&#8217;t rank any more. Slowly, actually very slowly, GWC reports follow suit.</p>
<p>What can we learn? Do not believe in the truth of search engine reports. Just because Google&#8217;s webmaster console tells you that Google thinks a keyword is highly relevant to your site, that doesn&#8217;t mean you&#8217;ll rank for it on their SERPs. Most probably GWC is not the average search engine spammer&#8217;s tool of the trade.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-steal-relevancy-with-307-redirects-no-longer/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Search engines should make shortened URIs somewhat persistent</title>
		<link>http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/</link>
		<comments>http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 17:29:26 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Social Web]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[URI shortening]]></category>

		<category><![CDATA[Twitter]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/</guid>
		<description><![CDATA[
URI shorteners are crap. Each and every shortened URI expresses a design flaw. All &#8211;or at least most&#8211; public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to &#8220;put traffic at risk&#8221;. Not to speak of link love (PageRank, Google [...]]]></description>
			<content:encoded><![CDATA[
<p><a href="http://tag.us.com/uri-shorteners-suck-ass.htm">URI shorteners are crap</a>. Each and every shortened URI expresses a design flaw. All &#8211;or at least most&#8211; public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to &#8220;put traffic at risk&#8221;. Not to speak of link love (PageRank, Google juice, link popularity) lost forever.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/se-rescue-short-url.png" width="250" height="222" align="right" alt="SEs could rescue tiny URLs" title="Dear search engines, please make our shortened URIs persistent!" style="margin-left:3px;" />Search engines could provide a way out of the <strong>sURL dilemma</strong> that Twitter &amp; Co created with their crappy, thoughtless and shortsighted software designs. Here&#8217;s how:</p>
<p>Most browsers support search queries in the address bar, as well as suggestions (aka search results) on DNS errors, and sometimes even on 404s or other HTTP response codes outside the 200/3xx range. That means browsers &#8220;ask a search engine&#8221; when an HTTP request fails.</p>
<p>When a <acronym title="Top Level Domain, that's .com/.net/.org...">TLD</acronym> is out of service, search engines could have crawled a 301 or meta refresh from a page formerly living on a <code>.yu</code> domain for example. They know the new address and can lead the user to this (working) URI.</p>
<p>The same goes for shortened URIs created ages ago by URI shortening services that died in the meantime. Search engines have transferred all the link juice from the shortened URI to the destination page already, so why not point users that request a dead <i>short URI</i> to the right destination?</p>
<p>Search engines have all the data required for rescuing short URIs that are out of service in their databases. Not de-indexing &#8220;outdated&#8221; URIs belonging to URI shorteners would be a minor tweak. At least Google has stored attributes and behavior of all links on the Web since the last century, and most probably other search engines are operated by data rats too.</p>
<p>URI shorteners can be identified by simple patterns. They gather tons of inbound links from foreign domains that get redirected (not always using a 301!) to URIs on other 3rd party domains. Of course that applies to some AdServers too, but rest assured search engines do know the differences.</p>
<p><strong>So why the heck haven&#8217;t Google, <strike>Yahoo/MSN</strike> Bing, and Ask offered such a service yet? I thought it&#8217;s all about users, but I might have misread something. Sigh.</strong></p>
<p><small>By the way, I&#8217;ve recorded search engine misbehavior with regard to shortened URIs that could arouse Jack The Ripper, but that&#8217;s a completely different story.</small></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Debugging robots.txt with Google Webmaster Tools</title>
		<link>http://sebastians-pamphlets.com/debugging-robots-txt-with-google-webmaster-tools/</link>
		<comments>http://sebastians-pamphlets.com/debugging-robots-txt-with-google-webmaster-tools/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 17:32:01 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/debugging-robots-txt-with-google-webmaster-tools/</guid>
		<description><![CDATA[
Although Google&#8217;s Webmaster Console is a really neat toolkit, it can mislead the not-that-savvy crowd every once in a while.
When you go to &#8220;Diagnostics::Crawl Errors::Restricted by robots.txt&#8221; and you find URIs that aren&#8217;t disallow&#8217;ed or even noindex&#8217;ed in your very own robots.txt, calm down.
Google&#8217;s cool robots.txt validator withdraws its knowledge of redirects and approves your [...]]]></description>
			<content:encoded><![CDATA[
<p>Although <a href="http://webmasters.google.com/">Google&#8217;s Webmaster Console</a> is a really neat toolkit, it can mislead the not-that-savvy crowd every once in a while.</p>
<p>When you go to &#8220;Diagnostics::Crawl Errors::Restricted by robots.txt&#8221; and you find URIs that aren&#8217;t disallow&#8217;ed or even noindex&#8217;ed in your very own robots.txt, calm down.</p>
<p>Google&#8217;s cool robots.txt validator withholds its knowledge of redirects and approves your redirecting URIs, driving you nuts until you <a href="http://www.seoconsultants.com/tools/headers/">check each URI&#8217;s HTTP response code</a> for <a href="http://sebastians-pamphlets.com/links/categories/?cat=redirects">redirects</a> (<a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/">HTTP response codes 301, 302 and 307</a>, as well as <a href="http://sebastians-pamphlets.com/google-and-yahoo-treat-undelayed-meta-refresh-as-301-redirect/">undelayed meta refreshes</a>).</p>
<p>Google obeys robots.txt even in a chain of redirects. If for Google&#8217;s user agent(s) a URI given in an HTTP header&#8217;s <code>location</code> is disallow&#8217;ed or noindex&#8217;ed, Googlebot doesn&#8217;t fetch it, regardless of its position in the current chain of redirects. Even a robots.txt block in the 5th hop stops the greedy Web robot. Those URIs are correctly reported back as &#8220;restricted by robots.txt&#8221;, Google just refuses to tell you that the blocking crawler directive originates from a foreign server.</p>
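<p>To see which hops a reported URI actually runs through, you could list them yourself. The sketch below is an assumption-laden helper, not a Google tool: <code>redirectHops()</code> is a made-up name, and it expects the flat list of header lines that PHP&#8217;s <code>get_headers()</code> returns when it follows a chain of redirects.</p>

```php
<?php
// Hypothetical helper: split a flat list of response header lines into one
// entry per hop, each with its status code and Location header (if any).
function redirectHops(array $headerLines) {
    $hops = array();
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // A status line starts a new hop.
            $hops[] = array("status" => intval($m[1]), "location" => NULL);
        } elseif ($hops && preg_match('/^Location:\s*(.+)$/i', $line, $m)) {
            // A Location header belongs to the most recent hop.
            $hops[count($hops) - 1]["location"] = trim($m[1]);
        }
    }
    return $hops;
}
```

<p>Feed it <code>get_headers($uri)</code>, then check each hop&#8217;s location against the robots.txt of the server it points to.</p>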
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/debugging-robots-txt-with-google-webmaster-tools/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Upgrading from IIS/ASP to Apache/PHP</title>
		<link>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/</link>
		<comments>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/#comments</comments>
		<pubDate>Tue, 11 Dec 2007 20:47:25 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[404grabber]]></category>

		<category><![CDATA[Duplicate Content]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Copy+Paste-Penalties]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[IIS]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/</guid>
		<description><![CDATA[
Once you&#8217;re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won&#8217;t work, and you can&#8217;t run MS-Access &#8220;databases&#8221; and such stuff under Apache. 
Here is my idea of a smooth migration from IIS/ASP to [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/upgrade-from-iis-asp-to-apache-php.png" width="250" height="227" align="right" style="margin-left:4px;" alt="Upgrade from Windows/IIS/ASP to Unix/Apache/PHP" title="Get the most out of your Web site - throw away Windows/IIS/ASP!"  />Once you&#8217;re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won&#8217;t work, and you can&#8217;t run MS-Access &#8220;databases&#8221; and such stuff under Apache. </p>
<p>Here is my idea of a smooth migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster&#8217;s portfolio and start over.</p>
<p>(Recently I got a tiny IIS/ASP site about <a href="http://link-condom.com/">uses &amp; abuses of link condoms</a> and moved it to an Apache server. I&#8217;m well known for brutal IIS rants, but so far I haven&#8217;t discussed a way out of such a dilemma, so I thought blogging this move could be a good idea.) </p>
<p>I don&#8217;t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer&#8217;s article <a href="http://dev.mysql.com/tech-resources/articles/migrating-from-microsoft.html">Migrating from Microsoft Access/MS-SQL to MySQL</a>, and try tools like <a href="http://asp2php.naken.cc/docs.php">ASP to PHP</a>. (With my tiny <a href="http://link-condom.com/about.asp">link condom</a> site I overwrote the ASP code with PHP statements in my primitive text editor.)</p>
<p><b>From an SEO perspective such an upgrade comes with pitfalls:</b></p>
<ul>
<li>Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.</li>
<li>Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.</li>
<li>Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.</li>
<li>IIS URIs are not case-sensitive, which means that /Default.asp will 404 on Apache when the filename is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.</li>
<li>Most probably search engines have URL variants in their indexes, so we want to adapt their URL canonicalization, at least where possible.</li>
<li>HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.</li>
<li>If the navigation makes use of relative links, we need to convert those to absolute URLs.</li>
<li>Error handling isn&#8217;t convertible. Improper error handling can cause decreasing search engine traffic.</li>
</ul>
<h3>Running /default.asp, /home.asp etc. as PHP scripts</h3>
<p>When you upload an .asp file to an Apache Web server, most user agents can&#8217;t handle it. Browsers treat such files as unknown file types and force downloads instead of rendering them. On top of that, Apache doesn&#8217;t parse those files for PHP statements, even when you&#8217;ve rewritten the ASP code already.</p>
<p>To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root: <code><b><br />
AddType text/html .asp<br />
AddHandler application/x-httpd-php .asp </b></code><br />
The first line says that .asp files shall be treated as HTML documents, and should force the server to send a <code>Content-Type: text/html</code> HTTP header. The second line tells Apache that it must parse .asp files for PHP code. </p>
<p>Just in case the AddType statement above doesn&#8217;t produce a <code>Content-Type: text/html</code> header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you&#8217;ve mod_headers available, you can accomplish that with this .htaccess code: <code><b><br />
&lt;IfModule mod_headers.c&gt;<br />
SetEnvIf Request_URI \.asp is_asp=is_asp<br />
Header set &quot;Content-type&quot; &quot;text/html&quot; env=is_asp<br />
Header set imagetoolbar &quot;no&quot;<br />
&lt;/IfModule&gt; </b></code><br />
(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)<br />
If for some reason mod_headers doesn&#8217;t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top: <code><b><br />
@header(&quot;Content-type: text/html&quot;, TRUE);  </b></code><br />
Instead of &#8220;text/html&#8221; alone, you can define the character set too: &#8220;text/html; charset=UTF-8&#8221;</p>
<h3>Sanitizing the home page URL by eliminating &#8220;default.asp&#8221;</h3>
<p>Instead of slowing down Apache by defining just another default document name (<code>DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp</code>), we get rid of &#8220;/default.asp&#8221; with this &#8220;/index.php&#8221; script: <code><b><br />
&lt;?php<br />
@require(&quot;default.asp&quot;);<br />
?&gt; </b></code><br />
Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.</p>
<p>Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess: <code><b><br />
Redirect 301 /default.asp http://example.com/<br />
Redirect 301 /Default.asp http://example.com/ </b></code></p>
<h3>Converting the ASP code for server name canonicalization</h3>
<p>If you find ASP canonicalization routines like <code><b><br />
&lt;%@ Language=VBScript %&gt;<br />
&lt;%<br />
if strcomp(Request.ServerVariables(&quot;SERVER_NAME&quot;), &quot;www.example.com&quot;, vbCompareText) = 0 then<br />
   Response.Clear<br />
   Response.Status = &quot;301 Moved Permanently&quot;<br />
   strNewUrl = Request.ServerVariables(&quot;URL&quot;)<br />
   if instr(1,strNewUrl, &quot;/default.asp&quot;, vbCompareText) &gt; 0 then<br />
     strNewUrl = replace(strNewUrl, &quot;/Default.asp&quot;, &quot;/&quot;)<br />
     strNewUrl = replace(strNewUrl, &quot;/default.asp&quot;, &quot;/&quot;)<br />
   end if<br />
   if Request.QueryString &lt;&gt; &quot;&quot; then<br />
       Response.AddHeader &quot;Location&quot;,&quot;http://example.com&quot; &amp; strNewUrl &amp; &quot;?&quot; &amp; Request.QueryString<br />
   else<br />
       Response.AddHeader &quot;Location&quot;,&quot;http://example.com&quot; &amp; strNewUrl<br />
   end if<br />
   Response.End<br />
end if<br />
%&gt;  </b></code><br />
(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too: <code><b><br />
RewriteEngine On<br />
RewriteCond %{HTTP_HOST} !^example\.com [NC]<br />
RewriteRule (.*) http://example.com/$1 [R=301,L] </b></code><br />
(you need mod_rewrite, that&#8217;s usually enabled with the default configuration of Apache Web servers). </p>
<h3>Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value</h3>
<p>Probably a M$ developer didn&#8217;t read more than the scheme and server name chapter of the URL/URI standards, at least I&#8217;ve no better explanation for the fact that these clowns made the path and query string segment of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)</p>
<p>Just because &#8211;contrary to Web standards&#8211; M$ finds it funny to serve the same contents on request of /Home.asp as well as /home.ASP, such crap doesn&#8217;t fly on the World Wide Web. Search engines &#8211;and other Web services which store URLs&#8211; treat them as different URLs, and consider everything except one version duplicate content.</p>
<p>Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like &#8220;/Script.asp&#8221;, although the file itself is stored with an all-lowercase name, and the FTP client uploads &#8220;/script.asp&#8221; to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won&#8217;t get 404-Not found responses. </p>
<p>Don&#8217;t misunderstand me, you&#8217;re free to camel-case file names for improved readability, but then make sure that the file system&#8217;s notation matches the URIs in HREF/SRC values. (Of course hyphened file names like &#8220;buy-cheap-viagra.asp&#8221; top the CamelCased version &#8220;BuyCheapViagra.asp&#8221; when it comes to search engine rankings, but don&#8217;t freak out about keywords in URLs, that&#8217;s ranking factor #202 or so.)</p>
<p>Technically speaking, converting all file names, variable names and values to all-lowercase is the simplest solution. This way it&#8217;s quite easy to 301-redirect all invalid requests to the canonical URLs. </p>
<p>However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (<a href="http://sphinn.com/story/16345">MSN Live Search</a> for example doesn&#8217;t follow permanent redirects and doesn&#8217;t pass the reputation earned by the old URL over to the new URL). So if you&#8217;ve good SERP positions for &#8220;misspelled&#8221; URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site&#8217;s server stats to identify all URLs you don&#8217;t want to redirect. By the way, the link reports in <a href="http://www.google.com/webmasters/tools/">Google&#8217;s Webmaster Console</a> and <a href="http://siteexplorer.search.yahoo.com/">Yahoo&#8217;s Site Explorer</a> reveal invalid URLs with (internal as well as external) inbound links too.</p>
<p>Whatever strategy fits your needs best, you have to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive: <code><b><br />
ErrorDocument 404 /404handler.php </b></code><br />
That&#8217;s safe with static URLs without parameters and should work with dynamic URIs too. When you &#8211;in some cases&#8211; deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier: <code><b><br />
&lt;IfModule mod_rewrite.c&gt;<br />
RewriteEngine On<br />
RewriteBase /<br />
RewriteCond %{REQUEST_FILENAME} !-f<br />
RewriteCond %{REQUEST_FILENAME} !-d<br />
RewriteRule . /404handler.php [L]<br />
&lt;/IfModule&gt; </b></code><br />
In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn&#8217;t exist.</p>
<p>And here is the PHP script /404handler.php:<br />
<b><a onclick="showContent('php-code-404-handler'); return false;">View</a>|<a onclick="hideContent('php-code-404-handler'); return false;">hide</a> PHP code.</b> (If you&#8217;ve disabled JavaScript you can&#8217;t grab the PHP source code!)<code id="php-code-404-handler" style="display:none;"><b><br />
&lt;?php // 404handler.php<br />
      // called from .htaccess if the requested path doesn&#8217;t exist<br />
&nbsp;<br />
$thisFileName    = &quot;404handler.php&quot;;  // change this<br />
$canonicalScheme = &quot;http://&quot;;<br />
$canonicalServer = &quot;example.com&quot;; // change this<br />
$errorPageUri    = &quot;/error.asp&quot;;  // change this<br />
$documentRoot    = $_SERVER[&quot;DOCUMENT_ROOT&quot;];<br />
$requestUri      = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$canonicalUri    = &quot;&quot;;<br />
$requestedUrl    = $canonicalScheme .$canonicalServer .$requestUri;<br />
$canonicalUrl    = &quot;&quot;;<br />
$url             = parse_url($requestedUrl);<br />
$requestPath     = $url[&quot;path&quot;];<br />
$includeScript   = &quot;&quot;;<br />
$queryString     = isset($url[&quot;query&quot;]) ? $url[&quot;query&quot;] : &quot;&quot;;  // empty when the request has no query string<br />
&nbsp;<br />
// keep misspelled URIs with nice search engine rankings<br />
if (&quot;$requestPath&quot; == &quot;/Sample.asp&quot;) {  // change this<br />
   $includeScript = $documentRoot .&quot;/sample.asp&quot;;  // change this<br />
}<br />
// &#8230;<br />
if (!empty($includeScript)) {<br />
   @header(&quot;HTTP/1.1 200 OK&quot;, TRUE, 200);<br />
   @include($includeScript);<br />
   exit;<br />
}<br />
&nbsp;<br />
// if the lowercase version exists, redirect to it<br />
$lcPath = strtolower($url[&quot;path&quot;]);<br />
$lcFile = $documentRoot .$lcPath;<br />
if (file_exists($lcFile) &#038;&#038; !stristr($requestUri,$thisFileName)) {<br />
    $canonicalUrl = $canonicalScheme .$canonicalServer .$lcPath;<br />
    if ($queryString) {<br />
        $canonicalUrl .= &quot;?&quot; .$queryString;<br />
    }<br />
    if (!empty($url[&quot;fragment&quot;])) {  // normally absent, user agents strip fragments<br />
        $canonicalUrl .= &quot;#&quot; .$url[&quot;fragment&quot;];<br />
    }<br />
}<br />
if (!empty($canonicalUrl)) {<br />
    @header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    @header(&quot;Location: $canonicalUrl&quot;);<br />
    exit;<br />
}<br />
&nbsp;<br />
// serve the 404 error page<br />
@header(&quot;HTTP/1.1 404 Not found&quot;, TRUE, 404);<br />
@include($documentRoot .$errorPageUri);<br />
exit;<br />
?&gt;   </b></code><br />
(Edit the values in all lines marked with &#8220;// change this&#8221;.)</p>
<p>This script doesn&#8217;t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you&#8217;ve more than a dozen or so. </p>
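<p>As a starting point for such site-specific routines, here is a minimal sketch of query string canonicalization. It is not part of the script above: the function name canonicalQueryString() and the list of parameters with case-insensitive values are assumptions made for illustration, and http_build_query() requires PHP 5 or later. Parameter names are always lowercased, values only where you explicitly allow it: <code><b><br />
&lt;?php<br />
// hypothetical helper, adapt it to the parameters of your site<br />
function canonicalQueryString ($queryString, $lowercaseValuesFor = array()) {<br />
    if (!is_string($queryString) || $queryString === &quot;&quot;) {<br />
        return &quot;&quot;;<br />
    }<br />
    parse_str($queryString, $params);  // decodes names and values<br />
    $canonical = array();<br />
    foreach ($params as $name =&gt; $value) {<br />
        $lcName = strtolower((string) $name);<br />
        // lowercase values only for parameters known to be case-insensitive<br />
        if (is_string($value) &amp;&amp; in_array($lcName, $lowercaseValuesFor)) {<br />
            $value = strtolower($value);<br />
        }<br />
        $canonical[$lcName] = $value;<br />
    }<br />
    return http_build_query($canonical, &quot;&quot;, &quot;&amp;&quot;);  // re-encodes names and values<br />
}<br />
// canonicalQueryString(&quot;ID=Value&amp;Page=2&quot;, array(&quot;id&quot;)) returns &quot;id=value&amp;page=2&quot;<br />
?&gt; </b></code><br />
Where to call such a routine depends on your setup; a script included at the very top of all pages is one option. Keep in mind that parse_str() replaces dots and spaces in parameter names with underscores, so test this against your real URLs before you redirect anything.</p>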
<p>Let&#8217;s see what the /404handler.php script does with requests of non-existing files. </p>
<p>First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don&#8217;t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn&#8217;t rank for anything but illustrates the functionality): both <a href="http://link-condom.com/sample.asp">/sample.asp</a> and <a href="http://link-condom.com/Sample.asp">/Sample.asp</a> deliver the same content, although there&#8217;s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordingly.</p>
<p>Next we look up the all-lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: <a href="http://link-condom.com/About.asp">/About.asp</a> 301-redirects to <a href="http://link-condom.com/about.asp">/about.asp</a>, which is the file that exists.</p>
<p>Finally, if every attempt to find a suitable URI for the actual request failed, we send the client a 404 error code and output the error page. Example: <a href="http://link-condom.com/gimme404.asp" rel="nofollow crap">/gimme404.asp</a> doesn&#8217;t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp, but <a href="http://link-condom.com/error.asp">/error.asp</a> directly requested responds with a 200-OK.</p>
<p>You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project&#8217;s needs. </p>
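<p>One such refinement, sketched under assumptions: a lookup table instead of a growing chain of if statements for the misspelled URIs you want to keep. The function name lookupKeptScript() and the array $keptUris are made up for this example; with many entries you would fill the array from a database table or a flat file: <code><b><br />
&lt;?php<br />
// hypothetical mapping: requested path =&gt; existing script<br />
$keptUris = array(<br />
    &quot;/Sample.asp&quot; =&gt; &quot;/sample.asp&quot;,  // change this<br />
);<br />
// returns the file system path of the script to include,<br />
// or an empty string if the requested path is not in the table<br />
function lookupKeptScript ($requestPath, $documentRoot, $keptUris) {<br />
    if (isset($keptUris[$requestPath])) {<br />
        return $documentRoot .$keptUris[$requestPath];<br />
    }<br />
    return &quot;&quot;;<br />
}<br />
?&gt; </b></code><br />
In /404handler.php the call <code>$includeScript = lookupKeptScript($requestPath, $documentRoot, $keptUris);</code> would then take the place of the if statement that tests for /Sample.asp. The lookup is case-sensitive on purpose, because only the exact misspelled variant should be served without a redirect.</p>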
<h3>Tweaking code for future maintenance</h3>
<p>Legacy code comes with repetition, redundancy and duplication caused by developers who love copy+paste or copy+paste+modify, or by Web design software that generates static files from templates. Even when you&#8217;re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, which gives you the opportunity to encapsulate all templated page areas. </p>
<p>Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <code>&lt;?php @include("header.php"); ?&gt;</code>, <code>&lt;?php @include("sidebar.php"); ?&gt;</code>, <code>&lt;?php @include("footer.php"); ?&gt;</code> and so on. Note that PHP parses an included file as HTML from its first line on, so every PHP statement in an include file must sit inside <code>&lt;?php ?&gt;</code> tags. Also, leading spaces, empty lines and such, which don&#8217;t hurt in HTML, can result in errors with PHP statements like header(), because those fail when the server has already sent anything to the user agent (even a single space, new line or tab is too much).</p>
<p>It&#8217;s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<code>&lt;?php @include("top.php"); ?&gt;</code> [&#8230;] <code>&lt;?php @include("bottom.php"); ?&gt;</code>) to all scripts. Tomorrow you&#8217;ll write a generic routine that you must have in all scripts, and you&#8217;ll happily do that in top.php. The day after tomorrow you&#8217;ll paste the Google Analytics tracking code into bottom.php. With complex sites you need more hooks. </p>
<h3>Using absolute URLs on different systems</h3>
<p>Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn&#8217;t work with absolute URLs. Crap. Of course it works. The first statement in top.php is <code><b><br />
@require($_SERVER[&quot;SERVER_NAME&quot;] .&quot;.php&quot;); </b></code><br />
This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you&#8217;ve a &#8220;dev.example.com.php&#8221; include file, on the production system example.com the file name resolves to &#8220;example.com.php&#8221;: <code><b><br />
&lt;?php<br />
$baseUrl = &quot;http://example.com&quot;;<br />
?&gt;  </b></code><br />
Then the menu in sidebar.php looks like: <code><b><br />
&lt;?php<br />
$classVMenu = &quot;vmenu&quot;;<br />
print &quot;<br />
&lt;img src=\&quot;$baseUrl/vmenuheader.png\&quot; width=\&quot;128\&quot; height=\&quot;16\&quot; alt=\&quot;MENU\&quot; /&gt;<br />
&lt;ul&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/\&quot;&gt;Home&lt;/a&gt;&lt;/li&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/contact.asp\&quot;&gt;Contact&lt;/a&gt;&lt;/li&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/sitemap.asp\&quot;&gt;Sitemap&lt;/a&gt;&lt;/li&gt;<br />
&#8230;<br />
&lt;/ul&gt;<br />
&quot;;<br />
?&gt; </b></code><br />
Mixing X/HTML with server-side scripting languages is error-prone and makes maintenance a nightmare. Don&#8217;t make the same mistake as WordPress. Avoid crap like that: <code><br />
&lt;li&gt;&lt;a class=&quot;&lt;?php print $classVMenu; ?&gt;&quot; href=&quot;&lt;?php print $baseUrl; ?&gt;/contact.asp&quot;&gt;&lt;/a&gt;&lt;/li&gt; </code></p>
<h3>Error handling</h3>
<p>I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root&#8217;s .htaccess file: <code><b><br />
ErrorDocument 401 /get-the-fuck-outta-here.asp<br />
ErrorDocument 403 /get-the-fudge-outta-here.asp<br />
ErrorDocument 404 /404handler.php<br />
ErrorDocument 410 /410-gone-forever.asp<br />
ErrorDocument 503 /503-down-for-maintenance.asp<br />
# &#8230;<br />
Options -Indexes </b></code><br />
Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script: <code><b><br />
ErrorDocument 401 /error.php?errno=401<br />
ErrorDocument 403 /error.php?errno=403<br />
ErrorDocument 404 /404handler.php<br />
ErrorDocument 410 /error.php?errno=410<br />
ErrorDocument 503 /error.php?errno=503<br />
# &#8230;<br />
Options -Indexes </b></code><br />
Note that relative URLs in pages or scripts called by ErrorDocument directives don&#8217;t work. <b>Don&#8217;t use absolute URLs (http://example.com/&#8230;) in the ErrorDocument directives themselves, because this way you get 302 response codes for 404 errors and crap like that.</b> If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that&#8217;s bad enough.) For more information please read my pamphlet <a href="http://sebastians-pamphlets.com/why-proper-error-handling-is-important/">Why error handling is important</a>. </p>
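<p>A script like /error.php is not shown in this post, so the following is only a minimal sketch of what it could look like. The errno parameter comes from the ErrorDocument lines above; the array of reason phrases, the fallback to 404 for unknown values, and the bare-bones HTML output are assumptions: <code><b><br />
&lt;?php // error.php (hypothetical sketch)<br />
$reasons = array(<br />
    401 =&gt; &quot;Unauthorized&quot;,<br />
    403 =&gt; &quot;Forbidden&quot;,<br />
    404 =&gt; &quot;Not Found&quot;,<br />
    410 =&gt; &quot;Gone&quot;,<br />
    503 =&gt; &quot;Service Unavailable&quot;,<br />
);<br />
$errno = isset($_GET[&quot;errno&quot;]) ? (int) $_GET[&quot;errno&quot;] : 0;<br />
if (!isset($reasons[$errno])) {<br />
    $errno = 404;  // unknown or missing errno, e.g. a direct request<br />
}<br />
$statusLine = $errno .&quot; &quot; .$reasons[$errno];<br />
// send the status line before any output<br />
@header(&quot;HTTP/1.1 &quot; .$statusLine, TRUE, $errno);<br />
print &quot;&lt;h1&gt;$statusLine&lt;/h1&gt;\n&quot;;<br />
print &quot;&lt;p&gt;Try the &lt;a href=\&quot;http://example.com/\&quot;&gt;home page&lt;/a&gt; instead.&lt;/p&gt;\n&quot;;<br />
?&gt; </b></code><br />
Sending the status line explicitly makes sure that the error page never goes out with a 200 response code, no matter how the script was requested. A real error page should of course explain the problem and offer more alternatives than a single link.</p>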
<p>Last but not least create a robots.txt file in the root. If you&#8217;ve nothing to hide from search engine crawlers, this one will suffice: <code><b><br />
User-agent: *<br />
Disallow:<br />
Allow: /<br />
</b></code></p>
<p>I&#8217;m aware that this tiny guide can&#8217;t cover everything. It should give you an idea of the pitfalls and possible solutions. If you&#8217;re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don&#8217;t view the source code of <a href="http://link-condom.com/">link-condom.com</a> pages where I didn&#8217;t implement all tips from this tutorial. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Act out your sophisticated affiliate link paranoia</title>
		<link>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/</link>
		<comments>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#comments</comments>
		<pubDate>Tue, 13 Nov 2007 07:09:30 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[E-Commerce]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/</guid>
		<description><![CDATA[
My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google&#8217;s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:
Links purchased for advertising should be designated [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/paranoid-affiliate-link.png" width="250" height="231" border="0" align="right" style="margin-left:4px;" alt="GOOD: paranoid affiliate link" title="Paranoid on affiliate links" />My recent posts on <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">managing affiliate links</a> and <a href="http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/">nofollow cloaking</a> <a href="http://sebastians-pamphlets.com/text-link-broker-woes-smart-paid-links-sniffers-fromgoogle/">paid links</a> led to so many reactions from my readers that I thought explaining possible protection levels could make sense. <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66736">Google&#8217;s request to condomize affiliate links</a> is a bit, well, thin when it comes to technical tips and tricks:<br />
<blockquote>Links purchased for advertising should be designated as such. This can be done in several ways, such as:<br />
    * Adding a rel=&quot;nofollow&quot; attribute to the &lt;a&gt; tag<br />
    * Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file</p></blockquote>
<p> Also, Google doesn&#8217;t define <a href="http://sebastians-pamphlets.com/links/categories/?cat=paid-links">paid links</a> that clearly, so try this <a href="http://www.stonetemple.com/blog/?p=196">paid link definition</a> instead before you read on. <b>Here is my linking guide for the paranoid affiliate marketer.</b></p>
<p><a href="http://www.google.com/support/webmasters/bin/answer.py?answer=76465">Google recommends hiding any content provided by affiliate programs from their crawlers</a>. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant&#8217;s data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program&#8217;s sales tools page count as duplicate content (snippets) of the worst kind.</p>
<p>Pasting code copied from a merchant&#8217;s site into a page&#8217;s or template&#8217;s HTML is not exactly a smart way to put ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you&#8217;re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts. </p>
<p>There are tons of such scripts out there, so I don&#8217;t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so when you&#8217;ve a WordPress blog you can adapt them with ease. </p>
<h3>Cover your ass with a linking policy</h3>
<p>Googlers as well as hired guns do review Web sites for violations of Google&#8217;s guidelines, and competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) <a href="http://sebastians-pamphlets.com/links/full-disclosure/">full disclosure of your linking attitude</a> can help to pass a human review by search engine staff. By the way, having a <a href="http://sebastians-pamphlets.com/about/policies/#commenting">policy for dofollowed blog comments</a> is also a good idea.</p>
<p>Since crawler directives like <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">link condoms</a> are for search engines (only), and those pay attention to your source code and hints addressing search engines like <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, you should leave a note <a href="http://sebastians-pamphlets.com/robots.txt" rel="nofollow nocontent">there</a> too, look into the source of this page for an example. <a onclick="showContent('sample-code-disclosure'); this.style.display = 'none'; return false;">View sample HTML comment.</a> <b id="sample-code-disclosure" style="display:none;">Sample HTML comment: <code>&lt;&#33;--</code>This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.<code>--&gt;</code></b> </p>
<h3>Block crawlers from your propaganda scripts</h3>
<p>Put all your stuff related to advertising (scripts, images, movies&#8230;) in a subdirectory and disallow search engine crawling in your <a href="http://www.smart-it-consulting.com/article.htm?node=140&#038;page=46">/robots.txt</a> file: <code><br />
User-agent: *<br />
Disallow: /propaganda/ </code><br />
Of course you&#8217;ll use an innocuous name like &#8220;gnisitrevda&#8221; for this folder, which lacks a default document and can&#8217;t get browsed because you&#8217;ve a <code><br />
Options -Indexes </code><br />
statement in your .htaccess file. (Watch out, Google knows what &#8220;gnisitrevda&#8221; means, so be creative or cryptic.)</p>
<p>Crawlers sent out by major search engines do respect robots.txt, hence it&#8217;s guaranteed that regular spiders don&#8217;t fetch anything from this folder. As long as you don&#8217;t cheat too much, you&#8217;re not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn&#8217;t protect you from surfing search engine staff, but I don&#8217;t tell you things you&#8217;d better hide from Matt&#8217;s gang.</p>
<h3>Detect search engine crawlers</h3>
<p>Basically there are three common methods to detect requests by search engine crawlers.</p>
<ol>
<li>Testing the user agent name (HTTP_USER_AGENT) for strings like &#8220;Googlebot&#8221;, &#8220;Slurp&#8221;, &#8220;MSNbot&#8221; or so which identify crawlers. That&#8217;s easy to spoof, for example <a href="http://sebastians-pamphlets.com/referrer-spoofing-with-prefbar-341/">PrefBar for Firefox</a> lets you choose from a list of user agents.</li>
<li>Checking the user agent name, and only when it indicates a crawler, verifying the requestor&#8217;s IP address with a reverse lookup or against a cache of verified crawler IP addresses and host names.</li>
<li>Maintaining a list of all search engine crawler IP addresses known to man, and checking the requestor&#8217;s IP (REMOTE_ADDR) against this list. (That alone isn&#8217;t bullet-proof, but I&#8217;m not going to write a tutorial on industrial-strength <strike>cloaking</strike> IP delivery, I leave that to the real <a href="http://fantomaster.com/fantomNews">experts</a>.)</li>
</ol>
<p>For our purposes we use methods 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won&#8217;t activate your account without testing your links. When crawlers try to follow affiliate links on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have an impact on the merchants&#8217; search engine rankings, and that&#8217;s evil in Google&#8217;s eyes.  </p>
<p>We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name, if the user agent name identifies a major search engine&#8217;s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor&#8217;s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does a primitive caching in a flat file, so that once a crawler was verified on its very first content request, it can be detected from this cache to avoid pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn&#8217;t match a crawler name. </p>
<p><b id="grab-php-code-check-crawler"><a onclick="showContent('php-code-check-crawler'); return false;">View</a>|<a onclick="hideContent('php-code-check-crawler'); return false;">hide</a> PHP code.</b> (If you&#8217;ve disabled JavaScript you can&#8217;t grab the PHP source code!)<br />
<code id="php-code-check-crawler" style="display:none;"><b><br />
// file system path to crawler IP log, scripts etc.,<br />
// without trailing slash:<br />
$includePath   = $_SERVER[&quot;DOCUMENT_ROOT&quot;] . &quot;/propaganda&quot;;<br />
// edit &quot;propaganda&quot; and CHMOD 777 the directory !<br />
// file names:<br />
$crawlerIps  = $includePath .&quot;/crawler-ip-addresses.txt&quot;;<br />
// misc. stuff:<br />
$timestamp     = date(&quot;Y-m-d H:i:s&quot;);<br />
$ipAddy        = $_SERVER[&quot;REMOTE_ADDR&quot;];<br />
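// not every request carries a referrer or a user agent header<br />
$referrer      = isset($_SERVER[&quot;HTTP_REFERER&quot;]) ? $_SERVER[&quot;HTTP_REFERER&quot;] : &quot;&quot;;<br />
$userAgent     = isset($_SERVER[&quot;HTTP_USER_AGENT&quot;]) ? $_SERVER[&quot;HTTP_USER_AGENT&quot;] : &quot;&quot;;<br />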
$requestUri    = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$queryString   = $_SERVER[&quot;QUERY_STRING&quot;];<br />
$isCrawler     = FALSE;<br />
$crawlerServer = &quot;&quot;;<br />
$delimiter     = &quot;|&quot;;<br />
$idString      = &quot;&quot;;<br />
if (empty($includePath)) {<br />
   $includePath = $_SERVER[&quot;DOCUMENT_ROOT&quot;] . &quot;/propaganda&quot;; // CHMOD 777<br />
}<br />
// Write a file to disk<br />
if (!function_exists(&quot;writeLocalFile&quot;)) {<br />
function writeLocalFile ($file, $content) {<br />
   if (!is_writable($file)) {<br />
      @chmod ( $file, 0777 );<br />
   }<br />
   // file_put_contents() not avail in PHP 4.3x<br />
   $fp = @fopen($file, &quot;w+&quot;);<br />
   if ($fp) {<br />
       $lOk = @fwrite($fp, $content, strlen($content));<br />
       @fclose($fp);<br />
       // make sure file may get overwritten or removed later on<br />
       @chmod ( $file, 0777 );<br />
       // report a failure if nothing could be written<br />
       return ($lOk !== FALSE);<br />
   } // endif $fp<br />
   return FALSE;<br />
} // end function writeLocalFile<br />
}<br />
if (!function_exists(&quot;checkCrawlerUA&quot;)) {<br />
function checkCrawlerUA () {<br />
    GLOBAL $userAgent;<br />
    GLOBAL $crawlerServer;<br />
    $crawlerServer = &quot;&quot;;<br />
    $crawlers  = array(&quot;Googlebot&quot;,&quot;Mediapartners&quot;,&quot;Slurp&quot;,&quot;MSNbot&quot;,&quot;Ask&quot;,&quot;Teoma&quot;);<br />
    foreach ($crawlers as $crawler) {<br />
        if (stristr($userAgent,$crawler)) {<br />
            if (stristr($crawler,&quot;Googlebot&quot;) ||<br />
                stristr($crawler,&quot;Mediapartners&quot;)) {<br />
                $crawlerServer = &quot;.googlebot.com&quot;;<br />
            } // Google<br />
            if (stristr($crawler,&quot;Slurp&quot;)) {<br />
                $crawlerServer = &quot;.crawl.yahoo.net&quot;;<br />
            } // Yahoo<br />
            if (stristr($crawler,&quot;MSNbot&quot;)) {<br />
                $crawlerServer = &quot;.search.live.com&quot;;<br />
            } // MSN/Live<br />
            if (stristr($crawler,&quot;Ask&quot;) ||<br />
                stristr($crawler,&quot;Teoma&quot;)) {<br />
                $crawlerServer = &quot;.ask.com&quot;;<br />
            } // Ask<br />
        }<br />
    } // foreach crawlers<br />
    if (!empty($crawlerServer)) return TRUE;<br />
    return FALSE;<br />
} // end function checkCrawlerUA<br />
}<br />
if (!function_exists(&quot;checkCrawlerIP&quot;)) {<br />
function checkCrawlerIP ($idString) {<br />
    GLOBAL $ipAddy;<br />
    GLOBAL $crawlerIps;<br />
    GLOBAL $delimiter;<br />
    GLOBAL $timestamp;<br />
    GLOBAL $userAgent;<br />
    GLOBAL $crawlerServer;<br />
    $isCrawler = checkCrawlerUA();<br />
    if ($isCrawler === FALSE)  return FALSE;<br />
    if (empty($crawlerServer)) return FALSE;<br />
//<br />
// DEBUG: $crawlerServer = &quot;.national-net.com&quot;;<br />
// Use your ISPs host name for testing with a spoofed user agent name<br />
//<br />
    $crawlerIpsContent = @file_get_contents($crawlerIps);<br />
    if (!empty($crawlerIpsContent)) {<br />
        if (stristr($crawlerIpsContent, &quot;\n$ipAddy$delimiter&quot;)) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    $crawlerHost = @gethostbyaddr($ipAddy);<br />
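    // the expected domain must terminate the host name,<br />
    // a match somewhere in the middle can be faked<br />
    if (strtolower(substr($crawlerHost, -strlen($crawlerServer))) != strtolower($crawlerServer)) {<br />
        return FALSE;<br />
    }<br />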
    if (&quot;$crawlerHost&quot; == &quot;$ipAddy&quot;) {<br />
        return FALSE;<br />
    }<br />
    $ipAddyRev = @gethostbyname($crawlerHost);<br />
    if (&quot;$ipAddyRev&quot; != &quot;$ipAddy&quot;) {<br />
        return FALSE;<br />
    }<br />
    $crawlerIpsContent .= &quot;\n&quot; .$ipAddy .$delimiter<br />
                          .$timestamp   .$delimiter<br />
                          .$crawlerHost .$delimiter<br />
                          .$idString    .$delimiter<br />
                          .$userAgent   .$delimiter;<br />
    $lOk = writeLocalFile ($crawlerIps, $crawlerIpsContent);<br />
    return TRUE;<br />
} // end function checkCrawlerIP<br />
}<br />
</b></code><br />
Grab and implement the PHP source, then you can code statements like <code><br />
$isSpider = checkCrawlerUA ();<br />
$relAttribute = &quot;&quot;; // stays empty for human visitors<br />
...<br />
if ($isSpider) {<br />
    $relAttribute = &quot; rel=\&quot;nofollow\&quot; &quot;;<br />
}<br />
...<br />
$affLink = &quot;&lt;a href=\&quot;$affUrl\&quot; $relAttribute&gt;call for action&lt;/a&gt;&quot;;<br />
</code><br />
or <code><br />
$isSpider = checkCrawlerIP ($sponsorUrl);<br />
...<br />
if ($isSpider) {<br />
    // don't redirect to the sponsor, return a 403 or 410 instead<br />
}</code><br />
More on that later.</p>
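<p>The verification step boils down to a forward-confirmed reverse DNS lookup. Here is a minimal sketch of that idea in isolation, assuming PHP 5.3 or later for the closures in the usage part. The function name verifyCrawlerHost() and its two resolver parameters are hypothetical and not part of the script above; they only exist so that the logic can be tried with fake DNS answers.</p>

```php
<?php
// Sketch: forward-confirmed reverse DNS with a strict host name suffix check.
// verifyCrawlerHost() and the resolver parameters are hypothetical names.

/**
 * @param string   $ip      requestor's IP address (IPv4)
 * @param string   $suffix  expected host name suffix, e.g. ".googlebot.com"
 * @param callable $reverse IP to host name, defaults to gethostbyaddr()
 * @param callable $forward host name to IP, defaults to gethostbyname()
 * @return bool
 */
function verifyCrawlerHost($ip, $suffix, $reverse = 'gethostbyaddr', $forward = 'gethostbyname')
{
    $host = call_user_func($reverse, $ip);
    // gethostbyaddr() returns the unmodified IP on failure, FALSE on bad input
    if ($host === false || $host === $ip) {
        return false;
    }
    $host   = strtolower($host);
    $suffix = strtolower($suffix);
    // the suffix must terminate the host name, not merely appear inside it
    if (strlen($host) <= strlen($suffix)
        || substr($host, -strlen($suffix)) !== $suffix) {
        return false;
    }
    // the forward lookup must lead back to the very same IP
    return call_user_func($forward, $host) === $ip;
}

// Usage with fake resolvers: a spoofer controls the reverse DNS of his own IP
$fakeReverse = function ($ip)   { return 'crawl-1.googlebot.com.evil.example'; };
$fakeForward = function ($host) { return '192.0.2.10'; };
var_dump(verifyCrawlerHost('192.0.2.10', '.googlebot.com', $fakeReverse, $fakeForward)); // bool(false)
```

<p>With the default resolvers the function performs real DNS lookups, so the flat file cache described above remains a good idea.</p>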
<h3>Don&#8217;t deliver your advertising to search engine crawlers</h3>
<p>It&#8217;s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense&#8217;s script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can&#8217;t guarantee indexing, because Google doesn&#8217;t index shitloads of templated pages which hide duplicate content as well as ads from crawling, without carrying a single piece of somewhat compelling content.</p>
<p>Here is how you could output a totally uncrawlable banner ad: <code><br />
...<br />
$isSpider = checkCrawlerIP ($_SERVER[&quot;PHP_SELF&quot;]);<br />
...<br />
print &quot;&lt;div class=\&quot;css-class-sidebar robots-nocontent\&quot;&gt;&quot;;<br />
// output RSS buttons or so<br />
if (!$isSpider) {<br />
    print &quot;&lt;script type=\&quot;text/javascript\&quot; src=\&quot;http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&#038;adServed=banner\&quot;&gt;&lt;/script&gt;&quot;;<br />
    ...<br />
}<br />
...<br />
print &quot;&lt;/div&gt;\n&quot;;<br />
...</code><br />
Let&#8217;s look at the code above. First we detect crawlers &#8220;without doubt&#8221; (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-&#8217;.crawl.yahoo.net&#8217; host but another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I&#8217;ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler-UA).</p>
<p>Currently the <em>robots-nocontent</em>&nbsp; class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn&#8217;t conflict with class names used with your CSS, because each X/HTML element can have an unlimited list of space delimited class names. Like Google&#8217;s section targeting that&#8217;s a <a href="http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/">crappy crawler directive</a>, though. However, it doesn&#8217;t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), &#8220;buy now&#8221;/&#8220;view basket&#8221; links or references to TOS pages and the like, templated text like terms of delivery (but not the street address provided for local search) &#8230; and of course ads.</p>
<p>Ads aren&#8217;t output when a crawler requests a page. Of course that&#8217;s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren&#8217;t relevant for indexing that&#8217;s not <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66355">deceitful cloaking</a> in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster&#8217;s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers.</p>
<p>Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren&#8217;t penalized. Since Web robots can&#8217;t pull out the plastic, there&#8217;s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Example from Google</a>: &#8220;Allow search bots to crawl your sites without session IDs or arguments that track their path through the site&#8221; is a call for <a href="http://www.smart-it-consulting.com/article.htm?node=148&#038;page=103">search engine friendly URL cloaking</a>.) </p>
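<p>To illustrate the tracking variable case: a minimal sketch that removes session IDs and tracking arguments from a URL, so that a verified crawler could be redirected to (or served links with) the clean version. The function name stripTrackingParams(), the parameter names and the example URL are made up for illustration, and the list of variables to drop depends entirely on your own site.</p>

```php
<?php
// Sketch: drop session/tracking variables from a URL before a crawler sees it.
// stripTrackingParams() is a hypothetical helper, not taken from any library.
function stripTrackingParams($url, array $trackingNames)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // nothing to clean
    }
    // note: parse_str() converts dots and spaces in variable names to underscores
    parse_str($parts['query'], $params);
    foreach ($trackingNames as $name) {
        unset($params[$name]);
    }
    // everything before the first question mark stays untouched
    $clean = substr($url, 0, strpos($url, '?'));
    if (!empty($params)) {
        $clean .= '?' . http_build_query($params, '', '&');
    }
    if (isset($parts['fragment'])) {
        $clean .= '#' . $parts['fragment'];
    }
    return $clean;
}

// Usage: the crawler gets the URL without session ID and tracking code
echo stripTrackingParams(
    'http://example.com/page.php?id=7&PHPSESSID=abc&ref=feed',
    array('PHPSESSID', 'ref')
), "\n"; // http://example.com/page.php?id=7
```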
<h3>Is hiding ads from crawlers &#8220;safe with Google&#8221; or not?</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/uncloaked-affiliate-link.png" width="200" height="188" border="0" align="right" style="margin-left:4px;" alt="BAD: uncloaked affiliate link" title="Uncloaked affiliate link" />Cloaking ads away is a double edged sword from a search engine&#8217;s perspective. Way too strictly interpreted that&#8217;s against the cloaking rule which states &#8220;don&#8217;t show crawlers other content than humans&#8221;, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google&#8217;s section targeting or Yahoo&#8217;s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don&#8217;t count them as on-the-page contents.</p>
<p>Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don&#8217;t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify <acronym title="Made For AdSense/Advertising">MFA</acronym> pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings. </p>
<p>Since in theory the contents of JavaScript ads aren&#8217;t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can&#8217;t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can&#8217;t argue seriously that hiding such tiny signals &#8211;which they use for the sole purposes of possible downranks&#8211; is against their guidelines.</p>
<p>For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn&#8217;t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.</p>
<p><b>Dear search engines, if it does not count, then you cannot demand to keep it crawlable.</b> Well, a few super mega white hat <acronym title="Dougie ...">trolls</acronym> might disagree, and depending on the implementation on individual sites maybe hiding ads isn&#8217;t totally riskless in every case, so decide for yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don&#8217;t try to hide the fact that I run ads on this blog.</p>
<p>Usually I don&#8217;t argue with fair vs. unfair, because we talk about <strike>war</strike> business here, which means that anything goes. However, Google does everything to talk the whole Internet into <strike>obfuscating</strike> disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> (Of course that&#8217;s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won&#8217;t care about bad press. So please don&#8217;t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)</p>
<h3>Outputting ads with JavaScript, preferably in iFrames</h3>
<p>Delivering adverts with JavaScript does not mean that one can&#8217;t use server sided scripting to adjust them dynamically. With content management systems it&#8217;s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn&#8217;t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval&#8217;ing the PHP code).</p>
<p>Let&#8217;s see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:<br />
<script type="text/javascript" src="http://sebastians-pamphlets.com/ads/output.js.php?adName=seobook&#038;adServed=banner"></script><br />
And here is the JS call of the PHP script: <code><br />
&lt;script type=&quot;text/javascript&quot; src=&quot;http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&#038;adServed=banner&quot;&gt;&lt;/script&gt;</code></p>
<p>The PHP script <code>/propaganda/output.js.php</code> evaluates the query string to pull the requested ad&#8217;s components. In case it&#8217;s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor&#8217;s location and whatnot, but that&#8217;s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad&#8217;s links. Once the HTML code is ready, it outputs a JavaScript statement: <code><br />
document.write(&#039;&lt;a href=&quot;http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&#038;adServed=banner&quot; title=&quot;DOWNLOAD THE BOOK ON SEO!&quot;&gt;&lt;img src=&quot;http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif&quot; width=&quot;468&quot; height=&quot;60&quot; border=&quot;0&quot; alt=&quot;The only current book on SEO&quot; title=&quot;The only current book on SEO&quot;  /&gt;&lt;/a&gt;&#039;); </code> which the browser executes within the <code>script</code> tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag. </p>
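<p>Escaping quotes by hand in HTML that ends up inside a JavaScript string is error-prone. Here is a minimal sketch of an alternative, assuming PHP 5.3 or later where json_encode() accepts option flags. buildAdScript() is a hypothetical helper name and the ad markup is a made-up example, not the code that actually runs in output.js.php.</p>

```php
<?php
// Sketch: let json_encode() produce a safely quoted JavaScript string literal.
// buildAdScript() is a hypothetical helper name.
function buildAdScript($html)
{
    // JSON_HEX_TAG converts < and > to \u003C and \u003E, so a literal
    // "</script>" inside the ad cannot terminate the surrounding script block
    return 'document.write(' . json_encode($html, JSON_HEX_TAG) . ');';
}

// Usage: single and double quotes in the ad's HTML need no manual treatment
$adHtml = '<a href="http://example.com/router.php?adName=demo" title="It\'s a demo">demo ad</a>';
// a real script would also send a JavaScript content type header first
echo buildAdScript($adHtml), "\n";
```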
<p>Matt Cutts <a href="http://www.stonetemple.com/articles/interview-matt-cutts.shtml">said</a> that <a href="http://www.mattcutts.com/blog/bot-obedience-herding-googlebot/#comment-45561">JavaScript links don&#8217;t prevent Googlebot from crawling</a>, but that <a href="http://www.seomoz.org/blog/the-paid-links-debate-rages-on-ses-san-jose-2007">those links</a> <a href="http://www.mattcutts.com/blog/how-to-report-paid-links/#comment-101482">don&#8217;t count for rankings</a> (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can&#8217;t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it&#8217;s fetchable by crawlers, so I wouldn&#8217;t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That&#8217;s why I use rel-nofollow &#8211;on crawler requests&#8211; even in JS ads.</p>
<p>Change your user agent name to Googlebot or so, install <a href="http://www.mattcutts.com/blog/seeing-nofollow-links/">Matt&#8217;s show nofollow hack</a> or something similar, and you&#8217;ll see that the affiliate-URL gets nofollow&#8217;ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can&#8217;t offend my readers with nofollow&#8217;ed links (although I don&#8217;t care about crawler spoofing, actually that&#8217;s a good procedure to let advertisers check out my linking attitude).</p>
<p>We look at the affiliate URL from the code above later on, first let&#8217;s discuss other ways to make ads more search engine friendly. Search engines don&#8217;t count pages displayed in iFrames as on-page contents, especially not when the iFrame&#8217;s content is hosted on another domain. Here is an example straight from the horse&#8217;s mouth: <code><br />
&lt;iframe name=&quot;google_ads_frame&quot; src=&quot;http://pagead2.googlesyndication.com/pagead/ads? very-long-and-ugly-query-string&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; vspace=&quot;0&quot; hspace=&quot;0&quot; allowtransparency=&quot;true&quot; frameborder=&quot;0&quot; height=&quot;90&quot; scrolling=&quot;no&quot; width=&quot;728&quot;&gt;&lt;/iframe&gt;</code> In a noframes tag we could put a static ad for surfers using browsers which don&#8217;t support frames/iFrames. </p>
<p>If for some reason you don&#8217;t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to everybody, and only browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael&#8217;s <a href="http://www.bad-neighborhood.com/htmlhashing.htm">Full Text/HTML Obfuscator Tool</a> (hat tip to <a href="http://www.seo-scoop.com/2007/09/13/new-tool-to-hide-stuff/">Donna</a>).</p>
<h3>Always redirect to affiliate URLs</h3>
<p>There&#8217;s absolutely no point in using ugly affiliate URLs on your pages. Actually, that&#8217;s the last thing you want to do for various reasons.
<ul>
<li>For example, affiliate URLs as well as source codes can change, and you don&#8217;t want to edit tons of pages if that happens.</li>
<li>When an affiliate program doesn&#8217;t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you&#8217;d replace outdated ads completely with one mouse click or so.</li>
<li>Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own &#8211;often confusing&#8211; layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don&#8217;t track your outgoing traffic, some sponsors will cheat and you can&#8217;t prove your gut feelings.</li>
<li>Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don&#8217;t smell like affiliate URLs.</li>
<li><b>&#8230;</b></li>
</ul>
<p>When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really <b>manage your affiliate links</b>. There are many more good reasons for utilizing ad-servers, for example smart search engines which might think that your advertising is overwhelming.</p>
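<p>What &#8220;managing&#8221; can mean in practice: a minimal sketch of a central lookup table that maps ad names to sponsor URLs and falls back to an alternative when an ad has expired or is unknown. The function name resolveSponsorUrl(), the array layout and all URLs and dates are assumptions for illustration only.</p>

```php
<?php
// Sketch: one central table maps ad names to sponsor URLs.
// resolveSponsorUrl() and the table layout are hypothetical.
function resolveSponsorUrl(array $ads, $adName, $fallbackUrl, $now = null)
{
    $now = ($now === null) ? time() : $now;
    if (!isset($ads[$adName])) {
        return $fallbackUrl; // unknown ad name
    }
    $ad = $ads[$adName];
    // an ad without an "expires" entry never expires
    if (isset($ad['expires']) && strtotime($ad['expires']) <= $now) {
        return isset($ad['alternative']) ? $ad['alternative'] : $fallbackUrl;
    }
    return $ad['url'];
}

// Made-up table: editing one entry here reroutes every link on every page
$ads = array(
    'demo-book' => array('url' => 'http://merchant.example/?aff=123'),
    'demo-conf' => array(
        'url'         => 'http://conference.example/?ref=abc',
        'expires'     => '2008-01-01',
        'alternative' => 'http://merchant.example/?aff=123',
    ),
);

// Usage: the conference promotion is over, so the click goes to the alternative
echo resolveSponsorUrl($ads, 'demo-conf', 'http://example.com/', strtotime('2009-06-01')), "\n";
// http://merchant.example/?aff=123
```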
<p>Affiliate links provide great footprints. Unique URL parts or <b>query string variable names</b> gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The <b>values</b> identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index or delivering minus points to the query engine. Also, that doesn&#8217;t mean that affiliate URLs are the one and only trackable footmark Google relies on. But that&#8217;s one trackable footprint you can avoid to some degree. </p>
<p>If the redirect-script&#8217;s location is on the same server (in fact it&#8217;s not thanks to symlinks) and not named &#8220;adserver&#8221; or so, chances are that a heuristic check won&#8217;t identify the link&#8217;s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren&#8217;t crawlable with parameters, or similar &#8220;legit&#8221; stuff. However, you can&#8217;t fool smart algos forever, but if you&#8217;ve a good reason to hide ads every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine&#8217;s point of view, and users might agree on this).</p>
<p>Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because it doesn&#8217;t benefit you (<a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">usually</a>). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant&#8217;s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect, if they by accident found an affiliate link without link condom.</p>
<h3>The behavior of an adserver URL masking an affiliate link</h3>
<p>Let&#8217;s look at the redirect-script&#8217;s URL from my code example above:<br />
<a href="http://sebastians-pamphlets.com/ads/router.php?adName=seobook&#038;adServed=banner">/propaganda/router.php?adName=seobook&#038;adServed=banner</a><br />
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert. </p>
<p>Now that we&#8217;ve covered the statistical requirements, router.php calls the checkCrawlerIP() function, which sets $isSpider to TRUE only when the user agent as well as the host name of the requestor&#8217;s IP address (found by a reverse DNS lookup) identify a search engine crawler, and a forward DNS lookup of that host name returns the requestor&#8217;s IP addy.</p>
<p>If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor&#8217;s landing page: <code><br />
$sponsorUrl      = &quot;http://www.seobook.com/262.html&quot;;<br />
$requestProtocol = $_SERVER[&quot;SERVER_PROTOCOL&quot;];<br />
$protocolArr     = explode(&quot;/&quot;,$requestProtocol);<br />
$protocolName    = trim($protocolArr[0]);<br />
$protocolVersion = trim($protocolArr[1]);<br />
if (stristr($protocolName,&quot;HTTP&quot;)<br />
    &#038;&#038; strtolower($protocolVersion) > &quot;1.0&quot; ) {<br />
    $httpStatusCode = 307;<br />
}<br />
else {<br />
    $httpStatusCode = 302;<br />
}<br />
$httpStatusLine = &quot;$requestProtocol $httpStatusCode Temporary Redirect&quot;;<br />
@header($httpStatusLine, TRUE, $httpStatusCode);<br />
@header(&quot;Location: $sponsorUrl&quot;);<br />
exit;</code><br />
A 307 redirect avoids caching issues, because a 307 response may only be cached when a Cache-Control or Expires header explicitly allows it. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. Some user agents cache the locations provided by 302 redirects, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can&#8217;t use the 301 response code, because 301 redirects are cacheable by default. (<a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/">More information on HTTP redirects</a>.)</p>
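<p>The choice between 307 and 302 can be isolated in a small helper. Here is a minimal sketch that parses the protocol string and compares versions with version_compare() instead of a string comparison, and that returns the standard reason phrase for each code (&#8220;Found&#8221; for 302). The function name redirectStatusFor() is hypothetical and not part of router.php.</p>

```php
<?php
// Sketch: pick 307 for HTTP/1.1 and later clients, 302 otherwise.
// redirectStatusFor() is a hypothetical helper name.
function redirectStatusFor($serverProtocol)
{
    // expected input looks like "HTTP/1.1"; anything else is treated as old
    if (preg_match('#^HTTP/(\d+(?:\.\d+)?)$#i', trim($serverProtocol), $m)
        && version_compare($m[1], '1.1', '>=')) {
        return array(307, 'Temporary Redirect');
    }
    return array(302, 'Found');
}

// Usage: an HTTP/1.0 request gets the 302
list($code, $reason) = redirectStatusFor('HTTP/1.0');
echo "$code $reason\n"; // 302 Found
```

<p>In a redirect script the pair would feed the status line, followed by the Location header and exit, just as in the code above.</p>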
<p>If the requestor is a major search engine&#8217;s crawler, we perform the most brutal bounce back known to man: <code><br />
if ($isSpider) {<br />
    @header(&quot;HTTP/1.1 403 Sorry Crawlers Not Allowed&quot;, TRUE, 403);<br />
    @header(&quot;X-Robots-Tag: nofollow,noindex,noarchive&quot;);<br />
    exit;<br />
}</code><br />
The 403 response code translates to &#8220;kiss my ass and get the fuck outta here&#8221;. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn&#8217;t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine &#8220;forget this URL, don&#8217;t request it again&#8221;. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code prevent URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.</p>
<p>If you don&#8217;t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they otherwise would count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don&#8217;t need to live in fear of penalties.</p>
<h3 id="recap-hide-afflinks">Recap</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/prospering-affiliate-link.png" width="200" height="200" border="0" align="right" style="margin-left:4px;" alt="NICE: prospering affiliate link" title="Prospering affiliate link" />Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.</p>
<p>Don&#8217;t deliver different (editorial) contents to users and crawlers, but also don&#8217;t serve ads to crawlers. They just don&#8217;t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards able to understand Ajax, respectively authorized to fill out and submit Web forms.</p>
<p>Your ads look plain ugly with dotted borders in firebrick, hence don&#8217;t apply rel=&#8221;nofollow&#8221; to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.</p>
<p>No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers. </p>
<p>Do not allow search engine crawlers to follow your affiliate links, paid links, nor other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google&#8217;s job is not exactly funny.</p>
<p>I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you&#8217;re not paranoid, that does not mean Google will not chase you down. You really don&#8217;t need to be a so called black hat to displease Google. Not knowing or not understanding <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Google&#8217;s 12 commandments</a> doesn&#8217;t prevent you from being spanked for sins you&#8217;ve never heard of. If you&#8217;re keen on Google&#8217;s nicely targeted traffic, better play by Google&#8217;s rules, leastwise on crawler requests.</p>
<p>Feel free to contribute your tips and tricks in the comments.</p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>.
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
