<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; Cloaking</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Wed, 11 Aug 2010 18:57:05 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Cloaking is good for you. Just ignore Bing&#8217;s/Google&#8217;s guidelines.</title>
		<link>http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/</link>
		<comments>http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/#comments</comments>
		<pubDate>Mon, 05 Jul 2010 18:24:08 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Usability]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/</guid>
		<description><![CDATA[
Summary first: If you feel the need to cloak, just do it within reason. Don&#8217;t cloak because you can, but because it&#8217;s technically the most elegant procedure to accomplish a Web development task. Bing and Google can&#8217;t detect your (in no way deceptive) intent algorithmically. Don&#8217;t spam away, though, because you might leave trails besides [...]]]></description>
			<content:encoded><![CDATA[
<p>Summary first: If you feel the need to cloak, just do it within reason. Don&#8217;t cloak because you can, but because it&#8217;s technically the most elegant procedure to accomplish a Web development task. Bing and Google can&#8217;t detect your (in no way deceptive) intent algorithmically. Don&#8217;t spam away, though, because you might leave trails besides cloaking alone, if you aren&#8217;t good enough at spamming search engines. Keep your users&#8217; interests in mind. Don&#8217;t treat search engine guidelines as set in stone, but comply with them to a reasonable level, for example when they <a href="http://www.youtube.com/watch?v=XWfqyy7J34s">force you to comply with Web standards</a> that make more sense than the fancy idea you&#8217;ve developed on internationalization, based on detecting browser language settings or so.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/penalizing-cloaking-is--bullshit.png" width="250" height="376" align="right" alt="search engine guidelines are bullshit WRT cloaking" title="Search engines must not penalize cloaking" style="margin-left:5px;" />This pamphlet is an opinion piece. The above should be considered best practice, even by search engines. Of course it&#8217;s not, because search engines can and do fail, just like a webmaster who takes my statement &#8220;go cloak away if it makes sense&#8221; as technical advice and gets his search engine visibility tanked the hard way.</p>
<h3>WTF is cloaking?</h3>
<p>Cloaking, also known as IP delivery, means delivering content tailored for specific users who are identified primarily by their IP addresses, but also by user agent (browser, crawler, screen reader&#8230;) names, and whatnot. Here&#8217;s a simple demonstration of this technique. The content of the next paragraph differs depending on the user requesting this page. Googlebot, Googlers, as well as Matt Cutts at work, will read a personalized message:</p>
<p><em>Dear visitor, thanks for your visit from 38.107.191.85 (38.107.191.85).</em></p>
<p>You surely can imagine that cloaking opens <del>a can of worms</del> <ins>lots of opportunities to enhance a user&#8217;s surfing experience</ins>, besides &#8220;stalking&#8221; particular users like Google&#8217;s head of WebSpam.</p>
<h3>Why do search engines dislike cloaking?</h3>
<p>Apparently they don&#8217;t. They use IP delivery themselves. When you&#8217;re traveling in Europe, you&#8217;ll get hints like &#8220;go to Google.fr&#8221; or &#8220;go to Google.at&#8221; all the time. That&#8217;s google.com checking where you are, trying to lure you into their regional services.</p>
<p>More seriously, there&#8217;s a so-called &#8220;dark side of cloaking&#8221;. Say you&#8217;re a <a href="http://fantomaster.com/fantomNews/archives/2010/07/08/fantomas-shadowmaker/">seasoned Internet marketer</a>, then you could show Googlebot an educational page with compelling content under a URI like &#8220;/games/poker&#8221; with an X-Robots-Tag HTTP header saying &#8220;noarchive&#8221;, whilst surfers (search engine users) supplying an HTTP_REFERER and not coming from employee.google.com get redirected to poker dot com (simplified example).</p>
<p>That&#8217;s hard to detect for Google&#8217;s WebSpam team. Because they don&#8217;t do evil themselves, they can&#8217;t officially operate sneaky bots that use for example AOL as their ISP to compare your spider fodder to pages/redirects served to actual users.</p>
<p>Bing sends out spam bots that request your pages &#8220;as a surfer&#8221; in order to discover deceptive cloaking. Of course those bots can be identified, so professional spammers serve them their spider fodder. Besides burning the bandwidth of non-cloaking sites, Bing doesn&#8217;t accomplish anything useful in terms of search quality.</p>
<p>Because search engines can&#8217;t detect cloaking properly, not to speak of a cloaking webmaster&#8217;s intentions, they&#8217;ve launched webmaster guidelines (FUD) that forbid cloaking at all. All Google/Bing reps tell you that cloaking is an evil black hat tactic that will get your site penalized or even banned. By the way, the same goes for perfectly legit &#8220;hidden content&#8221; that&#8217;s invisible on page load, but viewable after a mouse click on a &#8220;learn more&#8221; widget/link or so.</p>
<h3>Bullshit.</h3>
<p>If your competitor makes creative use of IP delivery to enhance their visitors&#8217; surfing experience, you can file a spam report for cloaking and Google/Bing will ban the site eventually. Just because cloaking <em>can</em> be used with deceptive intent. And yes, it works this way. See below.</p>
<p>Actually, those spam reports trigger a review by a human, so maybe your competitor gets away with it. But search engines also use spam reports to develop spam filters that penalize crawled pages fully automatically. Such filters can fail, and &#8211;trust me&#8211; they do fail often. Once you must optimize your content delivery for particular users or user groups yourself, such a filter could tank your very own stuff by accident. So don&#8217;t snitch on your competitors, because tomorrow they&#8217;ll return the favor.</p>
<h3>Enforcing a &#8220;do not cloak&#8221; policy is evil</h3>
<p>At least Google&#8217;s WebSpam team comes with cojones. They&#8217;ve even <a href="http://searchengineland.com/google-adwords-help-cloaks-to-google-gets-banned-45541">banned their very own help pages</a> for &#8220;<a href="http://google.com/search?hl=en&#038;q=matt+cutts+cloaking&#038;num=13&#038;safe=off">cloaking</a>&#8221;, although those didn&#8217;t serve porn to minors searching for SpongeBob images with safe-search=on.</p>
<p>That&#8217;s over the top, because the help files of any Google product aren&#8217;t usable without a search facility. When I click &#8220;help&#8221; in any Google service like AdWords, I get blank pages, and/or links within the help system are broken because the destination pages were deindexed for cloaking. Plain evil, and counterproductive.</p>
<p>Just because Google&#8217;s help software doesn&#8217;t show ads and related links to Googlebot, those pages aren&#8217;t guilty of deceptive cloaking. Ms Googlebot won&#8217;t pull the plastic, so it makes no sense to serve her advertisements. Related links are context sensitive just like ads, so it makes no sense to persist them in Google&#8217;s crawling cache, or even in Google&#8217;s search index. Also, as a user I really don&#8217;t care whether Google has crawled the same heading I see on a help page or not, as long as I get directed to relevant content, that is a paragraph or more that answers my question.</p>
<p>When a search engine intentionally doesn&#8217;t deliver the very best search results, just because those pages violate an outdated and utterly useless policy that targets fraudulent tactics in a form last used in the previous century and doesn&#8217;t take into account how the Internet works today, I&#8217;m pissed.</p>
<p>Maybe that&#8217;s not bad at all when applied to Google products? Bullshit, again. The same happens to any other website that doesn&#8217;t fit Google&#8217;s weird idea of &#8220;serving the same content to users and crawlers&#8221;. I mean, as long as Google&#8217;s crawlers come from US IPs only, how can a US based webmaster serve the same content in German language to a user coming from Austria and to Googlebot, both requesting a URI like &#8220;/shipping-costs?lang=de&#8221; that has to be different for each user because shipping a parcel to Germany costs $30.00 and a parcel of the same weight shipped to Vienna costs $40.00? Don&#8217;t tell me bothering a user with shipping fees for all regions in CH/AT/DE all on one page is a good idea, when I can reduce the information overload to a tailored info of just the one shipping fee that my user expects to see, followed by a link to a page that lists shipping costs for all European countries, or all countries where at least some folks might speak/understand German.</p>
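<p>A minimal sketch of that tailored shipping info, under assumptions: the fee table, the function names and the two-letter country codes are made up for illustration (only the two amounts come from the example above), and the step that maps a visitor&#8217;s IP to a country code is left out entirely.</p>

```php
<?php
// Hypothetical fee table; the two amounts are the ones quoted in the example.
const SHIPPING_FEES_USD = [
    'DE' => 30.00,
    'AT' => 40.00,
];

/**
 * Returns the one shipping fee a visitor from $countryCode expects to see,
 * or null when the country is unknown and the full list should be shown.
 */
function shippingFeeFor(string $countryCode): ?float
{
    $code = strtoupper(trim($countryCode));
    return SHIPPING_FEES_USD[$code] ?? null;
}

/**
 * Builds the single line of shipping info shown to the visitor.
 */
function shippingNotice(string $countryCode): string
{
    $fee = shippingFeeFor($countryCode);
    if ($fee === null) {
        // Unknown country: point to the complete list instead of guessing.
        return 'See the list of shipping costs for all countries.';
    }
    return sprintf('Shipping to %s costs $%.2f.', strtoupper(trim($countryCode)), $fee);
}
```

<p>A crawler coming from a US IP would fall into the &#8220;unknown country&#8221; branch here, which is exactly why user and crawler can&#8217;t see the same page.</p>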
<p>Back to Google&#8217;s ban of its very own help pages that hid AdSense code from Googlebot. Of course Google wants to see what surfers see in order to deliver relevant search results, and that might include advertisements. However, surrounding ads don&#8217;t necessarily obfuscate the page&#8217;s content. Ads served instead of content do. So when Google wants to detect ad laden thin pages, they need to become smarter. Penalizing pages that don&#8217;t show ads to search engine crawlers is a bad idea for a search engine, because not showing ads to crawlers is a good idea, not only bandwidth-wise, for a webmaster.</p>
<p>Managing this dichotomy is the search engine&#8217;s job. They shouldn&#8217;t expect webmasters to help them solve their very own problems (maintaining search quality). In fact, bothering webmasters with policies put in place solely because search engine algos are fallible and incapable is plain evil. The same applies to instruments like rel-nofollow (launched to help Google devalue spammy links but backfiring enormously) or Google&#8217;s war on paid links (as if not each and every link on the whole Internet is paid/bartered for, somehow).</p>
<p>What do you think, should search engines ditch their way too restrictive &#8220;don&#8217;t cloak&#8221; policies? <a href="http://twitter.com/home?status=Hey+@Google+@Bing,+go+ditch+your+outdated+webmaster+guidelines!+http%3A%2F%2Fsebastians-pamphlets.com/cloaking-is-good-for-your-vistors/" target="twitter" title="Stop search engines that tyrannize webmasters!"><b>Click to vote:</b> <img src="http://sebastians-pamphlets.com/img/twitter-icon.gif" width="10" height="10" style="border:none;" alt="Stop search engines that tyrannize webmasters!"  /></a></p>
<p> </p>
<p><b>Update 2010-07-06:</b> Don&#8217;t miss out on Danny Sullivan&#8217;s &#8220;<strong>Google be fair!</strong>&#8221; appeal, posted today: <a href="http://searchengineland.com/why-google-should-ban-its-own-help-pages-45781">Why Google Should Ban Its Own Help Pages — But Also Shouldn’t</a></p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span><div class="topsy_widget_data topsy_theme_light-green" style="float: right;margin-left: 0.75em;"><!-- { "url": "http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/", "style": "big", "title": "Cloaking is good for you. Just ignore Bing's/Google's guidelines." } --></div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google went belly-up: SERPs sneakily redirect to FPAs</title>
		<link>http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/</link>
		<comments>http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/#comments</comments>
		<pubDate>Wed, 12 May 2010 17:06:19 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Spam Report]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/</guid>
		<description><![CDATA[
I&#8217;m pissed. I do know I shouldn&#8217;t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.
What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google&#8217;s webmaster guidelines [...]]]></description>
			<content:encoded><![CDATA[
<p>I&#8217;m pissed. I do know I shouldn&#8217;t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.</p>
<p>What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Here you go</a>. Google&#8217;s webmaster guidelines explicitly forbid such black hat tactics: </p>
<p>&#8220;<strong>Don&#8217;t use cloaking or sneaky redirects.</strong>&#8221; Google just did the latter with its very own <a href="http://google.com/ie?q=buy+viagra+online">SERPs</a>. The search interface <a href="http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/#goog-ie-ui">google.com/ie</a>, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That&#8217;s a helpful redirect for some IE6 users who don&#8217;t suffer from an IT department stuck with this outdated browser, but it&#8217;s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, <abbr title="User Agent">UA</abbr> cloaking is the only way to heal this sneaky behavior.</p>
<p>&#8220;<strong>Don&#8217;t create pages with malicious behavior.</strong>&#8221; Google&#8217;s guilty, too. Instead of checking the user&#8217;s browser and redirecting only IE6 requests from <a href="http://www.google.com/search?output=ie&#038;num=100&#038;hl=en&#038;safe=off&#038;q=google+discontinues+IE6+support">Google&#8217;s discontinued IE6 support</a> (IE6 toolbar &#8230;) to the IE8 advertisement, whilst all other user agents get their desired search box, or their SERPs, under a google.com/search?output=ie&amp;&#8230; URI, Google performs an unconditional redirect to a page that&#8217;s utterly useless and also totally unexpected for many searchers. I consider misleading redirects malicious.</p>
<p>&#8220;<strong>Avoid links to web spammers or &#8216;bad neighborhoods&#8217; on the web.</strong>&#8221; I consider the propaganda for IE that Google displays instead of the search results I&#8217;d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous workarounds. (Ok, ok, ok &#8230; Google&#8217;s lack of geekiness doesn&#8217;t exactly count as violation of their webmaster guidelines, but it sounds good, doesn&#8217;t it?)</p>
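<p>The conditional redirect asked for above could look roughly like this. It&#8217;s a sketch, not Google&#8217;s code: the function name is hypothetical, the two target paths are the ones mentioned in this pamphlet, and the user agent test is a deliberately simple substring check.</p>

```php
<?php
/**
 * Decides where a request for the /ie search interface should go.
 * Only user agents that identify as IE6 get the upgrade page,
 * everybody else keeps the search interface.
 */
function ieSearchTarget(string $userAgent): string
{
    // IE6 announces itself with "MSIE 6." in its user agent string.
    // Old Opera versions could send that token too, so they are excluded.
    $isIe6 = stripos($userAgent, 'MSIE 6.') !== false
        && stripos($userAgent, 'Opera') === false;

    return $isIe6 ? '/toolbar/ie8/sidebar.html' : '/search?output=ie';
}
```

<p>That is plain user agent based delivery, in other words exactly the kind of cloaking the guidelines frown upon.</p>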
<p><a href="http://twitter.com/home?status=Hey+@MattCutts,+about+time+to+ban+google.com/ie?q=spam!+http%3A%2F%2Fsebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/" target="twitter" title="Tweet That!"><strong>Hey Matt Cutts, about time to ban google.com/ie!</strong> <img src="http://sebastians-pamphlets.com/img/twitter-icon.gif" width="10" height="10" style="border:none;" alt="Click to tweet that"  /></a></p>
<p id="goog-ie-ui"><a href="http://sebastians-pamphlets.com/rediscover-googles-free-ranking-checker/">Google&#8217;s very best search interface</a> is history. Here is what you got under<code><br />
<b>http://www.google.com/ie?num=100&#038;hl=en&#038;safe=off&#038;q=minimalistic</b></code>:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/google-awesome-ie-serp.png" width="448" height="503" style="text-align:center; display:block;" align="middle" alt="Google's famous minimalistic search UI" title="Google's famous minimalistic search UI" /></p>
<p>And here is where Google sneakily redirects you when you load the SERP link above (even with Chrome!)<code><br />
<b>http://www.google.com/toolbar/ie8/sidebar.html</b></code>:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/google-fpa-ie8.png" width="268" height="569" style="border:dotted red 1px; text-align:center; display:block;" align="middle" alt="Google's sneaky IE8 propaganda" title="Google's sneaky IE8 propaganda" /></p>
<p id="goog-ie-spam-report">It&#8217;s sad that a browser vendor like Google (and yes, Google Chrome <b>is</b> my favorite browser) feels the need to mislead its users with propaganda for a competing browser that&#8217;s slower and doesn&#8217;t render everything as it should render it. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that&#8217;s a scandal. So, if you agree, please submit a spam report to Google:</p>
<p><a href="http://twitter.com/home?status=Hey+@MattCutts,+about+time+to+ban+google.com/ie!+%23spam-report+http%3A%2F%2Fsebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/" target="twitter" title="Tweet Your Spam Report!"><strong>Hey Matt Cutts, about time to ban google.com/ie! #spam-report</strong> <img src="http://sebastians-pamphlets.com/img/twitter-icon.gif" width="10" height="10" style="border:none;" alt="Tweet Your Spam Report"  /></a></p>
<p>2010-05-17 I&#8217;ve updated this pamphlet because it didn&#8217;t explain the &#8220;sneakiness&#8221; clearly enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and serve ads for a somewhat better browser only to stubborn IE6 users.</p>
<p>2010-05-18 <b>Q:</b> You&#8217;re pissed solely because your SERP scraping scripts broke. <b>A:</b> Glad you&#8217;ve asked. Yes, I&#8217;ve <a href="http://www.scroogle.org/cgi-bin/scraper.htm" rel="crap nofollow">scraped Google&#8217;s /ie search</a> too. Not because I&#8217;m a <a href="http://www.google-watch.org/" rel="crap nofollow">privacy nazi</a> like Daniel Brandt. I&#8217;ve just checked (my) rankings. However, when I spotted the redirects I didn&#8217;t even remember the location of the scripts that scraped this service, because I didn&#8217;t look at ranking reports for years. I&#8217;m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold &#8220;no&#8221;. I don&#8217;t give a fucking dead rat&#8217;s ass what ranking reports based on scraped SERPs could tell.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-serps-sneakily-redirect-to-ads/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Get yourself a smart robots.txt</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/</link>
		<comments>http://sebastians-pamphlets.com/smart-robots-txt/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 18:27:16 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/smart-robots-txt/</guid>
		<description><![CDATA[
Crawlers and other Web robots are the plague of today&#8217;s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don&#8217;t. Behaving or not, most bots just steal your content. You don&#8217;t appreciate that, so block them.
This pamphlet is about blocking behaving bots with a smart robots.txt file. [...]]]></description>
			<content:encoded><![CDATA[
<p><img  src="http://sebastians-pamphlets.com/img/posts/block-greedy-bots.png" width="200" height="261" align="right" style="margin-left:5px;" alt="greedy and aggressive web robots steal your content" title="Block greedy Web robots stealing your content, the smart way" />Crawlers and <a href="http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/" title="MJ12bot, dotbot, ...">other Web robots</a> are the <a href="http://sebastians-pamphlets.com/social-media-plague/">plague</a> of today&#8217;s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">REP</a>), others don&#8217;t. Behaving or not, most bots just steal your content. You don&#8217;t appreciate that, so <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-rogue-bot">block</a> them.</p>
<p>This pamphlet is about blocking behaving bots with a <a href="http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/" title="Read this old robots.txt pamphlet to recap the basics">smart robots.txt</a> file. I&#8217;ll show you how you can restrict crawling to bots operated by major search engines &#8211;that bring you nice traffic&#8211; while keeping the nasty (or useless, traffic-wise) bots out of the game.</p>
<p>The basic idea is that blocking all bots &#8211;with very few exceptions&#8211; makes more sense than maintaining a kind of <a href="http://www.robotstxt.org/db.html">Web robots who&#8217;s who</a> in your robots.txt file. You decide whether a bot, or rather the service it crawls for, does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold <b>Disallow: /</b>.</p>
<p>Of course that&#8217;s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That&#8217;s bullshit. You can&#8217;t handle a gazillion bots with a black list. </p>
<p>Even bots that respect the REP can harm your search engine rankings, or <a href="http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/">reveal sensitive information to your competitors</a>. Every minute a new bot turns up. You can&#8217;t manage all of them, and you can&#8217;t trust any (behaving) bot. Or, as the <a href="http://incredibill.blogspot.com/">master of bot control</a> <a href="http://sphinn.com/story/139361/#74924">explains</a>: &#8220;<b>That&#8217;s the only thing I&#8217;m concerned with: what do I get in return. If it&#8217;s nothing, it&#8217;s blocked</b>&#8221;. </p>
<p>Also, large robots.txt files handling tons of bots are error-prone. It&#8217;s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If you on the other hand verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don&#8217;t enlighten your competitors.</p>
<p id="smart-robots-txt-toc">If you&#8217;re a smart webmaster agreeing with this approach, here&#8217;s your ToDo-List: <br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/downloads/smart_robots_txt.zip">Grab the code</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-install">Install</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-customize">Customize</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-test">Test</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-launch">Implement</a>.<br />On error read further.</p>
<h3 id="smart-robots-txt-anatomy">The anatomy of a smart robots.txt</h3>
<p>Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you&#8217;re somewhat fucked. The code isn&#8217;t elegant. I&#8217;ve tried to keep it easy to understand even for noobs &#8212; at the expense of occasional lengthiness and redundancy.</p>
<h4 id="smart-robots-txt-install">Install</h4>
<p>First of all, you should train Apache to parse your robots.txt file as PHP. You can do this by configuring all .txt files as PHP scripts, but that&#8217;s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you&#8217;d have to add a leading <code>&lt;?php ?&gt;</code> string to all of them. Hence you add this code snippet to your root&#8217;s .htaccess file:<code><br />
&lt;FilesMatch ^robots\.txt$&gt;<br />
SetHandler application/x-httpd-php<br />
&lt;/FilesMatch&gt;</code><br />
As long as you&#8217;re testing and customizing my script, make that <code>^smart_robots\.txt$</code>.</p>
<p>Next <a href="http://sebastians-pamphlets.com/downloads/smart_robots_txt.zip">grab the code</a> and extract it into your document root directory. <strong>Do not rename /smart_robots.txt to /robots.txt until you&#8217;ve customized the PHP code!</strong></p>
<p>For testing purposes you can use the logRequest() function. Probably it&#8217;s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don&#8217;t leave that in a production system; log accesses to /robots.txt in your database instead. The same goes for the blockIp() function, which in fact is a dummy.</p>
<h4 id="smart-robots-txt-customize">Customize</h4>
<p>Search the code for <code>#EDIT</code> and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write an isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, or /robots.txt once you&#8217;ve launched your smart robots.txt.</p>
<p>Let&#8217;s look at <strong>/smart_robots.txt</strong>. First of all, it sets the canonical server name; change that to yours. After routing <em>robots.txt request logging</em> to a flat file (change that to a database table!) it includes /smart_robots_inc.php. </p>
<p>Next it sends some HTTP headers that you shouldn&#8217;t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn&#8217;t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.</p>
<p style="margin-left:15px;">As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its <a href="http://google.com/search?q=cache:sebastians-pamphlets.com/robots.txt">cached copy</a>. To test your own robots.txt with Googlebot, you can log in to <a href="https://www.google.com/webmasters/tools/">GWC</a> and fetch it as Googlebot. It&#8217;s a shame that the other search engines don&#8217;t provide a feature like that.</p>
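<p>For illustration, headers that keep a dynamically generated robots.txt out of search engine caches could be put together like this. The header set and the function name are assumptions of this sketch; the headers actually sent by the downloadable script may differ.</p>

```php
<?php
/**
 * Headers for a dynamically generated robots.txt.
 * Assumed for illustration, not copied from the original script.
 */
function robotsTxtHeaders(): array
{
    return [
        // robots.txt is plain text, even when PHP generates it.
        'Content-Type: text/plain; charset=utf-8',
        // Asks search engines not to offer a cached copy of this response.
        'X-Robots-Tag: noarchive',
    ];
}

// Usage in the script, before any other output:
// foreach (robotsTxtHeaders() as $line) { header($line); }
```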
<p>When you implement the <strong>whitelisted crawler</strong> method, you really should provide a contact page for crawling requests. So please change the &#8220;In order to gain permissions to crawl blocked site areas&#8230;&#8221; comment.</p>
<p>Next up are the search engine specific crawler directives. You put them as <code><br />if (isGooglebot()) {<br />
   $content .= &quot;<br />
User-agent: Googlebot<br />
Disallow:<br />
&#8230;<br />
\n\n&quot;;<br />
}</code><br />
If your URIs contain double quotes, escape them as <code>\"</code> in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)</p>
<p>Please note that you need to output at least one empty line before each <code>User-agent:</code> section.  Repeat that for each accepted crawler, before you output <code><br />
$content .= &quot;User-agent: *<br />
Disallow: /<br />
\n\n&quot;;   </code><br />
Every behaving Web robot that&#8217;s not whitelisted will bounce at the  <code>Disallow: /</code>. </p>
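<p>Put together, the output logic boils down to something like the following sketch. The function buildRobotsTxt() and the <code>$sections</code> array are hypothetical names for this illustration; the downloadable script uses one <code>if (isSomecrawler())</code> block per crawler instead.</p>

```php
<?php
/**
 * Builds the robots.txt body for one request: the section of the verified
 * crawler (if any), followed by the catch-all section that blocks the rest.
 *
 * $sections maps a crawler name to its directive lines (hypothetical structure).
 */
function buildRobotsTxt(?string $verifiedCrawler, array $sections): string
{
    $content = '';
    if ($verifiedCrawler !== null && isset($sections[$verifiedCrawler])) {
        // The leading "\n" provides the empty line before the User-agent section.
        $content .= "\nUser-agent: $verifiedCrawler\n"
            . implode("\n", $sections[$verifiedCrawler]) . "\n\n";
    }
    // Fallback: every other behaving robot is shut out.
    $content .= "\nUser-agent: *\nDisallow: /\n\n";
    return $content;
}
```

<p>A request that isn&#8217;t from a verified crawler gets nothing but the fallback section, so the crawler specific directives stay hidden from everybody else.</p>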
<p>Before <code>$content</code> is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all <a href="http://www.seoconsultants.com/tools/headers/">SEO tools</a> that spoof the user agent. Make sure that you do not output a single byte &#8211;for example leading whitespace, a debug message, or a #comment&#8211; before the <code>print $content;</code> statement.</p>
<p style="margin-left:15px;">Blocking rogue bots is important. If you discover a rogue bot &#8211;for example a scraper that pretends to be Googlebot&#8211; during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can&#8217;t access your content!</p>
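<p>The rogue bot decision itself is tiny. Here is a sketch with hypothetical function names; the verification result is passed in as a flag, because how you verify a crawler depends on the search engine.</p>

```php
<?php
/**
 * Tells whether a user agent string claims to be Googlebot.
 */
function claimsGooglebot(string $userAgent): bool
{
    return stripos($userAgent, 'Googlebot') !== false;
}

/**
 * Returns the HTTP status for a robots.txt request.
 * $claimsCrawler: the user agent string names a whitelisted crawler.
 * $verified:      the requesting IP passed the crawler verification.
 */
function robotsTxtStatus(bool $claimsCrawler, bool $verified): int
{
    // A user agent that names a crawler but fails verification is a rogue bot.
    return ($claimsCrawler && !$verified) ? 403 : 200;
}

// Usage, before any output is sent ($verified comes from your own check):
// $status = robotsTxtStatus(claimsGooglebot($_SERVER['HTTP_USER_AGENT'] ?? ''), $verified);
// if ($status === 403) { http_response_code(403); exit; }
```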
<p>Bear in mind that each and every piece of content served from your site should implement <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-rogue-bot">rogue bot detection</a>; that&#8217;s doable even with non-HTML resources like images or PDFs.</p>
<p>Finally we deliver the user agent specific robots.txt and terminate the connection.</p>
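<p>The delivery step can be sketched like this. It is a minimal outline under assumptions: both function names are hypothetical, the reason phrase is the boring standard one, and the headers are borrowed from the sample output shown further down in this post.</p>
<pre><code>&lt;?php
// Hypothetical sketch: pick status code, status line and body for a
// robots.txt request.
function robotsTxtResponse(bool $isRogueBot, string $content): array
{
    if ($isRogueBot) {
        // Rogue bots get the status code only, never the directives.
        return [403, "HTTP/1.1 403 Forbidden", ""];
    }
    return [200, "HTTP/1.1 200 OK", $content];
}

function sendRobotsTxt(bool $isRogueBot, string $content): void
{
    [$code, $statusLine, $body] = robotsTxtResponse($isRogueBot, $content);
    if (!headers_sent()) {
        header($statusLine, true, $code);
        header("Content-Type: text/plain; charset=iso-8859-1");
        header("Connection: close");
    } else {
        // A stray byte (whitespace, debug output) already went out,
        // so the status line can no longer be changed.
        error_log("robots.txt: output started before headers");
    }
    print $body;
    exit;
}
</code></pre>
<p>Call <code>sendRobotsTxt()</code> as the very last statement of the script. The <code>headers_sent()</code> branch exists only to make the &#8220;not a single byte before the print statement&#8221; rule visible in your error log when somebody breaks it.</p>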
<p>Now let&#8217;s look at <strong>/smart_robots_inc.php</strong>. Don&#8217;t fuck-up the variable definitions and routines that populate them or deal with the requestor&#8217;s IP addy.</p>
<p>Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that&#8217;s maintained 24/7 and updated every 6 hours: <a href="http://fantomaster.com/fasvsspy01.html">here you go</a> (don&#8217;t use it for deceptive cloaking, promised?).</p>
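<p>A possible starting point for both functions, using PDO. Everything here is an assumption made for illustration: the parameter lists, the helper isBlockedIp(), and the two table layouts are not taken from /smart_robots_inc.php, so adapt them to whatever your database looks like.</p>
<pre><code>&lt;?php
// Hypothetical signatures; the originals in /smart_robots_inc.php may differ.
// Assumed tables (also hypothetical):
//   blocked_ips(ip, user_agent, blocked_at)
//   bot_requests(ip, user_agent, uri, status, requested_at)
function blockIp(PDO $db, string $ip, string $userAgent): void
{
    if (isBlockedIp($db, $ip)) {
        return; // already on the list
    }
    $insert = $db-&gt;prepare(
        "INSERT INTO blocked_ips (ip, user_agent, blocked_at) VALUES (?, ?, ?)"
    );
    $insert-&gt;execute([$ip, $userAgent, gmdate("Y-m-d H:i:s")]);
}

// Check this before serving any content, not only robots.txt.
function isBlockedIp(PDO $db, string $ip): bool
{
    $check = $db-&gt;prepare("SELECT COUNT(*) FROM blocked_ips WHERE ip = ?");
    $check-&gt;execute([$ip]);
    return (int) $check-&gt;fetchColumn() &gt; 0;
}

function logRequest(PDO $db, string $ip, string $userAgent, string $uri, int $status): void
{
    $insert = $db-&gt;prepare(
        "INSERT INTO bot_requests (ip, user_agent, uri, status, requested_at)"
        . " VALUES (?, ?, ?, ?, ?)"
    );
    $insert-&gt;execute([$ip, $userAgent, $uri, $status, gmdate("Y-m-d H:i:s")]);
}
</code></pre>
<p>Prepared statements matter here, because the user agent string is attacker-controlled input.</p>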
<p>/smart_robots_inc.php comes with functions that detect <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=80553">Googlebot</a>, <a href="http://www.bing.com/community/blogs/search/archive/2006/11/29/search-robots-in-disguise.aspx">MSNbot</a>, and <a href="http://www.ysearchblog.com/2007/06/05/yahoo-search-crawler-slurp-has-a-new-address-and-signature-card/">Slurp</a>.</p>
<p>Most search engines tell you how to <a href="http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html">verify</a> their crawlers and which crawler directives <a href="http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html">their user agents</a> support. To add a crawler, just adapt my code. For example, to add Yandex, test the host name for a leading &#8220;spider&#8221;, a trailing &#8220;.yandex.ru&#8221;, and an integer in between, as in the isSlurp() function.</p>
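<p>Here is how such a check could look, as a sketch only. It follows the usual reverse-then-forward DNS verification; the function name isYandexBot() and its two optional resolver parameters are hypothetical (the resolvers exist just to test the logic without network access), and the host name pattern is merely the one described above, so check it against your own logs first.</p>
<pre><code>&lt;?php
// Sketch of reverse + forward DNS crawler verification.
function isYandexBot(
    string $ip,
    ?callable $reverseLookup = null,
    ?callable $forwardLookup = null
): bool {
    $reverseLookup = $reverseLookup ?? 'gethostbyaddr';
    $forwardLookup = $forwardLookup ?? 'gethostbynamel';

    // Step 1: the IP must resolve to a host like spider123.yandex.ru
    // (assumed pattern). On failure gethostbyaddr() returns the IP
    // itself or false, and neither matches.
    $host = $reverseLookup($ip);
    if (!is_string($host) || !preg_match('/^spider\d+\.yandex\.ru$/i', $host)) {
        return false;
    }
    // Step 2: the host name must resolve back to the same IP. Otherwise
    // anybody controlling a reverse DNS zone could fake the name.
    $ips = $forwardLookup($host);
    return is_array($ips) &amp;&amp; in_array($ip, $ips, true);
}
</code></pre>
<p>In production you&#8217;d call <code>isYandexBot($_SERVER["REMOTE_ADDR"])</code> and cache the result per IP, because two DNS lookups per request are expensive.</p>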
<h4 id="smart-robots-txt-test">Test</h4>
<p>Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don&#8217;t need to wait for crawler visits, you can use the &#8220;Fetch as Googlebot&#8221; thingy in your webmaster console.</p>
<p>Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from &#8220;.search.live.com&#8221; by accident, or that someone at Yahoo starts an ancient test bot that still uses an &#8220;inktomisearch.com&#8221; host name.</p>
<p>Don&#8217;t rely on my crawler detection routines. They&#8217;re dumped from memory in a hurry, and I&#8217;ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It&#8217;s up to you to make it smart.</p>
<h4 id="smart-robots-txt-launch">Launch</h4>
<p>Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.</p>
<h3 id="smart-robots-txt-output">The output of a smart robots.txt</h3>
<p>When you download a <a href="http://sebastians-pamphlets.com/smart_robots.txt">smart robots.txt</a> with your browser, wget, or any other tool that comes with user agent spoofing, you&#8217;ll see a 403 or something like:</p>
<blockquote><p><code><br />
HTTP/1.1 200 OK<br />
Date: Wed, 24 Feb 2010 16:14:50 GMT<br />
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1<br />
X-Powered-By: sebastians-pamphlets.com<br />
X-Robots-Tag: noindex, noarchive, nosnippet<br />
Connection: close<br />
Transfer-Encoding: chunked<br />
Content-Type: text/plain;charset=iso-8859-1</p>
<p># In order to gain permissions to crawl blocked site areas<br />
# please contact the webmaster via<br />
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot </p>
<p>User-agent: *<br />
Disallow: /<br />
</code>(the contact form URI above doesn&#8217;t exist)</p></blockquote>
<p>whilst a real search engine crawler like Googlebot gets slightly different contents:</p>
<blockquote><p><code><br />
HTTP/1.1 200 OK<br />
Date: Wed, 24 Feb 2010 16:14:50 GMT<br />
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1<br />
X-Powered-By: sebastians-pamphlets.com<br />
X-Robots-Tag: noindex, noarchive, nosnippet<br />
Connection: close<br />
Transfer-Encoding: chunked<br />
Content-Type: text/plain; charset=iso-8859-1</p>
<p># In order to gain permissions to crawl blocked site areas<br />
# please contact the webmaster via<br />
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot </p>
<p>User-agent: Googlebot<br />
Allow: /<br />
Disallow: </p>
<p>Sitemap: http://sebastians-pamphlets.com/sitemap.xml</p>
<p>User-agent: *<br />
Disallow: /<br />
</code></p></blockquote>
<h3 id="smart-robots-txt-rant">Search engines hide important information from webmasters</h3>
<p>Unfortunately, most search engines don&#8217;t provide enough information about their crawling. For example, last time I looked, Google didn&#8217;t even mention the Googlebot-News user agent in their help files, nor did they list all their user agent strings. Check your raw logs for &#8220;<span title="Googlebot-Mobile/2.1">Googlebot-</span>&#8221; and you&#8217;ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection, webmasters do need such information.</p>
<p>I&#8217;ve nudged Google and their response was that they don&#8217;t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn&#8217;t even announce properly that they&#8217;ve changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.</p>
<p>Since search engines don&#8217;t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents &#8220;on behalf of&#8221; particular user agents.</p>
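<p>A quick way to do that log check, sketched under the assumption that your server writes Apache&#8217;s combined log format, where the user agent is the last quoted field of each line. The function name is made up; adjust the pattern for other log formats.</p>
<pre><code>&lt;?php
// Sketch: count user agent strings containing "Googlebot-" in an access log.
function countGooglebotVariants(iterable $logLines): array
{
    $counts = [];
    foreach ($logLines as $line) {
        if (!preg_match('/"([^"]*)"\s*$/', $line, $match)) {
            continue; // no trailing quoted field on this line
        }
        $userAgent = $match[1];
        if (stripos($userAgent, "Googlebot-") !== false) {
            $counts[$userAgent] = ($counts[$userAgent] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent user agent first
    return $counts;
}
</code></pre>
<p>Feed it anything that iterates over lines, for example <code>new SplFileObject("/path/to/access.log")</code> (the path is a placeholder). User agents that merely contain &#8220;Googlebot/&#8221; are skipped on purpose, and none of the hits are verified crawlers until their IPs pass a DNS check.</p>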
<p>&nbsp;</p>
<p><strong>Enjoy.</strong></p>
<p>&nbsp;</p>
<div id="smart-robots-txt-changelog" style="margin-left:10px;">
<p><strong>Changelog:</strong></p>
<p>2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log  /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.</p>
</div>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span><div class="topsy_widget_data topsy_theme_light-green" style="float: right;margin-left: 0.75em;"><!-- { "url": "http://sebastians-pamphlets.com/smart-robots-txt/", "style": "big", "title": "Get yourself a smart robots.txt" } --></div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/smart-robots-txt/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The anatomy of a deceptive Tweet spamming Google Real-Time Search</title>
		<link>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/</link>
		<comments>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 10:12:44 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Twitter]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/</guid>
		<description><![CDATA[
Minutes after the launch of Google&#8217;s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.
I&#8217;m sure Google&#8217;s WebSpam team will pull the plug sooner or later, but as of today Google&#8217;s real time search results are extremely vulnerable to questionable content.
The somewhat [...]]]></description>
			<content:encoded><![CDATA[
<p><img  src="http://sebastians-pamphlets.com/img/posts/spamming-google-real-time-search.png" width="250" height="345" align="right" style="margin-left:5px;" alt="Google real time search spammed and abused" title=""  />Minutes after the <a href="http://googleblog.blogspot.com/2009/12/relevance-meets-real-time-web.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">launch</a> of Google&#8217;s <a href="http://searchengineland.com/search-real-time-madness-31668">famous</a> Real Time Search, the Internet marketing community <a href="http://sphinn.com/story/135685">began</a> to <a href="http://outspokenmedia.com/seo/google-real-time-spam/">spam</a> the <a href="http://www.google.com/search?hl=en&#038;safe=off&#038;esrch=RTSearch&#038;tbo=1&#038;num=100&#038;q=spam&#038;tbs=rltm:1">scrolling SERPs</a>. Google gave birth to a <a href="http://www.seo-theory.com/2009/12/07/google-launches-a-new-spam-industry/">new spam industry</a>.</p>
<p>I&#8217;m sure Google&#8217;s <a href="http://friendfeed.com/dannysullivan/d973e438/real-time-spam-google-says-been-fighting-so-long">WebSpam</a> team will pull the plug sooner or later, but as of today Google&#8217;s real time search results are extremely vulnerable to questionable content.</p>
<p>The somewhat shady way of making creative use of real time search that I&#8217;m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I&#8217;d dump the whole real-time thingy until the spam defense lines are rock solid.</p>
<p id="rtss-recipe"><strong>Here&#8217;s the recipe from Dr Evil&#8217;s WebSpam-Cook-Book:</strong></p>
<h3 id="rtss-ingredients">Ingredients</h3>
<ul>
<li>1 <a href="http://www.google.com/trends?q=spam+google">popular topic</a> that pulls lots of searches, but not so many that the results scroll down too fast.</li>
<li>1 <a href="http://www.google.com/products?q=spam+google&#038;hl=en&#038;aq=f">landing page</a> that makes the punter pull out the plastic in no time.</li>
<li>1 <a href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&#038;answer=93713">trusted authority page</a> totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.</li>
<li>1 <a href="http://goo.gl/">short</a> domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.</li>
</ul>
<h3 id="rtss-preparation">Preparation</h3>
<p>Create a new text file and name it <code>hot-topic.php</code> or so. Then code:<code><br />
&lt;?php<br />
$landingPageUri = "http://affiliate-program.com/?your-aff-id";<br />
$trustedPageUri = "http://google.com/something.py";<br />
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {<br />
   header("HTTP/1.1 307 Here you go today", TRUE, 307);<br />
   header("Location: $trustedPageUri");<br />
}<br />
else {<br />
   header("HTTP/1.1 301 Happy shopping", TRUE, 301);<br />
   header("Location: $landingPageUri");<br />
}<br />
exit;<br />
?&gt;</code></p>
<p>Provided you&#8217;re a savvy spammer, your crawler detection routine will be a little more <a href="http://fantomaster.com/fasvsspy01.html">complex</a>.</p>
<p>Save the file and upload it, then test the URI <code>http://youspamaw.ay/hot-topic.php</code> in your browser.</p>
<h3 id="rtss-serving">Serving</h3>
<ul>
<li>Log in to Twitter and submit lots of nicely crafted, not too keyword-stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don&#8217;t swear, and sail around phrases like &#8216;buy cheap viagra&#8217; with synonyms like &#8216;brighten up your girlfriend&#8217;s romantic moments&#8217;.</li>
<li>On their SERPs, Google will display the text from the trusted page&#8217;s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.</li>
<li>Just for entertainment, closely monitor Google&#8217;s real time SERPs, and your real-time sales stats as well.</li>
<li>Be happy and get rich by end of the week.</li>
</ul>
<p>Google removes links to untrusted destinations; that&#8217;s why you need to abuse authority pages. As long as you don&#8217;t launch f-bombs, Google&#8217;s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.</p>
<p>Hey <a href="http://twitter.com/GoogleWebspam">Google</a>, for the sake of our children, take that as a spam report!</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How to handle a machine-readable pandemic that search engines cannot control</title>
		<link>http://sebastians-pamphlets.com/rip-rel-nofollow-funeral-party/</link>
		<comments>http://sebastians-pamphlets.com/rip-rel-nofollow-funeral-party/#comments</comments>
		<pubDate>Fri, 19 Jun 2009 20:59:51 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Blogging]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/rip-rel-nofollow-funeral-party/</guid>
		<description><![CDATA[
If you&#8217;re familiar with my various rants on the ever morphing rel-nofollow microformat infectious link disease, don&#8217;t read further. This post is not polemic, ironic, insulting, or otherwise meant to entertain you. I&#8217;m just raving about a way to delay the downfall of the InterWeb.

Let&#8217;s recap: The World Wide Web is based on hyperlinks. Hyperlinks [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/rel-nofollow-rest-in-peace.png" width="200" height="248" align="right" style="margin-left:2px;" alt="R.I.P. rel-nofollow" title="Rest In Peace, rel-nofollow!" />If you&#8217;re familiar with my various rants on the ever morphing <strike>rel-nofollow microformat</strike> <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">infectious link disease</a>, don&#8217;t read further. This post is not polemic, ironic, insulting, or otherwise meant to entertain you. I&#8217;m just raving about a way to delay the downfall of the InterWeb.</p>
<div style="margin-left:5px; border-left:thin dotted red; padding-left:5px;">
<p style="margin-left:5px;"><b>Let&#8217;s recap:</b> The World Wide Web is based on hyperlinks. Hyperlinks are supposed to lead humans to interesting stuff they want to consume. This simple and therefore brilliant concept worked great for years. The Internet grew up, bubbled a bit, but eventually it gained world domination. Internet traffic was counted, sold, bartered, purchased, and even exchanged for free in units called &#8220;hits&#8221;. (A &#8220;hit&#8221; means one human surfer landing on a sales pitch. That is a popup hell designed in a way that somebody involved just has to make a sale).</p>
<p style="margin-left:5px;">Then in the past century two smart guys discovered that links scraped from Web pages can be misused to provide humans with very accurate search results. They even created a new currency on the Web, and quickly assigned their price tags to Web pages. Naturally, folks began to trade green pixels instead of traffic. After a short while the Internet voluntarily transferred its world domination to the company founded by those two smart guys from Stanford.</p>
<p style="margin-left:5px;">Of course the huge amount of green pixel trades made the search results based on link popularity somewhat useless, because the webmasters gathering the most incoming links got the top 10 positions on the search result pages (SERPs). Search engines claimed that a few webmasters cheated on their way to the first SERPs, although lawyers say there&#8217;s no evidence of any illegal activities related to search engine optimization (SEO).</p>
<p style="margin-left:5px;">However, after suffering from heavy attacks from a whiny blogger, the Web&#8217;s dominating search engine got somewhat upset and required all webmasters to assign a machine-readable tag (link condom) to links sneakily inserted into their Web pages by other webmasters. &#8220;Sneakily inserted links&#8221; meant references to authors as well as links embedded in content supplied by users. All blogging platforms, CMS vendors and the like implemented the link condom, eliminating presumably 5.00% of the Web&#8217;s linkage at that time.</p>
<p style="margin-left:5px;">A couple of months later the world dominating search engine demanded that webmasters have to condomize their banner ads, intercompany linkage and other commercial links, as well as all hyperlinked references that do not count as pure academic citation (aka editorial links). The whole InterWeb complied, since this company controlled nearly all the free traffic available from Web search, as well as the Web&#8217;s purchasable traffic streams.</p>
<p style="margin-left:5px;">Roughly 3.00% of the Web&#8217;s links had been condomized when the search giant spotted that their users (searchers) missed out on lots and lots of valuable contents covered by link condoms. Ooops. Kinda dilemma. Taking back the link condom requirements was no option, because this would have flooded the search index with billions of unwanted links empowering commercial content to rank above boring academic stuff.</p>
<p style="margin-left:5px;">So the handling of link condoms in the search engine&#8217;s crawling engine as well as in its ranking algorithm was changed silently. Without telling anybody outside their campus, some condomized links gained power, whilst others were kept impotent. In fact they&#8217;ve developed a method to judge each and every link on the whole Web without a little help from their <strike>friends</strike> link condoms. In other words, the link condom became obsolete.</p>
<p style="margin-left:5px;">Of course that&#8217;s what they should have done in the first place, without asking the world&#8217;s webmasters for gazillions of free-of-charge man years producing shitloads of useless code bloat. Unfortunately, they didn&#8217;t have the balls to stand up and admit &#8220;sorry folks, we&#8217;ve failed miserably, link condoms are history&#8221;. Therefore the Web community still has to bother with an obsolete microformat. And if they &#8211;the link comdoms&#8211; are not dead, then they live today. In your markup. Hurting your rankings.</p>
</div>
<p style="margin-left:30px;"><small>If you, dear reader, are a Googler, then please don&#8217;t feel too annoyed. You may have thought that you didn&#8217;t do evil, but the above reflects what webmasters outside the &#8216;Plex got from your actions. Don&#8217;t ignore it, please think about it from our point of view. Thanks.</small></p>
<p>Still here and attentive? Great. Now let&#8217;s talk about scenarios in WebDev where you still can&#8217;t avoid rel-nofollow. If there are any at all. We&#8217;ll see.</p>
<h3>PageRank&trade; sculpting</h3>
<p>Dude, PageRank&trade; sculpting with rel-nofollow doesn&#8217;t work for the average webmaster. It might even fail when applied as a highly sophisticated SEO tactic. So don&#8217;t even think about it. Simply remove the <code>rel=nofollow</code> from links to your TOS, imprint, and contact page. Cloak away your links to signup pages, login pages, shopping carts and stuff like that.</p>
<h3>Link monkey business</h3>
<p>I leave this paragraph empty, because when you know what you do, you don&#8217;t need advice.</p>
<h3>Affiliate links</h3>
<p>There&#8217;s no point in serving <a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=90">A elements</a> to Googlebot at all. If you haven&#8217;t cloaked your aff links yet, go see an SEO doctor.</p>
<h3>Advanced SEO purposes</h3>
<p>See above.</p>
<p><b>So what&#8217;s left?</b> User generated content. Let&#8217;s concentrate our extremely superfluous condomizing efforts on the one and only occasion that might justify applying rel-nofollow to a hyperlink at the request of a major search engine, if there&#8217;s any good reason to paint shit brown at all.</p>
<h3>Blogging</h3>
<p>If you link out in a blog post, then you vouch for the link&#8217;s destination. In case you disagree with the link destination&#8217;s content, just put the link as</p>
<p><strong id="enemylink" title="http://example.com/"><code>&lt;strong class="blue_underlined" title="http://myworstenemy.org/" onclick="window.location=this.title;"&gt;<span onclick="window.location=document.getElementById('enemylink').title; return false;" style="color:blue; text-decoration:underline;">My Worst Enemy</span>&lt;/strong&gt;</code></strong></p>
<p>or so. The surfer can click the link and lands at the expected URI, but search engines don&#8217;t pass reputation. Also, they don&#8217;t evaporate link juice, because they don&#8217;t interpret the markup as a hyperlink.</p>
<h3>Blog comments</h3>
<p>My rule of thumb is: <strong>Moderate, DoFollow quality, DoDelete crap</strong>. Install a conditional do-follow plug-in, set everything on moderation, use captchas or something similar, then let the comment&#8217;s link juice flow. You can maintain a white list that allows instant appearance of comments from your buddies.</p>
<h3>Forums, guestbooks and unmoderated stuff like that</h3>
<p>Separate all Web site areas that handle user generated content. Serve &#8220;index,nofollow&#8221; robots meta tags or X-Robots-Tag headers for all those pages, and link them from a site map or so. If you gather index-worthy content from users, then feed crawlers the content in a parallel &#8211;crawlable&#8211; structure, without submit buttons, perhaps with links from trusted users, and redirect human visitors to the interactive pages. Vice versa, redirect crawlers requesting live pages to the spider fodder. All those redirects go with a 301 HTTP response code.</p>
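<p>The header variant could be sketched as follows. The function name and the path prefixes are hypothetical; the only thing taken from the text above is the &#8220;index, nofollow&#8221; value sent for pages carrying user generated content.</p>
<pre><code>&lt;?php
// Sketch: decide which robots header a page gets, based on its path.
function ugcRobotsHeaders(string $requestPath, array $ugcPrefixes): array
{
    foreach ($ugcPrefixes as $prefix) {
        if (strpos($requestPath, $prefix) === 0) {
            // Index the page itself, but don't follow its links.
            return ["X-Robots-Tag: index, nofollow"];
        }
    }
    return [];
}

// Headers must go out before the first byte of the page body.
$path = parse_url($_SERVER["REQUEST_URI"] ?? "/", PHP_URL_PATH) ?: "/";
foreach (ugcRobotsHeaders($path, ["/forum/", "/guestbook/"]) as $headerLine) {
    header($headerLine);
}
</code></pre>
<p>Compared to a meta tag, the header also works for non-HTML resources, for example attachments uploaded by forum users.</p>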
<p>If you lack the technical skills to accomplish that, then edit your <code>/robots.txt</code> file as follows:</p>
<p><code>User-agent: Googlebot<br />
# Dear Googlebot, drop me a line when you can handle forum pages<br />
# w/o rel-nofollow crap. Then I'll allow crawling.<br />
# Treat that as conditional disallow:<br />
Disallow: /forum</code></p>
<p>As soon as Google can handle your user generated content naturally, they might send you a message in their Webmaster console.</p>
<h3>Anything else</h3>
<p>Judge yourself. Most probably you&#8217;ll find a way to avoid rel-nofollow.</p>
<h3>Conclusion</h3>
<p><strong>Absolutely nobody needs the rel-nofollow microformat. Not even search engines for the sake of their index.</strong> Hence webmasters as well as search engines can stop wasting resources. Farewell <code>rel="nofollow"</code>, rest in peace. We won&#8217;t miss you.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/rip-rel-nofollow-funeral-party/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Vaporize yourself before Google burns your linking power</title>
		<link>http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/</link>
		<comments>http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 19:14:12 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Anchor Text]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/</guid>
		<description><![CDATA[
I couldn&#8217;t care less about PageRank&#8482; sculpting, because a well thought out link architecture does the job with all search engines, not just Google. That&#8217;s where Google is right on the money.
They own PageRank&#8482;, hence they can burn, evaporate, nillify, and even divide by zero or multiply by -1 as much PageRank&#8482; as they like; [...]]]></description>
			<content:encoded><![CDATA[
<p><img id="pic1" src="http://sebastians-pamphlets.com/img/posts/google-page-rank-factory-2007.png" width="218" height="367" border="0" align="right" alt="PIC-1: Google PageRank(tm) 2007" title="Google PageRank(tm) 2007" />I couldn&#8217;t care less about PageRank&trade; sculpting, because a well thought out link architecture does the job with all search engines, not just Google. That&#8217;s where Google is right on the money.</p>
<p>They own PageRank&trade;, hence they can burn, evaporate, nillify, and even divide by zero or multiply by -1 as much PageRank&trade; as they like; of course as long as they rank my stuff nicely above my competitors.</p>
<p><a href="http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/#pic1">Picture 1</a> shows Google&#8217;s PageRank&trade; factory as of 2007 or so. Actually, it&#8217;s a pretty simplified model, but since they&#8217;ve changed the PageRank&trade; algo anyway, you don&#8217;t need to bother with all the geeky details.</p>
<p>As a side note: you might ask why I don&#8217;t link to <a id="dullestlink" onmouseout="alert('Ok, you couldn`t resist ... and here you go. That is, if you`re able to click the darn link to Matt`s blog ...'); return true;" onclick="alert('Matt didn`t pay me to link out, so go search his blog on the Interweb'); window.location=document.getElementById('dullestlink').rev; return true;" rev="http://www.dullest.com/blog/pagerank-sculpting/" style="color:#0063dc;font-weight:400;text-decoration:none;border-bottom:1px solid #ccc;">Matt Cutts</a> and <a href="http://searchengineland.com/pagerank-sculpting-is-dead-long-live-pagerank-sculpting-21102" id="searchenginelandlink" onmouseout="alert('Why do you think that`s a link? Never trust underlined blue text anymore! Guess where you`ll land, search cowboy ...'); return true;">Danny Sullivan</a> discussing the whole mess on their blogs? Well, probably Matt can&#8217;t afford my advertising rates, and the whole SEO industry has linked to Danny anyway. If you&#8217;re nosy, check out my source code to learn more about state of the art linkage very compliant to <span onclick="alert('Gotcha!  That`s not a link, it`s a fucking fake as per Google`s request. I can`t link out to Google`s guidelines any more, coz they steal my link juice.'); return true;" style="color:#0063dc;font-weight:400;text-decoration:none;border-bottom:1px solid #ccc;" title="High quality nonsense">Google&#8217;s newest guidelines for advanced SEOs</span> (summary: &#8220;Don&#8217;t trust underlined blue text on Web pages any longer!&#8221;).</p>
<p><img id="pic2" src="http://sebastians-pamphlets.com/img/posts/google-page-rank-factory-2009.png" width="218" height="429" border="0" align="right" alt="PIC-2: Google PageRank(tm) 2009" title="Google PageRank(tm) 2009" />What really matters is <a href="http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/#pic2">picture 2</a>, revealing Google&#8217;s new PageRank&trade; facilities, silently launched in 2008. Again, geeky details are of minor interest. If you really want to know everything, then search for  [<a href="http://www.seofaststart.com/blog/googles-operation-bendover-exposed-nofollow-pagerank-sculpting" rel="dofollow highly-recommended" style="text-decoration:none;"><code style="font-weight:bolder;font-size:1em;">operation bendover</code></a>] at !Yahoo (it&#8217;s still top secret, and therefore not searchable at Google).</p>
<p>Unfortunately, advanced SEO folks <small>(whatever that means, I use this term just because it seems to be an essential property assigned to the participants of the current PageRank&trade; <strike>uprising</strike> discussion)</small> always try to confuse you with <a href="http://www.seomoz.org/blog/google-says-yes-you-can-still-sculpt-pagerank-no-you-cant-do-it-with-nofollow">overcomplicated graphics and formulas</a> when it comes to PageRank&trade;. Instead, I ask you to focus on the (important) hard core stuff. So go grab a magnifier, and work out the differences:</p>
<ul>
<li><a href="http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/#pic2">PageRank&trade; 2009</a> in comparison to <a href="http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/#pic1">PageRank&trade; 2007</a> comes with a pipeline supplying unlimited fuel. Also, it seems they&#8217;ve implemented the green new deal, switching from gas to natural gas. That means they can vaporize way more link juice than ever before.</li>
<li>PageRank&trade; 2009 produces more steam, and the clouds look slightly different. Whilst PageRank&trade; 2007 ignored <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">nofollow crap</a> as well as links put with client sided scripting, PageRank&trade; 2009 evaporates not only juice covered with <a href="http://link-condom.com/">link condoms</a>, but also tons of other permutations of the <a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=90">standard A element</a>. </li>
<li>To compensate for the huge overall loss of PageRank&trade; caused by those changes, Google has decided to pass link juice from condomized links to their target URIs hidden from Googlebot with JavaScript. Of course Google formerly recommended the use of JavaScript links to protect webmasters from penalties for so-called &#8220;questionable&#8221; outgoing links. Just as they&#8217;ve not only invented rel-nofollow, but heavily recommended the use of this microformat with all links disliked by Google, and now they take that back as if a gazillion links on the Web could magically change just because Google tweaks their algos. Doh! I really hope that the WebSpam-team checks the age of such links before they penalize everything implemented according to their guidelines before mid-2009 or the InterWeb&#8217;s downfall, whatever comes last. </li>
</ul>
<p>I guess in the meantime you&#8217;ve figured out that I&#8217;m somewhat pissed. Not that the secretly changed flow of PageRank&trade; a year ago in 2008 had any impact on my rankings, or SERP traffic. I&#8217;ve always designed my stuff with PageRank&trade; flow in mind, but without any misuses of rel=&#8221;nofollow&#8221;, so I&#8217;m still fine with Google.</p>
<p> What I can&#8217;t stand is when a search engine tries to tell me how I have to link (out). Google engineers are really smart folks, they&#8217;re perfectly able to develop a PageRank&trade; algo that can decide how much Google-juice a particular link should pass. So dear Googlers, please &#8211;WRT the implementation of hyperlinks&#8211; leave us webmasters alone, dump the rel-nofollow crap and rank our stuff in the best interest of your searchers. No longer bother us with linking guidelines that change yearly. It&#8217;s not our job nor responsibility to act as your <strike>cannon fodder</strike> slavish code monkeys when you spot a loophole in your ranking- or spam-detection-algos.</p>
<p>Of course the above said is based on common sense, so Google won&#8217;t listen (remember: I&#8217;m really upset, hence polemic statements are absolutely appropriate). To protect webmasters from the irrational actions of misled search engines, I hereby introduce the</p>
<h3>Webmaster guidelines for search engine friendly links</h3>
<p>What follows is pseudo-code; implement it in your preferred server-side scripting language.</p>
<p><code>if (getAttribute($link, 'rel') matches '*nofollow*' &#038;&#038;<br />
&nbsp;&nbsp;&nbsp;&nbsp;$userAgent matches '*Googlebot*') {<br />
&nbsp;&nbsp;&nbsp;&nbsp;print '&lt;strong rev="' + getAttribute($link, 'href') + '"'<br />
&nbsp;&nbsp;&nbsp;&nbsp;+ ' style="color:blue; text-decoration:underline;"'<br />
&nbsp;&nbsp;&nbsp;&nbsp;+ ' onmousedown="window.location=this.getAttribute(\'rev\'); "'<br />
&nbsp;&nbsp;&nbsp;&nbsp;+ '&gt;' + getAnchorText($link) + '&lt;/strong&gt;';<br />
}<br />
else {<br />
&nbsp;&nbsp;&nbsp;&nbsp;print $link;<br />
}</code></p>
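<p>A minimal JavaScript (Node.js) sketch of the pseudo-code above, under a few assumptions: the function name <code>renderLink</code> and the <code>{ href, rel, text }</code> link shape are made up for illustration, the values are assumed to be HTML-escaped already, and a user agent string is only as trustworthy as the client that sends it.</p>

```javascript
// Hypothetical helper: the post only gives pseudo-code, so the name and the
// { href, rel, text } input shape are assumptions made for this sketch.
// href, rel and text are assumed to be HTML-escaped by the caller.
function renderLink(link, userAgent) {
  const isNofollow = /nofollow/i.test(link.rel || "");
  const isGooglebot = /Googlebot/i.test(userAgent || "");

  if (isNofollow && isGooglebot) {
    // Crawler version: no A element, the target URI sits in a rev attribute.
    return '<strong rev="' + link.href + '"' +
      ' style="color:blue; text-decoration:underline;"' +
      ' onmousedown="window.location=this.getAttribute(\'rev\');"' +
      '>' + link.text + '</strong>';
  }

  // Everybody else gets the link as it was written.
  const relAttribute = link.rel ? ' rel="' + link.rel + '"' : "";
  return '<a href="' + link.href + '"' + relAttribute + '>' +
    link.text + '</a>';
}
```

<p>Called with a Googlebot user agent and a link whose <code>rel</code> contains nofollow, the function returns the <code>&lt;strong&gt;</code> substitute; every other combination gets the plain A element.</p>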
<p>Probably it&#8217;s a good idea to snip both the onmousedown trigger code as well as the rev attribute, when the script gets executed by Googlebot. Just because today Google states that they&#8217;re going to pass link juice to URIs grabbed from the onclick trigger, that doesn&#8217;t mean they&#8217;ll never look at the onmousedown event or misused (X)HTML attributes.</p>
<p>This way you can deliver Googlebot exactly the same stuff that the <strike>punter</strike> surfer gets. You&#8217;re perfectly compliant with Google&#8217;s cloaking restrictions. There&#8217;s no need to bother with complicated stuff like iFrames or even disabled blog comments, forums or guestbooks.</p>
<p>Just feed the crawlers with all the crap the search engines require, then concentrate all your efforts on your UI for human visitors. Web robots (bots, crawlers, spiders, &#8230;) don&#8217;t supply your signup-forms w/ credit card details. Humans do. If you find the time to upsell them while search engines keep you busy with thoughtless change requests all day long.</p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span><div class="topsy_widget_data topsy_theme_light-green" style="float: right;margin-left: 0.75em;"><!-- { "url": "http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/", "style": "big", "title": "Vaporize yourself before Google burns your linking power" } --></div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/dear-google-please-vaporize-yourself-and-dont-bother-us-webmasters/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Dump your self-banning CMS</title>
		<link>http://sebastians-pamphlets.com/dump-your-search-engine-unfriendly-cms/</link>
		<comments>http://sebastians-pamphlets.com/dump-your-search-engine-unfriendly-cms/#comments</comments>
		<pubDate>Mon, 11 May 2009 19:11:02 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Blogging]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/dump-your-search-engine-unfriendly-cms/</guid>
		<description><![CDATA[
When it comes to cluelessness [silliness, idiocy, stupidity &#8230; you name it], you can&#8217;t beat CMS developers. You really can&#8217;t. There&#8217;s literally no way to kill search engine traffic that the average content management system (CMS) developer doesn&#8217;t implement. Poor publishers, probably you suffer from the top 10 issues on my shitlist. Sigh.
Imagine you&#8217;re the [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/dogshit-200x290.gif" width="200" height="190" align="right" style="margin-left:4px;" alt="CMS developer's output: unusable dogshit" title="Tag your CMS 'DOGSHIT'" />When it comes to <a href="http://www.thinkgeek.com/homeoffice/posters/370b/">cluelessness</a> [silliness, idiocy, stupidity &#8230; you name it], you can&#8217;t beat CMS developers. You really can&#8217;t. There&#8217;s literally no way to kill search engine traffic that the average content management system (CMS) developer doesn&#8217;t implement. Poor publishers, probably you suffer from the top 10 issues on my shitlist. Sigh.</p>
<p>Imagine you&#8217;re the proud owner of a Web site that lets logged-in users customize the look &amp; feel and whatnot. Here&#8217;s how your CMS does the trick:</p>
<h3>Unusable user interface</h3>
<p>The user control panel offers a gazillion settings that can overwrite each and every CSS property out there. To keep the user-cp pages lean and fast loading, the properties are spread over 550 pages with 10 attributes each, all with very comfortable Previous|Next-Page navigation. Even when the user has chosen a predefined template, the CMS saves each property in the <code>user</code> table. Of course that&#8217;s necessary because the site admin could change a template in use.</p>
<h3>Amateurish database design </h3>
<p>Not only for this purpose, each <code>user</code> tuple comes with 512 mandatory attributes. Unfortunately, the underlying database doesn&#8217;t handle tables with more than 512 columns, so the overflow gets stored in an array, using the large text column #512.</p>
<h3>Cookie hell</h3>
<p>Since every database access is expensive, the login procedure creates a persistent cookie (today + 365 * 30) for each user property. Dynamic and user specific external CSS files as well as style-sheets served in the HEAD section could fail to apply, so all CMS scripts use a routine that converts the user settings into inline style directives like <code>style="color:red; text-align:bolder; text-decoration:none; ..."</code>. The developer consults the <a href="http://www.w3.org/TR/CSS2/propidx.html">W3C CSS guidelines</a> to make sure that not a single CSS property is left out.</p>
<h3>Excessive query strings</h3>
<p>Actually, not all user agents handle cookies properly. Especially cached pages clicked from SERPs load with a rather weird design. The same goes for standards compliant browsers.  Seems to depend on the user agent string, so the developer adds an <code>if ($well_behaving_user_agent_string &lt;&gt; $HTTP_USER_AGENT) then [read the user record and add each property as GET variable to the URI&#8217;s querystring]</code> sanity check. Of course the <code>$well_behaving_user_agent_string</code> variable gets populated with a constant containing the developer&#8217;s ancient IE user agent, and the GET inputs overwrite the values gathered from cookies.</p>
<h3>Even more sanitizing</h3>
<p>Some unhappy campers still claim that the CMS ignores some user properties, so the developer adds a routine that reads the <code>user</code> table and populates all variables that previously were filled from GET inputs overwriting cookie inputs. All clients are happy now.</p>
<h3>Covering robots</h3>
<p>&#8220;Cached copy&#8221; links from SERPs still produce weird pages. The developer stumbles upon my blog and adds <a href="http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#grab-php-code-check-crawler">crawler detection</a>. S/he creates a tuple for each known search engine crawler in the <code>user</code> table of her/his local database and codes <code>if ($isSpider) then [select * from user where user.usrName = $spiderName, populating the current script's CSS property variables from the requesting crawler's user settings]</code>. Testing the rendering with a <a href="http://prefbar.mozdev.org/">user agent faker</a> gives fine results: bug fixed. To make sure that all user agents get a nice page, the developer sets the output default to &#8220;printer&#8221;, which produces a printable page ignoring all user settings that assign <code>style="display:none;"</code> to superfluous HTML elements.</p>
<h3>Results</h3>
<p>Users are happy, they don&#8217;t spot the code bloat. But search engine crawlers do. They sneakily request a few pages as a crawler, <em>and</em> as a browser. Comparing the results they find the &#8220;poor&#8221; pages delivered to the feigned browser way too different from the &#8220;rich&#8221; pages serving as crawler fodder. The domain gets banned for poor-man&#8217;s-cloaking (as if cloaking in general could be a bad thing, but that&#8217;s a completely different story). The publisher spots decreasing search engine traffic and wonders why. No help available from the CMS vendor. Must be unintentionally deceptive SEO copywriting or so. <strong>Crap.</strong> That&#8217;s self-banning by software design.</p>
<p><small>Ok, before you read on: get a <a href="http://www.last.fm/music/Free/_/All+Right+Now">calming tune</a>.</small></p>
<h3>How can I detect a shitty CMS?</h3>
<p>Well, you can&#8217;t, at least not as a non-geeky publisher. Not really. Of course you can check the &#8220;cached copy&#8221; links from your SERPs all night long. If they show way too different results compared to your browser&#8217;s rendering you&#8217;re at risk. You can look at your browser&#8217;s address bar to check your URIs for overlong query strings, and if you can&#8217;t find the end of the URI perhaps you&#8217;re toast, search engine wise. You can download <a href="https://addons.mozilla.org/en-US/firefox/addon/60">tools</a> to check a page&#8217;s cookies, then if there are more than 50 you&#8217;re potentially search-engine-dead. Probably you can&#8217;t do a code review yourself coz you can&#8217;t read source code natively, and your CMS vendor has delivered spaghetti code. Also, as a publisher, you can&#8217;t tell whether your crappy rankings depend on shitty code or on your skills as a copywriter. When you ask your CMS vendor, usually the search engine algo is faulty (especially Google, Yahoo, Microsoft and Ask) but some exotic search engine from Togo or so sets the standards for state of the art search engine technology. </p>
<p>Last but not least, as a non-search-geek challenged by Web development techniques you won&#8217;t recognize most of the laughable &#8211;but very common&#8211; mistakes outlined above. Actually, most savvy developers will not be able to create a complete shitlist from my scenario. Also, there are tons of other common CMS issues that result in different crawlability issues - each as bad as this one, or even worse.</p>
<p>Now <strong>what <em>can</em> you do</strong>? Well, my best advice is: don&#8217;t click on Google ads titled &#8220;CMS&#8221;, and don&#8217;t look at prices. The cheapest CMS will cost you the most at the end of the day. And if your budget exceeds a grand or two, then please hire an experienced search engine optimizer (SEO) or search savvy Web developer before you implement a CMS.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/dump-your-search-engine-unfriendly-cms/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Nofollow still means don&#8217;t follow, and how to instruct Google to crawl nofollow&#8217;ed links nevertheless</title>
		<link>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/</link>
		<comments>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/#comments</comments>
		<pubDate>Sat, 23 Feb 2008 14:51:14 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Testing]]></category>

		<category><![CDATA[Anchor Text]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/</guid>
		<description><![CDATA[
What was meant as a quick test of rel-nofollow once again (inspired by Michelle&#8217;s post stating that nofollow&#8217;ed comment author links result in rankings), turned out to some interesting observations:

Google uses sneaky JavaScript links (that mask nofollow&#8217;ed static links) for discovery crawling, and indexes the link destinations despite there&#8217;s no hard coded link on any [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/painting-nofollow-dofollow.png" width="250" height="220" align="right" alt="painting a nofollow'ed link dofollow" style="margin-left:4px;" title="How to paint a nofollow'ed link dofollow" />What was meant as a quick test of <a href="http://sebastians-pamphlets.com/links/categories/&amp;cat=nofollow">rel-nofollow</a> once again (inspired by <a href="http://www.michellemacphearson.com/do-nofollow-links-count-redux/">Michelle&#8217;s post</a> stating that nofollow&#8217;ed comment author links result in rankings), turned up some interesting observations:</p>
<ul>
<li>Google uses sneaky JavaScript links (that mask nofollow&#8217;ed static links) for discovery crawling, and indexes the link destinations even though there&#8217;s no hard coded link on any page on the whole Web.</li>
<li>Google doesn&#8217;t crawl URIs found only in nofollow&#8217;ed links.</li>
<li>Google most probably doesn&#8217;t use anchor text outputted client sided in rankings for the page that carries the JavaScript link.</li>
<li>Google most probably doesn&#8217;t pass anchor text of JavaScript links to the link destination.</li>
<li>Google doesn&#8217;t pass anchor text of (hard coded) nofollow&#8217;ed links to the link destination.</li>
</ul>
<p>As for my inspiration, I guess not all links in Michelle&#8217;s test were truly nofollow&#8217;ed. However, she&#8217;s spot on stating that condomized author links aren&#8217;t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don&#8217;t pay too much attention to REL attributes when you spread your links.</p>
<p>As for my quick test explained below, please consider it an inspiration too. It&#8217;s not a full blown SEO test, because I&#8217;ve checked one single scenario for a short period of time. However, looking only at the results from the first 24 hours after uploading the test makes it fairly certain that they aren&#8217;t influenced by external noise, for example scraped links and such stuff.</p>
<p>On 2008-02-22 06:20:00 I&#8217;ve put a new nofollow&#8217;ed link onto my sidebar: <a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-2-a" rel="nofollow"><span id="repstuff-something-2-b">Zilchish Crap</span></a> <script type="text/javascript"> handle=document.getElementById("repstuff-something-2-b"); handle.firstChild.data="Nillified, Nil"; handle=document.getElementById("repstuff-something-2-a"); handle.href="http://sebastians-pamphlets.com/repstuff/something.php?nil=js1"; handle.rel="dofollow"; </script><code><small><br />
&lt;a href=&quot;http://sebastians-pamphlets.com/repstuff/something.php&quot; id=&quot;repstuff-something-a&quot; rel=&quot;nofollow&quot;&gt;&lt;span id=&quot;repstuff-something-b&quot;&gt;Zilchish Crap&lt;/span&gt;&lt;/a&gt;<br />
&lt;script type=&quot;text/javascript&quot;&gt;<br />
handle=document.getElementById(&#39;repstuff-something-b&#39;);<br />
handle.firstChild.data=&#39;Nillified, Nil&#39;;<br />
handle=document.getElementById(&#39;repstuff-something-a&#39;);<br />
handle.href=&#39;http://sebastians-pamphlets.com/repstuff/something.php?nil=js1&#39;;<br />
handle.rel=&#39;dofollow&#39;;<br />
&lt;/script&gt; </small></code><br />
(The JavaScript code changes the link&#8217;s HREF, REL and anchor text.)</p>
<p>The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow&#8217;ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.</p>
<h3>Google crawls URIs extracted from somewhat sneaky JavaScript code</h3>
<p>Less than half an hour later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element&#8217;s HREF: <code><br />
66.249.72.5 	2008-02-22 06:47:07 	200-OK 	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 	/repstuff/something.php?nil=js1</code></p>
<p>Roughly three hours after this visit Googlebot fetched a URI provided only in JS code on the test page: <code><small><br />
handle=document.getElementById(&#39;a1&#39;);<br />
handle.href=&#39;http://sebastians-pamphlets.com/repstuff/something.php?nil=js2&#39;;<br />
handle.rel=&#39;dofollow&#39;; </small></code><br />
From the log: <code><br />
66.249.72.5 	2008-02-22 09:37:11 	200-OK 	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 	/repstuff/something.php?nil=js2</code></p>
<p>So far Google has ignored the hidden JavaScript link to <code>/repstuff/something.php?nil=js3</code> on the test page. Its code doesn&#8217;t change a static link, so that makes sense in the context of repeated statements like &#8220;Google ignores JavaScript links / treats them like nofollow&#8217;ed links&#8221; by Google reps.</p>
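<p>Separating crawler requests from human ones by query string makes the log analysis nearly mechanical. The following Node.js sketch shows one way to do it; the tab-separated field order (IP, timestamp, status, user agent, path) is inferred from the log excerpts quoted above, and the function names are hypothetical. It only checks what the client claims to be, so a faked user agent would slip through.</p>

```javascript
// Field order (IP, timestamp, status, user agent, path) is an assumption
// inferred from the log excerpts in the post; adjust it for your own logs.
function parseLogLine(line) {
  const fields = line.split("\t").map((field) => field.trim());
  if (fields.length !== 5) return null; // not a line in the assumed format
  const [ip, timestamp, status, userAgent, path] = fields;
  return { ip, timestamp, status, userAgent, path };
}

// Collect the requests for test URIs made by clients claiming to be Googlebot.
// marker is the query string fragment that identifies the test URIs.
function googlebotTestHits(lines, marker) {
  const hits = [];
  for (const line of lines) {
    const entry = parseLogLine(line);
    if (entry === null) continue;
    if (entry.userAgent.includes("Googlebot") && entry.path.includes(marker)) {
      hits.push({ timestamp: entry.timestamp, path: entry.path });
    }
  }
  return hits;
}
```

<p>Feeding it the log lines and the marker <code>?nil=js</code> yields the timestamps and paths of the crawler fetches only, which is all this test needs.</p>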
<p class="excursus">Of course the JS code above is easy to analyze, but don&#8217;t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!</p>
<h3>Google indexes pages that have only JavaScript links pointing to them</h3>
<p>The next day I checked the search index, and the <a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=zilchish%7Cnillyfiable+site%3Asebastians-pamphlets.com">results</a> are interesting:</p>
<p><img src="http://sebastians-pamphlets.com/img/google/nofollow-zilchish-nullifable-google-serp-24h.png" width="498" height="421" alt="rel-nofollow-test search results" title="Google indexes JS manipulated anchor text and content referenced only in JS links" /></p>
<p>The first search result is the content of the URI with the query string parameter <code>?nil=js1</code>, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI <code>/repstuff/something.php</code> without query string. There&#8217;s not a single real link to this URI elsewhere.</p>
<p>The second search result is a post URI where Google recognized the hard coded anchor text &#8220;zilchish crap&#8221;, but not the JS code that overwrites it with &#8220;Nillified, Nil&#8221;. With the SERP-URI parameter &#8220;&amp;filter=0&#8221; Google shows more posts that are findable with the search term [zilchish]. (Hey <a href="http://mattcutts.com/blog/">Matt</a> and <a href="http://brianwhite.org/">Brian</a>, here&#8217;s room for improvement!)</p>
<h3>Google doesn&#8217;t pass anchor text of nofollow&#8217;ed links to the link destination</h3>
<p>A search for [<a href="http://www.google.com/search?q=zilchish+site:sebastians-pamphlets.com&#038;num=100&#038;hl=en&#038;filter=0&#038;safe=off">zilchish site:sebastians-pamphlets.com</a>] doesn&#8217;t show the test page, which doesn&#8217;t carry this term itself. In other words, so far the anchor text &#8220;zilchish crap&#8221; of the nofollow&#8217;ed sidebar link hasn&#8217;t impacted the test page&#8217;s rankings. </p>
<h3>Google doesn&#8217;t treat anchor text of JavaScript links as textual content</h3>
<p>A search for [<a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=nillified+site%3Asebastians-pamphlets.com">nillified site:sebastians-pamphlets.com</a>] doesn&#8217;t show any URIs that have &#8220;Nillified, Nil&#8221; as client sided anchor text on the sidebar, just the test page:</p>
<p><img src="http://sebastians-pamphlets.com/img/google/nofollow-nillified-google-serp-24h.png" width="498" height="277" alt="rel-nofollow-test search results" title="Google indexes content from JS manipulated URIs" /></p>
<h3>Results, conclusions, speculation</h3>
<p>This test wasn&#8217;t intended to evaluate whether JS outputted anchor text gets passed to the link destination or not. Unfortunately &#8220;nil&#8221; and &#8220;nillified&#8221; appear both in the JS anchor text as well as on the page, so that&#8217;s for another post. However, it seems the JS anchor text isn&#8217;t indexed for the pages carrying the JS code, at least they don&#8217;t appear in search results for the JS anchor text, so most likely it will not be assigned to the link destination&#8217;s relevancy for &#8220;nil&#8221; or &#8220;nillified&#8221; as well. </p>
<p>Maybe Google&#8217;s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows. </p>
<p>However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I&#8217;m quite confident that condomized links still don&#8217;t pass reputation, nor topical relevance. My test page is unfindable for the nofollow&#8217;ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.</p>
<h3>How to safely strip a <a href="http://link-condom.com/">link condom</a></h3>
<p><b>And what&#8217;s the actual &#8220;news&#8221;?</b> Well, say you have links that you must condomize because they&#8217;re paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow&#8217;ed link server-side, and change it to a clean link with JavaScript. Google told us for ages that JS links don&#8217;t count, so that&#8217;s perfectly in line with Google&#8217;s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don&#8217;t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client sided version of the link. Isn&#8217;t that neat?</p>
<h3>Link condoms <strike>with juicy taste</strike> faking good karma</h3>
<p>Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they <i style="border: medium dotted firebrick; color:navy; background:pink;">look plain ugly</i> otherwise.</p>
<p>Here is how you can do this for a complete Web page. <a href="http://example.com/" rel="nofollow example" title="Nofollow'ed and unclickable link example, use 'view source' to check it out" onclick="return false;">This link is nofollow&#8217;ed</a>. The JavaScript code below changed its REL value to &#8220;dofollow&#8221;. When you put this code <em>at the bottom of your pages</em>, it will un-condomize all your nofollow&#8217;ed links. <code><br />
&lt;script type=&quot;text/javascript&quot;&gt;<br />
    if (document.getElementsByTagName) {<br />
        var aElements = document.getElementsByTagName(&quot;a&quot;);<br />
        for (var i=0; i&lt;aElements.length; i++) {<br />
            var relvalue = aElements[i].rel.toUpperCase();<br />
            if (relvalue.indexOf(&quot;NOFOLLOW&quot;) !== -1) {<br />
                aElements[i].rel = &quot;dofollow&quot;;<br />
            }<br />
        }<br />
    }<br />
&lt;/script&gt;   </code></p>
<p><script type="text/javascript">
    if (document.getElementsByTagName) {
        var aelements = document.getElementsByTagName("a");
        for (var i=0; i<aelements.length; i++) {
            var relvalue = aelements[i].rel.toUpperCase();
            if (relvalue.indexOf("NOFOLLOW") !== -1) {
                aelements[i].rel = "dofollow";
            }
        }
    }
</script></p>
<p>(You&#8217;ll find still condomized links on this page. That&#8217;s because the JavaScript routine above changes only links placed above it.)</p>
<p>When you add JavaScript routines like that to your pages, you&#8217;ll increase their page loading time. IOW you slow them down. Also, you should add a note to your <a href="http://sebastians-pamphlets.com/links/full-disclosure/">linking policy</a> to avoid confused advertisers who chase toolbar PageRank.</p>
<p><b>Updates:</b> Obviously Google distrusts me, how come? Four days after the link discovery the <abbr title="Googlebot coming from another IP">search quality archangel</abbr> requested the nofollow&#8217;ed URI &#8211;without query string&#8211; possibly to check whether I serve different stuff to bots and people. As if I&#8217;d cloak, laughable. (Or an assclown linked the URI without condom.)<br />
Day five: Google&#8217;s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren&#8217;t interested in client-sided links at all?</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Update your crawler detection: MSN/Live Search announces msnbot/1.1</title>
		<link>http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/</link>
		<comments>http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/#comments</comments>
		<pubDate>Tue, 12 Feb 2008 18:41:28 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/</guid>
		<description><![CDATA[
Fabrice Canel from Live Search announces significant improvements of their crawler today. The very much appreciated changes are:

HTTP compression

The revised msnbot supports gzip and deflate as defined by RFC 2616 (sections 14.11 and 14.39). Microsoft also provides a tool to check your server&#8217;s compression / conditional GET support. (Bear in mind that most dynamic pages [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/msnbot-1-1.png" width="250" height="180" align="right" alt="msnbot/1.1" style="margin-left:4px;" title="MSNBOT/1.1" />Fabrice Canel from <a href="http://blogs.msdn.com/webmaster/archive/2008/02/12/announcing-crawling-improvements-for-live-search.aspx">Live Search announces significant improvements of their crawler</a> today. The very much appreciated changes are:</p>
<dl>
<dt>HTTP compression</dt>
<dd>
<p>The revised msnbot supports <b>gzip</b> and <b>deflate</b> as defined by <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html">RFC 2616</a> (sections 14.11 and 14.39). Microsoft also provides a <a href="http://go.microsoft.com/?linkid=8272590">tool to check your server&#8217;s compression / conditional GET support</a>. (Bear in mind that most dynamic pages (blogs, forums, &#8230;) will fool such <a href="http://www.microsoft.com/search/Tools/">tools</a>, try it with a static page or use your robots.txt.)</p>
</dd>
<dt>No more crawling of unchanged contents</dt>
<dd>
<p>The new msnbot/1.1 will not fetch pages that haven&#8217;t changed since the last request, as long as the Web server supports the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25">&#8220;If-Modified-Since&#8221; header</a> in conditional GET requests. If a page hasn&#8217;t changed since the last crawl, the server responds with 304 (Not Modified) and the crawler moves on. In this case your Web server exchanges only a handful of short lines of text with the crawler, not the contents of the requested resource.</p>
<p>If your server isn&#8217;t configured for HTTP compression and conditional GETs, you really should ask your hosting service to enable both, for the sake of your bandwidth bills.</p>
</dd>
<dt>New user agent name</dt>
<dd>
<p>From reading server log files we know the Live Search bot as &#8220;msnbot/1.0 (+http://search.msn.com/msnbot.htm)&#8221;, or &#8220;msnbot-media/1.0&#8221;, &#8220;msnbot-products/1.0&#8221;, and &#8220;msnbot-news/1.0&#8221;. From now on you&#8217;ll see &#8220;<b>msnbot/1.1</b>&#8221;. Nathan Buggia from Live Search clarifies: &#8220;<b>This update does not apply to all the other &#8216;msnbot-*&#8217; crawlers, just the main msnbot</b>. We will be updating those bots in the future&#8221;.</p>
<p>If you just check the user agent string for &#8220;msnbot&#8221;, you have nothing to change. Otherwise you should check the user agent string for both &#8220;msnbot/1.0&#8221; and &#8220;msnbot/1.1&#8221; before you do the reverse DNS lookup to identify bogus bots. MSN will not change the host name &#8220;.search.live.com&#8221; used by the crawling engine.</p>
<p>The announcement didn&#8217;t tell us whether the new bot will utilize HTTP/1.1 or not (MS and Yahoo crawlers, like other Web robots, still perform, or fake, HTTP/1.0 requests).</p>
</dd>
</dl>
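<p>The crawler check described above (user agent string first, then the reverse DNS lookup) could look roughly like this. It&#8217;s a minimal sketch, not anything Live Search has published: the function names are made up for illustration, and the forward lookup that confirms the reverse result is an extra safety step I&#8217;m assuming you want, since anybody can fake a reverse DNS entry for their own IP addresses.</p>

```python
import socket

# Host name suffix mentioned in the post; genuine crawler hosts end with it.
CRAWLER_SUFFIX = ".search.live.com"


def claims_msnbot(user_agent):
    """True if the user agent string claims to be the main msnbot (1.0 or 1.1)."""
    ua = user_agent.lower()
    return "msnbot/1.0" in ua or "msnbot/1.1" in ua


def is_genuine_msnbot(ip, user_agent,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Check the user agent, then verify the IP via reverse and forward DNS.

    The resolver functions are parameters (hypothetical design choice) so
    the check can be tested without touching the network.
    """
    if not claims_msnbot(user_agent):
        return False
    try:
        # Reverse lookup: IP address to host name.
        host = reverse(ip)[0]
        if not host.lower().endswith(CRAWLER_SUFFIX):
            return False
        # Forward confirmation (an assumption, not part of the announcement):
        # the host name must resolve back to the requesting IP address.
        return ip in forward(host)[2]
    except OSError:
        # Lookup failures count as "not verified".
        return False
```

<p>Since DNS lookups are slow, you&#8217;d probably cache the result per IP address instead of running this on every request.</p>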
<p>It looks like it&#8217;s no longer necessary to <a href="http://searchengineland.com/080207-174632.php">charge Live Search for bandwidth their crawler has burned</a>. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  Jokes aside, instead of reporting crawler issues to msnbot@microsoft.com, you can post your questions or concerns at a forum dedicated to <a href="http://forums.microsoft.com/webmaster/ShowForum.aspx?ForumID=1984&#038;SiteID=79">MSN crawler feedback and discussions</a>.</p>
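<p>For dynamic pages that your Web server can&#8217;t handle on its own, the conditional GET decision mentioned above boils down to a date comparison. Here is a rough sketch; the function name is hypothetical, and it assumes your script knows a timezone-aware last-modified timestamp for the requested page.</p>

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the crawler's copy, else 200.

    if_modified_since: raw value of the If-Modified-Since request header,
    or None if the header is absent.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparsable header: play it safe and serve the full page.
        return 200
    if since.tzinfo is None:
        # Treat a date without usable zone information as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop the microseconds.
    if since >= last_modified.replace(microsecond=0):
        return 304
    return 200
```

<p>With a 304 you send the status line and headers only, no body.</p>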
<p>I&#8217;m quite nosy, so I just had to investigate what &#8220;there are many more improvements&#8221; in the blog post meant. I&#8217;ve asked <a href="http://nathanbuggia.com/">Nathan Buggia</a> from Microsoft a few questions. </p>
<p class="question">Nate, thanks for the opportunity to <em>talk crawling</em>&nbsp; with you. Can you please reveal a few msnbot/1.1 secrets? <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p class="answer">I&#8217;m glad you&#8217;re interested in our update, but we&#8217;re not yet ready to provide more details about additional improvements. However, there are several more that we&#8217;ll be shipping in the next couple months.</p>
<p class="question">Fair enough. So let&#8217;s talk about related topics.</p>
<p class="question">Currently I can set crawler directives for file types identified by their extensions in my robots.txt&#8217;s msnbot section. Will you fully support wildcards (* and $ for all URI components, that is path and query string) in robots.txt in the foreseeable future?</p>
<p class="answer">This is one of several additional improvements that we are looking at today, however it has not been released in the current version of MSNBot. In this update we were squarely focused on reducing the burden of MSNBot on your site.</p>
<p class="question">What can or should a Webmaster do when you seem to crawl a site way too fast, or not fast enough? Do you plan to provide a tool to reduce the server load, or to speed up your crawling, for particular sites?</p>
<p class="answer">We currently support the &#8220;<a href="http://search.live.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm&#038;FORM=WFDD#D">crawl-delay</a>&#8221; option in the robots.txt file for webmasters that would like to slow down our crawling. We do not currently support an option to increase crawling frequency, but that is also a feature we are considering.</p>
<p class="question">Will msnbot/1.1 extract URLs from client-side scripts for discovery crawling? If so, will such links pass reputation?</p>
<p class="answer">Currently we do not extract URLs from client-side scripts.</p>
<p class="question">Google&#8217;s last change of their infrastructure made nofollow&#8217;ed links completely worthless, because they no longer used those in their discovery crawling. Did you change your handling of links with a &#8220;nofollow&#8221; value in the REL attribute with this upgrade too?</p>
<p class="answer">No, changes to how we process nofollow links were not part of this update.</p>
<p class="question">Nate, many thanks for your time and your interesting answers! </p>
<p><b>Related posts:</b></p>
<ul>
<li><a href="http://blogs.msdn.com/webmaster/archive/2008/02/12/announcing-crawling-improvements-for-live-search.aspx">Official announcement</a> - by <a href="http://nathanbuggia.com/">Nathan Buggia</a>, Live Search Webmaster Center Blog</li>
<li><a href="http://searchengineland.com/080212-160910.php">MSNbot 1.1: Live Search Implements A More Efficient Crawl</a> - by <a href="http://vanessafoxnude.com/">Vanessa Fox</a>, Search Engine Land</li>
</ul>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br />
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MSN spam to continue says the Live Search Blog</title>
		<link>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/</link>
		<comments>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/#comments</comments>
		<pubDate>Wed, 05 Dec 2007 08:58:46 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Spoofing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Cloaking]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/</guid>
		<description><![CDATA[
It seems MSN/LiveSearch has tweaked their rogue bots and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and news.
Since August 2007 MSN has been running a bogus bot that follows their crawler and fakes a human visitor coming from a search results page. This spambot [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/msn-live-search-clueless-webspam-detection.png" width="250" height="352" style="margin-left:4px;" align="right" alt="MSN Live Search clueless webspam detection" title="MSN Live Search is totally clueless when it comes to spam detection"  />It seems MSN/LiveSearch has tweaked their <a href="http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/">rogue bots</a> and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and <a href="http://blogs.msdn.com/webmaster/archive/2007/12/04/live-search-and-cloaking-detection.aspx">news</a>.</p>
<p>Since August 2007 MSN has been running a bogus bot that follows their crawler and fakes a human visitor coming from a search results page. This spambot downloads everything from a page, that is, images and other objects, external CSS/JS files, and ad blocks, rendering even contextual advertising from Google and Yahoo. It fakes MSN SERP referrers, diluting the search term stats with generic and unrelated keywords. Webmasters running non-adult sites wondered why a database tutorial suddenly ranked for [oral sex] and why MSN sent visitors searching for [<acronym title="Mothers I Like (to) Fuck">MILF</acronym> pix] to a teenager&#8217;s diary. Webmasters assumed that MSN was after deceitful cloaking, and laughed out loud because their webspam detection method was that primitive and easy to fool.</p>
<p>Now MSN admits <a href="http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/">all their sins</a> &#8211;except the launch of a porn affiliate program&#8211; and posted a <a href="http://blogs.msdn.com/webmaster/archive/2007/12/04/live-search-and-cloaking-detection.aspx">vague excuse on their Webmaster Blog</a> telling the world that they discovered the evil cloakers and their index is somewhat spam free now. <a href="http://www.seo-scoop.com/2007/12/04/msnlive-ponies-up-about-the-referrer-spam/">Donna has chatted with the MSN spam team about their spambot</a> and reports that blocking its IP addresses is a bad idea, even for sites that don&#8217;t cloak. <a href="http://www.vanessafoxnude.com/">Vanessa Fox</a> summarized MSN&#8217;s poor man&#8217;s cloaking detection at <a href="http://searchengineland.com/071204-150233.php">Search Engine Land</a>:</p>
<blockquote><p>And one has to wonder how effective methods like this really are. Those savvy enough to cloak may be able to cloak for this new cloaker detection bot as well.</p>
</blockquote>
<p>They say that they no longer spam sites that don&#8217;t cloak, but contradict that statement, telling Donna</p>
<blockquote><p>we need to be able to identify the legitimate and illegitimate content</p>
</blockquote>
<p>and Vanessa </p>
<blockquote><p>sites that are cloaking may continue to see some amount of traffic from this bot. This tool crawls sites throughout the web &#8212; both those that cloak and those that don&#8217;t &#8212; but those not found to be cloaking won&#8217;t continue to see traffic.</p>
</blockquote>
<p>Here is an excerpt from yesterday&#8217;s referrer log of a site that does not cloak, and never did: <code><br />
http://search.live.com/results.aspx?q=webmaster&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=smart&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=search&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=progress&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=google&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=google&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=domain&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=database&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=content&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=business&#038;mrt=en-us&#038;FORM=LIVSOP</code><br />
Why can&#8217;t the MSN dudes tell the truth, not even when they apologize?</p>
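<p>If you want to keep such hits out of your keyword stats, a filter along these lines might do. It&#8217;s only a sketch under a loud assumption: that the FORM=LIVSOP parameter seen in the excerpt above marks the spambot&#8217;s referrers. MSN hasn&#8217;t documented that, genuine visitors may carry the same value, and the function name is made up.</p>

```python
from urllib.parse import parse_qs, urlsplit


def is_suspect_live_referrer(referrer):
    """Heuristic: flag Live Search referrers carrying FORM=LIVSOP.

    Assumption taken from the log excerpt, not from any MSN documentation.
    """
    parts = urlsplit(referrer)
    if parts.netloc.lower() != "search.live.com":
        return False
    # parse_qs maps each query parameter to a list of its values.
    return "LIVSOP" in parse_qs(parts.query).get("FORM", [])
```

<p>Treat a match as a hint rather than proof, and combine it with other signals (IP address, single-word generic queries) before you drop a hit from your reports.</p>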
<p>Another lie is &#8220;we obey robots.txt&#8221;. Of course the spambot doesn&#8217;t request robots.txt, because that lets it bypass bot traps, but according to MSN it uses a copy served to the LiveSearch crawler &#8220;msnbot&#8221;:</p>
<blockquote><p>Yes, this robot does follow the robots.txt file. The reason you don’t see it download it, is that we use a fresh copy from our index. The tool does respect the robots.txt the same way that MSNBot does with a caveat; the tool behaves like a browser and some files that a crawler would ignore will be viewed just like real user would.</p>
</blockquote>
<p>In reality, it doesn&#8217;t help to block CSS/JS files or images in robots.txt, because MSN&#8217;s spambot will download them anyway. The long-winded statement above translates to &#8220;We promise to obey robots.txt, but if it fits our needs we&#8217;ll ignore it&#8221;. </p>
<p>Well, MSN is not the only search engine running <a href="http://www.webshoppehosting.com/weblog/?p=17">stealthy bots</a> to detect cloaking, but they aren&#8217;t clever enough to do it in a less abusive and detectable way. </p>
<p>Their insane spambot led all cloaking specialists out there to their not-so-obvious spam detection methods. They may have caught a few cloaking sites, but considering the short life cycle of Webspam on throwaway domains they shot themselves in both feet. What they really have achieved is that the cloaking scripts are immune to MSN&#8217;s spam detection now. </p>
<p>Was it really necessary to annoy and defraud the whole Webmaster community and to burn huge amounts of bandwidth just to catch a few cloakers who launched new scripts on new throwaway domains hours after the first appearance of the MSN spambot?</p>
<p>Can cosmetic changes to their useless spam activities restore MSN&#8217;s lost reputation? I doubt it. They&#8217;ve admitted their miserable failure five months too late. Instead of dumping the spambot, they announce that they&#8217;ll spam away for the foreseeable future. How silly is that? I thought Microsoft was somewhat profit-oriented, so why do they burn their money and ours on such amateurish projects?</p>
<p>Besides all this crap, MSN has good news too. Microsoft Live Search told Search Engine Roundtable that <a href="http://www.seroundtable.com/archives/015534.html">they&#8217;ll spam our sites with keywords related to our content</a> from now on, or at least they&#8217;ll try. And they have a <a href="http://forums.microsoft.com/webmaster/ShowForum.aspx?ForumID=1984&#038;SiteID=79">forum</a> and a <a href="https://feedback.live.com/default.aspx?productkey=livesearchwebmastercenter&#038;mkt=en-us">contact form</a> to gather complaints. Crap on, so much bureaucratic effort to administer their ridiculous spam fighting funeral. They&#8217;d better build a search engine that actually sends human traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
