<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; SEO</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Wed, 11 Aug 2010 18:57:05 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Cloaking is good for you. Just ignore Bing&#8217;s/Google&#8217;s guidelines.</title>
		<link>http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/</link>
		<comments>http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/#comments</comments>
		<pubDate>Mon, 05 Jul 2010 18:24:08 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Usability]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/</guid>
		<description><![CDATA[
Summary first: If you feel the need to cloak, just do it within reason. Don&#8217;t cloak because you can, but because it&#8217;s technically the most elegant procedure to accomplish a Web development task. Bing and Google can&#8217;t detect your (in no way deceptive) intent algorithmically. Don&#8217;t spam away, though, because you might leave trails besides [...]]]></description>
			<content:encoded><![CDATA[
<p>Summary first: If you feel the need to cloak, just do it within reason. Don&#8217;t cloak because you can, but because it&#8217;s technically the most elegant procedure to accomplish a Web development task. Bing and Google can&#8217;t detect your (in no way deceptive) intent algorithmically. Don&#8217;t spam away, though, because you might leave trails besides cloaking alone, if you aren&#8217;t good enough at spamming search engines. Keep your users&#8217; interests in mind. Don&#8217;t treat search engine guidelines as set in stone, but comply with them to a reasonable level, for example when they <a href="http://www.youtube.com/watch?v=XWfqyy7J34s">force you to comply with Web standards</a> that make more sense than the fancy idea you&#8217;ve developed on internationalization, based on detecting browser language settings or so.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/penalizing-cloaking-is--bullshit.png" width="250" height="376" align="right" alt="search engine guidelines are bullshit WRT cloaking" title="Search engines must not penalize cloaking" style="margin-left:5px;" />This pamphlet is an opinion piece. The above should be considered best practice, even by search engines. Of course it&#8217;s not, because search engines can and do fail, just like a webmaster who takes my statement &#8220;go cloak away if it makes sense&#8221; as technical advice and gets his search engine visibility tanked the hard way.</p>
<h3>WTF is cloaking?</h3>
<p>Cloaking, also known as IP delivery, means delivering content tailored for specific users who are identified primarily by their IP addresses, but also by user agent (browser, crawler, screen reader&#8230;) names, and whatnot. Here&#8217;s a simple demonstration of this technique. The content of the next paragraph differs depending on the user requesting this page. Googlebot, Googlers, as well as Matt Cutts at work, will read a personalized message:</p>
<p><em>Dear visitor, thanks for your visit from 38.107.191.85 (38.107.191.85).</em></p>
<p>You surely can imagine that cloaking opens <del>a can of worms</del> <ins>lots of opportunities to enhance a user&#8217;s surfing experience</ins>, besides &#8220;stalking&#8221; particular users like Google&#8217;s head of WebSpam.</p>
<h3>Why do search engines dislike cloaking?</h3>
<p>Apparently they don&#8217;t. They use IP delivery themselves. When you&#8217;re traveling in Europe, you&#8217;ll get hints like &#8220;go to Google.fr&#8221; or &#8220;go to Google.at&#8221; all the time. That&#8217;s google.com checking where you are, trying to lure you into their regional services.</p>
<p>More seriously, there&#8217;s a so-called &#8220;dark side of cloaking&#8221;. Say you&#8217;re a <a href="http://fantomaster.com/fantomNews/archives/2010/07/08/fantomas-shadowmaker/">seasoned Internet marketer</a>, then you could show Googlebot an educational page with compelling content under a URI like &#8220;/games/poker&#8221; with an X-Robots-Tag HTTP header telling &#8220;noarchive&#8221;, whilst surfers (search engine users) supplying an HTTP_REFERER and not coming from employee.google.com get redirected to poker dot com (simplified example).</p>
<p>That&#8217;s hard to detect for Google&#8217;s WebSpam team. Because they don&#8217;t do evil themselves, they can&#8217;t officially operate sneaky bots that use for example AOL as their ISP to compare your spider fodder to pages/redirects served to actual users.</p>
<p>Bing sends out spam bots that request your pages &#8220;as a surfer&#8221; in order to discover deceptive cloaking. Of course those bots can be identified, so professional spammers serve them their spider fodder. Besides burning the bandwidth of non-cloaking sites, Bing doesn&#8217;t accomplish anything useful in terms of search quality.</p>
<p>Because search engines can&#8217;t detect cloaking properly, not to speak of a cloaking webmaster&#8217;s intentions, they&#8217;ve launched webmaster guidelines (FUD) that forbid cloaking at all. All Google/Bing reps tell you that cloaking is an evil black hat tactic that will get your site penalized or even banned. By the way, the same goes for perfectly legit &#8220;hidden content&#8221; that&#8217;s invisible on page load, but viewable after a mouse click on a &#8220;learn more&#8221; widget/link or so.</p>
<h3>Bullshit.</h3>
<p>If your competitor makes creative use of IP delivery to enhance their visitors&#8217; surfing experience, you can file a spam report for cloaking and Google/Bing will ban the site eventually. Just because cloaking <em>can</em> be used with deceptive intent. And yes, it works this way. See below.</p>
<p>Actually, those spam reports trigger a review by a human, so maybe your competitor gets away with it. But search engines also use spam reports to develop spam filters that penalize crawled pages fully automatically. Such filters can fail, and &#8211;trust me&#8211; they do fail often. Once you must optimize your content delivery for particular users or user groups yourself, such a filter could tank your very own stuff by accident. So don&#8217;t snitch on your competitors, because tomorrow they&#8217;ll return the favor.</p>
<h3>Enforcing a &#8220;do not cloak&#8221; policy is evil</h3>
<p>At least Google&#8217;s WebSpam team comes with cojones. They&#8217;ve even <a href="http://searchengineland.com/google-adwords-help-cloaks-to-google-gets-banned-45541">banned their very own help pages</a> for &#8220;<a href="http://google.com/search?hl=en&#038;q=matt+cutts+cloaking&#038;num=13&#038;safe=off">cloaking</a>&#8221;, although those didn&#8217;t serve porn to minors searching for SpongeBob images with safe-search=on.</p>
<p>That&#8217;s over the top, because the help files of any Google product aren&#8217;t usable without a search facility. When I click &#8220;help&#8221; in any Google service like AdWords, I get blank pages, or links within the help system are broken because the destination pages were deindexed for cloaking. Plain evil, and counterproductive.</p>
<p>Just because Google&#8217;s help software doesn&#8217;t show ads and related links to Googlebot, those pages aren&#8217;t guilty of deceptive cloaking. Ms Googlebot won&#8217;t pull the plastic, so it makes no sense to serve her advertisements. Related links are context sensitive just like ads, so it makes no sense to persist them in Google&#8217;s crawling cache, or even in Google&#8217;s search index. Also, as a user I really don&#8217;t care whether Google has crawled the same heading I see on a help page or not, as long as I get directed to relevant content, that is a paragraph or more that answers my question.</p>
<p>When a search engine intentionally doesn&#8217;t deliver the very best search results, just because those pages violate an outdated and utterly useless policy that targets fraudulent tactics in a shape last seen in the previous century and doesn&#8217;t take into account how the Internet works today, I&#8217;m pissed.</p>
<p>Maybe that&#8217;s not bad at all when applied to Google products? Bullshit, again. The same happens to any other website that doesn&#8217;t fit Google&#8217;s weird idea of &#8220;serving the same content to users and crawlers&#8221;. I mean, as long as Google&#8217;s crawlers come from US IPs only, how can a US based webmaster serve the same content in German language to a user coming from Austria and to Googlebot, both requesting a URI like &#8220;/shipping-costs?lang=de&#8221; that has to be different for each user, because shipping a parcel to Germany costs $30.00 and a parcel of the same weight shipped to Vienna costs $40.00? Don&#8217;t tell me bothering a user with shipping fees for all regions in CH/AT/DE all on one page is a good idea, when I can reduce the information overload to just the one shipping fee that my user expects to see, followed by a link to a page that lists shipping costs for all European countries, or all countries where at least some folks might speak/understand German.</p>
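<p>To make the shipping example a bit more tangible, here&#8217;s a rough sketch in Python. It only illustrates the idea of showing one tailored fee plus a link to the full list. The two fees come from the paragraph above; the function name <code>shipping_info()</code>, the overview URI, and the assumption that the visitor&#8217;s country code was already derived from the IP address elsewhere are all made up for this example.</p>

```python
# Fees for DE and AT are taken from the example in the text.
# ALL_FEES_URI is a hypothetical overview page, not a real URI.
SHIPPING_FEES_USD = {"DE": "30.00", "AT": "40.00"}
ALL_FEES_URI = "/shipping-costs/all?lang=de"


def shipping_info(country_code):
    """Return one tailored shipping line for a visitor's country.

    country_code is assumed to come from an IP-to-country lookup
    that happens somewhere else (not shown here).
    """
    code = country_code.upper()
    fee = SHIPPING_FEES_USD.get(code)
    if fee is None:
        # Unknown region: point the visitor to the complete list.
        return "See " + ALL_FEES_URI + " for shipping costs."
    return "Shipping to " + code + ": $" + fee + " (all countries: " + ALL_FEES_URI + ")"


if __name__ == "__main__":
    print(shipping_info("AT"))
```

<p>A crawler coming from a US IP would get the fallback line with the link to the full list, which is exactly the kind of difference between user and crawler content that a blanket &#8220;don&#8217;t cloak&#8221; policy can&#8217;t distinguish from deception.</p>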
<p>Back to Google&#8217;s ban of its very own help pages that hid AdSense code from Googlebot. Of course Google wants to see what surfers see in order to deliver relevant search results, and that might include advertisements. However, surrounding ads don&#8217;t necessarily obfuscate the page&#8217;s content. Ads served instead of content do. So when Google wants to detect ad laden thin pages, they need to become smarter. Penalizing pages that don&#8217;t show ads to search engine crawlers is a bad idea for a search engine, because not showing ads to crawlers is a good idea, not only bandwidth-wise, for a webmaster.</p>
<p>Managing this dichotomy is the search engine&#8217;s job. They shouldn&#8217;t expect webmasters to help them solve their very own problems (maintaining search quality). In fact, bothering webmasters with policies that exist solely because search engine algos are fallible and incapable is plain evil. The same applies to instruments like rel-nofollow (launched to help Google devalue spammy links but backfiring enormously) or Google&#8217;s war on paid links (as if not each and every link on the whole Internet is paid/bartered for, somehow).</p>
<p>What do you think, should search engines ditch their way too restrictive &#8220;don&#8217;t cloak&#8221; policies? <a href="http://twitter.com/home?status=Hey+@Google+@Bing,+go+ditch+your+outdated+webmaster+guidelines!+http%3A%2F%2Fsebastians-pamphlets.com/cloaking-is-good-for-your-vistors/" target="twitter" title="Stop search engines that tyrannize webmasters!"><b>Click to vote:</b> <img src="http://sebastians-pamphlets.com/img/twitter-icon.gif" width="10" height="10" style="border:none;" alt="Stop search engines that tyrannize webmasters!"  /></a></p>
<p> </p>
<p><b>Update 2010-07-06:</b> Don&#8217;t miss out on Danny Sullivan&#8217;s &#8220;<strong>Google be fair!</strong>&#8221; appeal, posted today: <a href="http://searchengineland.com/why-google-should-ban-its-own-help-pages-45781">Why Google Should Ban Its Own Help Pages — But Also Shouldn’t</a></p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/cloaking-is-good-for-your-vistors/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Get yourself a smart robots.txt</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/</link>
		<comments>http://sebastians-pamphlets.com/smart-robots-txt/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 18:27:16 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/smart-robots-txt/</guid>
		<description><![CDATA[
Crawlers and other Web robots are the plague of today&#8217;s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don&#8217;t. Behaving or not, most bots just steal your content. You don&#8217;t appreciate that, so block them.
This pamphlet is about blocking behaving bots with a smart robots.txt file. [...]]]></description>
			<content:encoded><![CDATA[
<p><img  src="http://sebastians-pamphlets.com/img/posts/block-greedy-bots.png" width="200" height="261" align="right" style="margin-left:5px;" alt="greedy and aggressive web robots steal your content" title="Block greedy Web robots stealing your content, the smart way" />Crawlers and <a href="http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/" title="MJ12bot, dotbot, ...">other Web robots</a> are the <a href="http://sebastians-pamphlets.com/social-media-plague/">plague</a> of today&#8217;s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">REP</a>), others don&#8217;t. Behaving or not, most bots just steal your content. You don&#8217;t appreciate that, so <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-rogue-bot">block</a> them.</p>
<p>This pamphlet is about blocking behaving bots with a <a href="http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/" title="Read this old robots.txt pamphlet to recap the basics">smart robots.txt</a> file. I&#8217;ll show you how you can restrict crawling to bots operated by major search engines &#8211;that bring you nice traffic&#8211; while keeping the nasty (or useless, traffic-wise) bots out of the game.</p>
<p>The basic idea is that blocking all bots &#8211;with very few exceptions&#8211; makes more sense than maintaining a kind of <a href="http://www.robotstxt.org/db.html">Web robots who&#8217;s who</a> in your robots.txt file. You decide whether a bot, or rather the service it crawls for, does you any good or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold <b>Disallow: /</b>.</p>
<p>Of course that&#8217;s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That&#8217;s bullshit. You can&#8217;t handle a gazillion bots with a black list. </p>
<p>Even bots that respect the REP can harm your search engine rankings, or <a href="http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/">reveal sensitive information to your competitors</a>. Every minute a new bot turns up. You can&#8217;t manage all of them, and you can&#8217;t trust any (behaving) bot. Or, as the <a href="http://incredibill.blogspot.com/">master of bot control</a> <a href="http://sphinn.com/story/139361/#74924">explains</a>: &#8220;<b>That&#8217;s the only thing I&#8217;m concerned with: what do I get in return. If it&#8217;s nothing, it&#8217;s blocked</b>&#8221;. </p>
<p>Also, large robots.txt files handling tons of bots are error-prone. It&#8217;s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If, on the other hand, you verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don&#8217;t enlighten your competitors.</p>
<p id="smart-robots-txt-toc">If you&#8217;re a smart webmaster agreeing with this approach, here&#8217;s your ToDo-List: <br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/downloads/smart_robots_txt.zip">Grab the code</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-install">Install</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-customize">Customize</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-test">Test</a><br />&bull;&nbsp;<a href="http://sebastians-pamphlets.com/smart-robots-txt/#smart-robots-txt-launch">Implement</a>.<br />On error read further.</p>
<h3 id="smart-robots-txt-anatomy">The anatomy of a smart robots.txt</h3>
<p>Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you&#8217;re somewhat fucked. The code isn&#8217;t elegant. I&#8217;ve tried to keep it easy to understand even for noobs &#8212; at the expense of occasional lengthiness and redundancy.</p>
<h4 id="smart-robots-txt-install">Install</h4>
<p>First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that&#8217;s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you&#8217;d have to add a leading <code>&lt;?php ?&gt;</code> string to all of them. Hence you add this code snippet to your root&#8217;s .htaccess file:<code><br />
&lt;FilesMatch ^robots\.txt$&gt;<br />
SetHandler application/x-httpd-php<br />
&lt;/FilesMatch&gt;</code><br />
As long as you&#8217;re testing and customizing my script, make that <code>^smart_robots\.txt$</code>.</p>
<p>Next <a href="http://sebastians-pamphlets.com/downloads/smart_robots_txt.zip">grab the code</a> and extract it into your document root directory. <strong>Do not rename /smart_robots.txt to /robots.txt until you&#8217;ve customized the PHP code!</strong></p>
<p>For testing purposes you can use the logRequest() function. Probably it&#8217;s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don&#8217;t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.</p>
<h4 id="smart-robots-txt-customize">Customize</h4>
<p>Search the code for <code>#EDIT</code> and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write an isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, or rather in /robots.txt once you&#8217;ve launched your smart robots.txt.</p>
<p>Let&#8217;s look at <strong>/smart_robots.txt</strong>. First of all, it sets the canonical server name, change that to yours. After routing <em>robots.txt request logging</em> to a flat file (change that to a database table!) it includes /smart_robots_inc.php. </p>
<p>Next it sends some HTTP headers that you shouldn&#8217;t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn&#8217;t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.</p>
<p style="margin-left:15px;">As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its <a href="http://google.com/search?q=cache:sebastians-pamphlets.com/robots.txt">cached copy</a>. To test your own robots.txt with Googlebot, you can log in to <a href="https://www.google.com/webmasters/tools/">GWC</a> and fetch it as Googlebot. It&#8217;s a shame that the other search engines don&#8217;t provide a feature like that.</p>
<p>When you implement the <strong>whitelisted crawler</strong> method, you really should provide a contact page for crawling requests. So please change the &#8220;In order to gain permissions to crawl blocked site areas&#8230;&#8221; comment.</p>
<p>Next up are the search engine specific crawler directives. You put them as <code><br />if (isGooglebot()) {<br />
   $content .= &quot;<br />
User-agent: Googlebot<br />
Disallow:<br />
&#8230;<br />
\n\n&quot;;<br />
}</code><br />
If your URIs contain double quotes, escape them as <code>\"</code> in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)</p>
<p>Please note that you need to output at least one empty line before each <code>User-agent:</code> section.  Repeat that for each accepted crawler, before you output <code><br />
$content .= &quot;User-agent: *<br />
Disallow: /<br />
\n\n&quot;;   </code><br />
Every behaving Web robot that&#8217;s not whitelisted will bounce at the  <code>Disallow: /</code>. </p>
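<p>The whole idea boils down to a few lines. Here&#8217;s a minimal sketch in Python rather than the PHP of the downloadable script, so treat it as an illustration of the concept only. The dictionary <code>CRAWLER_SECTIONS</code>, the directives in it, and the function <code>build_robots_txt()</code> are made up for this example, and crawler verification is assumed to have happened before the function gets called.</p>

```python
# Hypothetical per-crawler directives; names and rules are placeholders.
CRAWLER_SECTIONS = {
    "googlebot": "User-agent: Googlebot\nDisallow:\n",
    "slurp": "User-agent: Slurp\nDisallow: /cgi-bin/\n",
}

# The catch-all section every other behaving bot runs into.
FALLBACK_SECTION = "User-agent: *\nDisallow: /\n"


def build_robots_txt(verified_crawler=None):
    """Return the robots.txt body for one request.

    verified_crawler is the key of a crawler that already passed
    verification, or None for everybody else.
    """
    sections = []
    section = CRAWLER_SECTIONS.get(verified_crawler)
    if section is not None:
        sections.append(section)
    # The fallback always comes last.
    sections.append(FALLBACK_SECTION)
    # Joining with a newline puts an empty line between sections.
    return "\n".join(sections)


if __name__ == "__main__":
    print(build_robots_txt("googlebot"))
```

<p>An unknown or unverified requestor gets nothing but the fallback section, so your competitors never see what you tell the whitelisted crawlers.</p>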
<p>Before <code>$content</code> is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all <a href="http://www.seoconsultants.com/tools/headers/">SEO tools</a> that spoof the user agent. Make sure that you do not output a single byte &#8211;for example leading whitespace, a debug message, or a #comment&#8211; before the <code>print $content;</code> statement.</p>
<p style="margin-left:15px;">Blocking rogue bots is important. If you discover a rogue bot &#8211;for example a scraper that pretends to be Googlebot&#8211; during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can&#8217;t access your content!</p>
<p>Bear in mind that each and every piece of content served from your site should implement <a href="http://sebastians-pamphlets.com/http-request-handler-with-integrated-short-uri-support/#sUri-impl-rogue-bot">rogue bot detection</a>, that&#8217;s doable even with non-HTML resources like images or PDFs.</p>
<p>Finally we deliver the user agent specific robots.txt and terminate the connection.</p>
<p>Now let&#8217;s look at <strong>/smart_robots_inc.php</strong>. Don&#8217;t fuck up the variable definitions and routines that populate them or deal with the requestor&#8217;s IP addy.</p>
<p>Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that&#8217;s maintained 24/7 and updated every 6 hours: <a href="http://fantomaster.com/fasvsspy01.html">here you go</a> (don&#8217;t use it for deceptive cloaking, promised?).</p>
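<p>What a database-backed replacement for the two dummies could look like is sketched below, in Python with SQLite instead of PHP, so it&#8217;s a concept demo and not a drop-in for the script. The table names, columns, and the function names <code>open_store()</code>, <code>block_ip()</code>, <code>is_blocked()</code>, and <code>log_request()</code> are assumptions made for this example.</p>

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the database and create the two hypothetical tables."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS blocked_ips "
        "(ip TEXT PRIMARY KEY, user_agent TEXT, blocked_at REAL)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS bot_requests "
        "(ip TEXT, user_agent TEXT, uri TEXT, status INTEGER, requested_at REAL)"
    )
    return db


def block_ip(db, ip, user_agent):
    """Remember a rogue bot's IP; blocking the same IP twice is harmless."""
    db.execute(
        "INSERT OR IGNORE INTO blocked_ips VALUES (?, ?, ?)",
        (ip, user_agent, time.time()),
    )
    db.commit()


def is_blocked(db, ip):
    """True if the IP was blocked earlier."""
    row = db.execute("SELECT 1 FROM blocked_ips WHERE ip = ?", (ip,)).fetchone()
    return row is not None


def log_request(db, ip, user_agent, uri, status):
    """Store one bot request together with the HTTP status it got."""
    db.execute(
        "INSERT INTO bot_requests VALUES (?, ?, ?, ?, ?)",
        (ip, user_agent, uri, status, time.time()),
    )
    db.commit()
```

<p>Every script that serves content can then call something like <code>is_blocked()</code> before it outputs a single byte.</p>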
<p>/smart_robots_inc.php comes with functions that detect <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=80553">Googlebot</a>, <a href="http://www.bing.com/community/blogs/search/archive/2006/11/29/search-robots-in-disguise.aspx">MSNbot</a>, and <a href="http://www.ysearchblog.com/2007/06/05/yahoo-search-crawler-slurp-has-a-new-address-and-signature-card/">Slurp</a>.</p>
<p>Most search engines tell how you can <a href="http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html">verify</a> their crawlers and which crawler directives <a href="http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html">their user agents</a> support. To add a crawler, just adapt my code. For example, to add Yandex, test the host name for a leading &#8220;spider&#8221; and trailing &#8220;.yandex.ru&#8221; string and in between an integer, like in the isSlurp() function.</p>
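<p>The verification itself is a reverse DNS lookup, a host name check, and a forward lookup that must lead back to the requesting IP. Below is a minimal sketch of that procedure in Python, not the PHP from the download. The Yandex pattern follows the description above literally, the Googlebot pattern is an assumption, and <code>verify_crawler()</code> as well as both lookup helpers are names invented for this example. Check each engine&#8217;s current documentation before relying on any of these patterns.</p>

```python
import re
import socket

# Host name patterns per crawler. The Yandex one mirrors the text:
# leading "spider", an integer, trailing ".yandex.ru".
# The Googlebot one is an assumption for illustration.
CRAWLER_HOST_PATTERNS = {
    "yandex": re.compile(r"^spider\d+\.yandex\.ru$", re.IGNORECASE),
    "googlebot": re.compile(r"\.googlebot\.com$", re.IGNORECASE),
}


def reverse_lookup(ip):
    """IP address to host name, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host):
    """Host name to a list of IPv4 addresses, empty on failure."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def verify_crawler(ip, crawler, reverse=reverse_lookup, forward=forward_lookup):
    """True only if the IP's host name matches the crawler's pattern
    and that host name resolves back to the very same IP."""
    host = reverse(ip)
    if host is None:
        return False
    if not CRAWLER_HOST_PATTERNS[crawler].search(host):
        return False
    return ip in forward(host)
```

<p>The lookup functions are passed in as parameters so you can test the logic with canned data instead of hammering your DNS server. The forward lookup matters, because anybody who controls the reverse DNS zone of an IP range can make it claim any host name.</p>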
<h4 id="smart-robots-txt-test">Test</h4>
<p>Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don&#8217;t need to wait for crawler visits, you can use the &#8220;Fetch as Googlebot&#8221; thingy in your webmaster console.</p>
<p>Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from &#8220;.search.live.com&#8221; by accident, or that someone at Yahoo starts an ancient test bot that still uses an &#8220;inktomisearch.com&#8221; host name.</p>
<p>Don&#8217;t rely on my crawler detection routines. They&#8217;re dumped from memory in a hurry, I&#8217;ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It&#8217;s up to you to make it smart.</p>
<h4 id="smart-robots-txt-launch">Launch</h4>
<p>Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.</p>
<h3 id="smart-robots-txt-output">The output of a smart robots.txt</h3>
<p>When you download a <a href="http://sebastians-pamphlets.com/smart_robots.txt">smart robots.txt</a> with your browser, wget, or any other tool that comes with user agent spoofing, you&#8217;ll see a 403 or something like:</p>
<blockquote><p><code><br />
HTTP/1.1 200 OK<br />
Date: Wed, 24 Feb 2010 16:14:50 GMT<br />
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1<br />
X-Powered-By: sebastians-pamphlets.com<br />
X-Robots-Tag: noindex, noarchive, nosnippet<br />
Connection: close<br />
Transfer-Encoding: chunked<br />
Content-Type: text/plain;charset=iso-8859-1</p>
<p># In order to gain permissions to crawl blocked site areas<br />
# please contact the webmaster via<br />
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot </p>
<p>User-agent: *<br />
Disallow: /<br />
</code>(the contact form URI above doesn&#8217;t exist)</p></blockquote>
<p>whilst a real search engine crawler like Googlebot gets slightly different contents:</p>
<blockquote><p><code><br />
HTTP/1.1 200 OK<br />
Date: Wed, 24 Feb 2010 16:14:50 GMT<br />
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1<br />
X-Powered-By: sebastians-pamphlets.com<br />
X-Robots-Tag: noindex, noarchive, nosnippet<br />
Connection: close<br />
Transfer-Encoding: chunked<br />
Content-Type: text/plain; charset=iso-8859-1</p>
<p># In order to gain permissions to crawl blocked site areas<br />
# please contact the webmaster via<br />
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot </p>
<p>User-agent: Googlebot<br />
Allow: /<br />
Disallow: </p>
<p>Sitemap: http://sebastians-pamphlets.com/sitemap.xml</p>
<p>User-agent: *<br />
Disallow: /<br />
</code></p></blockquote>
<h3 id="smart-robots-txt-rant">Search engines hide important information from webmasters</h3>
<p>Unfortunately, most search engines don&#8217;t provide enough information about their crawling. For example, last time I looked Google doesn&#8217;t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for &#8220;<span title="Googlebot-Mobile/2.1">Googlebot-</span>&#8221; and you&#8217;ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection webmasters do need such information.</p>
<p>I&#8217;ve nudged Google and their response was that they don&#8217;t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn&#8217;t even announce properly that they&#8217;ve changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.</p>
<p>Since search engines don&#8217;t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents &#8220;on behalf of&#8221; particular user agents.</p>
<p>&nbsp;</p>
<p><strong>Enjoy.</strong></p>
<p>&nbsp;</p>
<div id="smart-robots-txt-changelog" style="margin-left:10px;">
<p><strong>Changelog:</strong></p>
<p>2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log  /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/smart-robots-txt/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SEO Bullshit: Mimicking a file system in URIs</title>
		<link>http://sebastians-pamphlets.com/file-names-in-uris-are-bullshit/</link>
		<comments>http://sebastians-pamphlets.com/file-names-in-uris-are-bullshit/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 20:57:09 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/file-names-in-uris-are-bullshit/</guid>
		<description><![CDATA[
Way back in the WWW&#8217;s early Jurassic, micro computer based Web development tools sneakily began poisoning the formerly ideal world of the Internet. All of a sudden we saw &#8216;.htm&#8217; URIs, because CP/M and later on PC-DOS file extensions were limited to 3 characters. Truncating the &#8216;language&#8217; part of HTML was bad enough. Actually, fucking [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/file-system-uri-bullshit.png" width="250" height="244" align="right" style="margin-left:5px;" alt="file system like URIs" title="CRAPPY URI"   />Way back in the WWW&#8217;s early Jurassic, micro computer based Web development tools sneakily began poisoning the formerly ideal world of the Internet. All of a sudden we saw &#8216;.htm&#8217; URIs, because CP/M and later on PC-DOS file extensions were limited to 3 characters. Truncating the &#8216;language&#8217; part of <abbr title="HyperText Markup Language">HTM<b>L</b></abbr> was bad enough. Actually, fucking with well established naming conventions wasn&#8217;t just a malady, but a symptom of a <b>w</b>orse <b>w</b>orld <b>w</b>ide pandemic.</p>
<p>Unfortunately, in order to bring Web publishing to the mere mortals (folks who could afford a micro computer), software developers invented DOS-like restrictions the Web wasn&#8217;t designed for. Web design tools maintained files on DOS file systems. FTP clients managed to convert backslashes originating from DOS file systems to slashes on UNIX servers, and vice versa (long before NT 3.51 and IIS). Directory names / file names equalled URIs. Most Web sites were static.</p>
<p>None of those cheap but fancy PC based Web design tools came with a mapping of objects (locally stored as files back then) to URIs pointing to Web resources. Despite Tim Berners-Lee&#8217;s warnings (<a href="http://www.w3.org/Provider/Style/URI">like</a> &#8220;<i>It is the duty of a Webmaster to allocate URIs which you will be able to stand by in 2 years, in 20 years, in 200 years. This needs thought, and organization, and commitment.</i>&#8221;), the technology used to create a resource named its unique identifier (URI). That&#8217;s as absurd as wearing diapers a whole life long.</p>
<p>Newbie Web designers grew up with this flawed concept, and never bothered to research the Web&#8217;s fundamentals. In their limited view of the Web, a URI was a mirrored version of a file name and its location on their local machine, and everything served from <code>/cgi-bin/</code> had to be blocked in robots.txt, because all dynamic stuff was evil.</p>
<p>Today, those former newbies consider themselves oldtimers. Actually, they&#8217;re still greenhorns, because they&#8217;ve never learned that URIs have nothing to do with files, directories, or a Web resource&#8217;s (current) underlying technology (as in .php3 for PHP version 3.x, .shtml for SSI, &#8230;).</p>
<p style="margin-left:10px;"><strong>Technology evolves, even changes, but (valuable) contents tend to stay. URIs should solely address a piece of content, they must not change when the technology used to serve those contents changes. That means strings like &#8216;.html&#8217; or folder names must not be used in URIs.</strong></p>
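<p>To illustrate the principle, here&#8217;s a minimal PHP sketch of such a mapping. The URIs, the <code>$uriMap</code> table, the script names, and the <code>resolveUri()</code> helper are all made up for this example. The point is only that the URI stays put while the script or technology behind it may change.</p>
<p><code>&lt;?php<br />
// Hypothetical lookup table: stable URI =&gt; whatever currently serves the content.<br />
// Swapping "product.php" for another script or technology leaves the URI untouched.<br />
$uriMap = array(<br />
    "/blue-widget" =&gt; array("script" =&gt; "product.php", "id" =&gt; 4711),<br />
    "/about" =&gt; array("script" =&gt; "page.php", "id" =&gt; 12),<br />
);<br />
<br />
// Returns the mapping entry for a request URI, or NULL if the URI is unknown.<br />
function resolveUri($requestUri, $uriMap) {<br />
    // Only the path identifies the content; the query string is ignored here.<br />
    $path = parse_url($requestUri, PHP_URL_PATH);<br />
    if (!is_string($path)) {<br />
        return NULL;<br />
    }<br />
    return isset($uriMap[$path]) ? $uriMap[$path] : NULL;<br />
}<br />
</code></p>
<p>With a table like this, <code>resolveUri("/blue-widget?ref=feed", $uriMap)</code> finds the entry with the id 4711, no matter which file extension the serving script carries this year.</p>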
<p>Many of those notorious greenhorns offer their equally ignorant clients Web development and SEO services today. They might have managed to handle dynamic contents by now (thanks to osCommerce, WordPress and other CMSs), but they&#8217;re still stuck with ancient paradigms that were never meant to exist on the Internet.</p>
<p>They might have discovered that search engines are capable of crawling and indexing dynamic contents (URIs with query strings) nowadays, but they still treat them as dumb bots &#8212; as if Googlebot or Slurp weren&#8217;t more sophisticated than Altavista&#8217;s Scooter of 1998. </p>
<p>They might even develop trendy crap (version 2.0 with nifty rounded corners) today, but they still don&#8217;t get IT. Whatever IT is, it doesn&#8217;t deserve a URI like <code>/category/vendor/product/color/size/crap.htm</code>.</p>
<p>Why hierarchical URIs (expressing breadcrumbs or whatnot) are utter crap (SEO-wise as well as from a developer&#8217;s POV) is explained here:</p>
<h3><a href="http://seobullshit.com/seo-toxin-directory-like-uri-structures/"><big>SEO Toxin</big></a></h3>
<p>&nbsp;</p>
<p><a href="http://seobullshit.com/seo-toxin-directory-like-uri-structures/"><img src="http://sebastians-pamphlets.com/img/crap/SEObullshit-stop-the-madness-125x125.jpg" width="125" height="125" align="left" style="margin-right:5px;" alt="SEO Bullshit" title="SEO Bullshit" /></a>I&#8217;ve published my rant &#8220;<a href="http://seobullshit.com/seo-toxin-directory-like-uri-structures/"><b>Directory-Like URI Structures Are SEO Bullshit</b></a>&#8221; on <b>SEO Bullshit dot com</b> for a reason.</p>
<p>You should keep an eye on this new blog. Subscribe to its <a href="http://feeds.feedburner.com/SeoBullshit">RSS feed</a>. Watch its <a href="https://twitter.com/bullshitradar">Twitter account</a>.</p>
<p>If it&#8217;s about SEO and it&#8217;s there, it&#8217;s most probably bullshit. If it&#8217;s bullshit, avoid it.</p>
<p>If you plan to spam the SEO blogosphere with your half-assed newbie thoughts (especially when you&#8217;re an unconvinceable &#8216;oldtimer&#8217;), consider obeying this rule of thumb:</p>
<p><strong>The top minus one reason to publish SEO stupidity is: You&#8217;ll end up <a href="http://seobullshit.com/">here</a>.</strong></p>
<p>Of course that doesn&#8217;t mean newbies shouldn&#8217;t speak out. I&#8217;m just sick of newbies who sell their half-assed brain farts as SEO advice to anyone. Noobs should read, ask, listen, learn, practice, evolve. Until they become pros. As a plain Web developer, I can tell from my own experience that listening to SEO professionals is worth every minute of your time.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/file-names-in-uris-are-bullshit/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How do Majestic and LinkScape get their raw data?</title>
		<link>http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/</link>
		<comments>http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/#comments</comments>
		<pubDate>Thu, 21 Jan 2010 21:00:59 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Tools]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/</guid>
		<description><![CDATA[
Does your built-in bullshit detector cry in agony when you read announcements of link analysis tools claiming to have crawled Web pages in the trillions? Can a tiny SEO shop, or a remote search engine in its early stages running on donated equipment, build an index of that size? It took Google a decade to [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/lincscape-data-acquisition.png" width="216" height="361" align="right" alt="LinkScape data acquisition" title="Data sources of link analysis tool always shady?" style="margin-left:5px;" />Does your built-in bullshit detector cry in agony when you read announcements of link analysis tools claiming to have crawled Web pages in the trillions? Can a tiny SEO shop, or a remote search engine in its early stages running on donated equipment, build an index of that size? It took Google a decade to reach these figures, and Google&#8217;s webspam team alone outnumbers the staff of <a href="http://seomoz.com/">SEOmoz</a> and <a href="http://www.majestic12.co.uk/">Majestic</a>, not to speak of infrastructure.</p>
<p>Well, it&#8217;s not as shady as you might think, although there&#8217;s some serious bragging and willy whacking involved. </p>
<p>First of all, neither SEOmoz nor Majestic owns an indexed copy of the Web. They process markup just to extract hyperlinks. That means they parse Web resources, mostly HTML pages, to store linkage data. Once each link and its attributes (HREF and REL values, anchor text, &#8230;) are stored under a Web page&#8217;s URI, the markup gets discarded. That&#8217;s why you can&#8217;t search these indexes for keywords. There&#8217;s no full text index necessary to compute link graphs.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/majestic-index-size.png" width="139" height="106" align="left" alt="Majestic index size" title="Majestic's advertised index size" style="margin-right:5px;" />The storage requirements for the Web&#8217;s link graph are way smaller than for a full text index that major search engines have to handle. In other words, it&#8217;s plausible.</p>
<p>Majestic clearly <a href="http://www.majestic12.co.uk/projects/dsearch/technology.php">describes</a> this process, and openly states that they <a href="http://www.majesticseo.com/faq.php#NoFullText">index only links</a>.</p>
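<p>The link-only processing described above could look roughly like this in PHP. It&#8217;s a sketch under assumptions, not Majestic&#8217;s or SEOmoz&#8217;s actual code: <code>extractLinks()</code> is a made-up helper, and it relies on PHP&#8217;s DOM extension being available.</p>
<p><code>&lt;?php<br />
// Parses a page, keeps only its links and their attributes, and drops the markup.<br />
function extractLinks($html) {<br />
    if (trim($html) === "") {<br />
        return array();<br />
    }<br />
    // Real-world tag soup triggers parser warnings; collect them silently.<br />
    libxml_use_internal_errors(TRUE);<br />
    $doc = new DOMDocument();<br />
    $doc-&gt;loadHTML($html);<br />
    libxml_clear_errors();<br />
    $links = array();<br />
    foreach ($doc-&gt;getElementsByTagName("a") as $a) {<br />
        if (!$a-&gt;hasAttribute("href")) {<br />
            continue; // named anchors carry no link<br />
        }<br />
        $links[] = array(<br />
            "href" =&gt; $a-&gt;getAttribute("href"),<br />
            "rel" =&gt; $a-&gt;getAttribute("rel"),<br />
            "anchor" =&gt; trim($a-&gt;textContent),<br />
        );<br />
    }<br />
    return $links; // only linkage data survives, nothing to run a keyword search on<br />
}<br />
</code></p>
<p>Storing the returned array under the page&#8217;s URI is all a link graph needs, which is why such an index can&#8217;t answer full text queries.</p>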
<p>With SEOmoz that&#8217;s a completely different story. They obfuscate information about the technology behind LinkScape to a level that could be described as near-snake-oil. Of course one could argue that they might be totally clueless, but I don&#8217;t buy that. You can&#8217;t create a tool like LinkScape being a moron with an IQ slightly below an amoeba. As a matter of fact, I do know that LinkScape was developed by extremely bright folks, so we&#8217;re dealing with a <a href="http://www.opensiteexplorer.org/">misleading sales pitch</a>:</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/linkscape-index-size.png" width="385" height="137" align="center" alt="Linkscape index size" title="Linkscape's advertised index size" style="margin:5px;" /></p>
<p>Let&#8217;s throw in a <a href="http://sphinn.com/story/79700#70724">comment at Sphinn</a>, where a SEOmoz rep posted &#8220;<b>Our bots, our crawl, our index</b>&#8221;.</p>
<p>Of course that&#8217;s utter bullshit. SEOmoz does not have the resources to accomplish such a task. In other words, if &#8211;and that&#8217;s a big IF&#8211; they do work as described above, they&#8217;re operating something extremely sneaky that breaks Web standards and my understanding of fairness and honesty. Actually, that&#8217;s not so, but because it is not so, LinkScape and OpenSiteExplorer in its current shape must die (see below why).</p>
<p>They do insult your intelligence as well as mine, and that&#8217;s obviously not the right thing to do, but I assume they do it solely for marketing purposes. Not that they need to cover up their operation with a smokescreen like that. LinkScape could succeed with all facts on the table. I&#8217;d call it a neat SEO tool, if only it were legit.</p>
<p><strong>So what&#8217;s wrong with SEOmoz&#8217;s statements above, and LinkScape at all?</strong></p>
<p>Let&#8217;s start with &#8220;Crawled in the past 45 days: 700 billion links, 55 billion URLs, 63 million root domains&#8221;. That translates to &#8220;crawled &#8230; 55 billion Web pages, including 63 million root index pages, carrying 700 billion links&#8221;. 13 links per page is plausible. <a href="http://sebastians-pamphlets.com/crawling-vs-indexing/">Crawling</a> 55 billion URIs requires sending out HTTP GET requests to fetch 55 billion Web resources within 45 days; that&#8217;s roughly 30 terabytes per day. Plausible? Perhaps.</p>
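<p>Here&#8217;s a quick sanity check of those figures as a small PHP calculation. The assumptions are that a terabyte means 10^12 bytes and that the load spreads evenly over the 45 days. The average resource size isn&#8217;t a measured value; it&#8217;s derived backwards from the 30 terabytes estimate.</p>
<p><code>&lt;?php<br />
$links = 700e9;        // advertised links<br />
$urls = 55e9;          // advertised URLs<br />
$days = 45;<br />
$bytesPerDay = 30e12;  // the 30 TB per day estimate, assuming 1 TB = 10^12 bytes<br />
<br />
$linksPerUrl = $links / $urls;              // about 12.7, hence the 13 links per page<br />
$urlsPerDay = $urls / $days;                // about 1.22 billion fetches per day<br />
$urlsPerSecond = $urlsPerDay / 86400;       // about 14,000 fetches per second, around the clock<br />
$bytesPerUrl = $bytesPerDay / $urlsPerDay;  // about 24,500 bytes per fetched resource<br />
<br />
printf("%.1f links per URL\n", $linksPerUrl);<br />
printf("%.2f billion fetches per day\n", $urlsPerDay / 1e9);<br />
printf("%d fetches per second\n", $urlsPerSecond);<br />
printf("%.1f kB per fetched resource\n", $bytesPerUrl / 1000);<br />
</code></p>
<p>In other words, the 30 terabytes figure implies an average of roughly 24.5 kB per resource, fetched at about 14,000 requests per second for 45 days straight.</p>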
<p>True? Not as is. Making up numbers like &#8220;crawled 700 billion links&#8221; suggests a comprehensive index of 700 billion URIs. I highly doubt that SEOmoz did &#8216;crawl&#8217; 700 billion URIs.</p>
<p>If SEOmoz really crawled the Web, they&#8217;d have to respect Web standards like the Robots Exclusion Protocol (REP). You would find their crawler in your logs. An organization crawling the Web must</p>
<ul>
<li>do that with a user agent that identifies itself as a crawler, for example &#8220;Mozilla/5.0 (compatible; Seomozbot/1.0; +http://www.seomoz.com/bot.html)&#8221;,</li>
<li>fetch robots.txt at least daily,</li>
<li>provide a method to block their crawler with robots.txt,</li>
<li>respect indexer directives like &#8220;noindex&#8221; or &#8220;nofollow&#8221; both in META elements and in HTTP response headers.</li>
</ul>
<p>SEOmoz obeys only <code>&lt;META NAME="SEOMOZ" CONTENT="NOINDEX" /&gt;</code>, according to their <a href="http://www.seomoz.org/linkscape/help/sources">sources page</a>. And exactly this page reveals that they purchase their data from various services, including search engines. <strong>They do not crawl a single Web page.</strong></p>
<p>Savvy SEOs should know that <a href="http://sebastians-pamphlets.com/crawling-vs-indexing/">crawling, parsing, and indexing</a> are different processes. Why does SEOmoz insist on the term &#8220;crawling&#8221;, taking all the <a href="http://sphinn.com/story/79700">flak</a> they can get, when they obviously don&#8217;t crawl anything? </p>
<p>Two claims out of three in &#8220;Our bots, our crawl, our index&#8221; are blatant lies. If SEOmoz performs any crawling, in addition to processing bought data, without following and communicating the procedure outlined above, that would be sneaky. I really hope that&#8217;s not happening.</p>
<p>As a matter of fact, I&#8217;d like to see SEOmoz crawling. I&#8217;d be very, very happy if they would not purchase a single byte of 3rd party crawler results. Why? Because I could block them in robots.txt. If they don&#8217;t access my content, I don&#8217;t have to worry whether they obey my indexer directives (robots meta &#8216;tag&#8217;) or not.</p>
<p>As a side note, requiring a &#8220;SEOMOZ&#8221; robots META element to opt out of their link analysis is plain theft. Adding such code bloat to my pages takes a lot of time, and that&#8217;s expensive. Also, serving an additional line of code in each and every HEAD section sums up to a lot of wasted bandwidth &#8211;$$!&#8211; over time. Am I supposed to invest my hard earned bucks just to prevent me from revealing my outgoing links to my competitors? For that reason alone I should report SEOmoz to the FTC requesting them to shut LinkScape down asap.</p>
<p>They don&#8217;t obey the X-Robots-Tag (&#8220;noindex&#8221;/&#8220;nofollow&#8221;/&#8230; in the HTTP header) for a reason. Working with purchased data from various sources they can&#8217;t guarantee that they even get those headers. Also, why the fuck should I serve MSNbot, Slurp or Googlebot an HTTP header addressing SEOmoz? This could put my search engine visibility at risk.</p>
<p>If they crawled themselves, serving their user agent a &#8220;noindex&#8221; X-Robots-Tag and a 403 might be doable, at least when they pay for my efforts. With their current setup that&#8217;s technically impossible. They could switch to <a href="http://80legs.com/whatitis.html">80legs.com</a> completely; that would solve the problem, provided 80legs works 100% by the REP and crawls as &#8220;SEOmozBot&#8221; or so.</p>
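<p>If a crawler did identify itself, the 403 plus &#8220;noindex&#8221; X-Robots-Tag response could be sketched in PHP like this. The user agent token <code>SEOmozBot</code> is hypothetical (no such crawler is confirmed here), and <code>isBlockedCrawler()</code> is a helper made up for this example.</p>
<p><code>&lt;?php<br />
// Case-insensitive check whether a user agent string contains a blocked token.<br />
function isBlockedCrawler($userAgent, $blockedTokens) {<br />
    foreach ($blockedTokens as $token) {<br />
        if (stripos($userAgent, $token) !== FALSE) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    return FALSE;<br />
}<br />
<br />
$blockedTokens = array("SEOmozBot"); // hypothetical crawler name<br />
$userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";<br />
if (isBlockedCrawler($userAgent, $blockedTokens)) {<br />
    // Only this user agent gets the directive; search engine crawlers never see it.<br />
    header("HTTP/1.1 403 Forbidden", TRUE, 403);<br />
    header("X-Robots-Tag: noindex, nofollow");<br />
    exit;<br />
}<br />
</code></p>
<p>This only works against a crawler that announces itself. Data bought from third parties arrives under somebody else&#8217;s user agent, so there&#8217;s nothing to match.</p>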
<p>With <a href="http://www.majesticseo.com/">MajesticSEO</a> that&#8217;s not an issue, because I can <a href="http://www.majestic12.co.uk/projects/dsearch/mj12bot.php">block their crawler</a> with<code><br />User-agent: MJ12bot<br />
Disallow: /<br />
</code></p>
<p>Yahoo&#8217;s site explorer also delivers too much data. I can&#8217;t block it without losing search engine traffic. Since it will probably die when Microsoft takes over search.yahoo.com, I don&#8217;t rant much about it. Google and Bing don&#8217;t reveal my linkage data to everyone.</p>
<p>I have an issue with SEOmoz&#8217;s LinkScape, and OpenSiteExplorer as well. It&#8217;s serious enough that I say they have to close it, if they&#8217;re not willing to change their architecture. And that has nothing to do with misleading sales pitches, or arrogant behavior, or sympathy (respectively, a possible lack of sympathy).</p>
<p>The competitive link analysis OpenSiteExplorer/LinkScape provides, without giving me a real chance to opt out, puts my business at risk. As much as I appreciate an opportunity to analyze my competitors, vice versa it&#8217;s downright evil. Hence just kill it.</p>
<p>Is my take too extreme? Please enlighten me in the comments.</p>
<p>Update: A <a href="http://smackdown.blogsblogsblogs.com/2010/01/22/why-the-renewed-interest-in-the-linkscape-scams-and-deception/">follow-up post from Michael VanDeMar</a> and its <a href="http://sphinn.com/story/139472">Sphinn discussion</a>, the <a href="http://sphinn.com/story/79700#70724">first LinkScape thread at Sphinn</a>, and <a href="http://sphinn.com/story/139361">Sphinn comments to this pamphlet</a>.<br /><a href="http://sebastians-pamphlets.com/smart_robots.txt">&#160;</a></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/linkscape-opensiteexplorer-majestic-data-sources-shady-or-not/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Sanitize links in your content feeds</title>
		<link>http://sebastians-pamphlets.com/feed-link-sanitizer/</link>
		<comments>http://sebastians-pamphlets.com/feed-link-sanitizer/#comments</comments>
		<pubDate>Sat, 02 Jan 2010 13:02:25 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Blogging]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Tools]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/feed-link-sanitizer/</guid>
		<description><![CDATA[
Here&#8217;s a WordPress plug-in that sanitizes relative links and on-the-page links in your content feeds: feedLinkSanitizer. Why do you need it?
Because you end up with invalid links like http://feeds.sebastians-pamphlets.com/SebastiansPamphlets#tos if you don&#8217;t use it. Once the post phases out of the main page, the link points to nowhere in feedreaders and reprints.
In feeds, absolute links [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/feedLinkSanitizer-prevents-you-from-broken-rss-links.png" width="180" height="151" align="right" style="margin-left:5px;" alt="Don't bother your visitors with broken links in feeds" title="Don't lose traffic due to broken links in feeds"  />Here&#8217;s a WordPress plug-in that sanitizes relative links and on-the-page links in your content feeds: <a href="http://sebastians-pamphlets.com/downloads/wordpress/feedLinkSanitizer.zip">feedLinkSanitizer</a>. <strong>Why do you need it?</strong></p>
<p><strong>Because you end up with invalid links</strong> like <code>http://feeds.sebastians-pamphlets.com/SebastiansPamphlets#tos</code> <strong>if you don&#8217;t use it.</strong> Once the post phases out of the main page, the link points to nowhere in feedreaders and reprints.</p>
<p>In feeds, absolute links are mandatory. Make sure that not a single on-the-page link or relative link slips out of your site.</p>
<h3>Relative links</h3>
<p>When you put all links to your own stuff as <code>/perma-link/</code> instead of <code>http://your-blog/permalink/</code> you can serve your blog&#8217;s content from a different server / base URI (dev, move, &#8230;) without editing all internal links.</p>
<p>The downside is that for various very good reasons (scrapers, search engines, whatnot) <strong>thou must not have relative links in your HTML</strong>. You might disagree, but read on.</p>
<p>The simple solution is: store relative links in your WordPress database, but output absolute links. Follow the hint in feedLinkSanitizer.txt to activate link sanitizing in your HTML. By default it changes only feed contents.</p>
<p>The plug-in changes <code>/perma-link/</code> to <code>http://example.com/perma-link/</code> in your posts, using the blog URI provided in your WordPress settings. It takes the current server name if this value is missing.</p>
<h3>Fragment links</h3>
<p>You can link to any DOM-ID in an HTML page, for example <code>&lt;a href="#tos"&gt;Table of contents&lt;/a&gt;</code> where &#8216;tos&#8217; is the DOM-ID of an HTML element like <code>&lt;h2 id="tos"&gt;Table of contents&lt;/h2&gt;</code>. These on-the-page links even come with some <a href="http://googleblog.blogspot.com/2009/09/jump-to-information-you-want-right-from.html">SEO value</a>, just in case you don&#8217;t care much about <a href="http://www.seoconsultants.com/html/fragment/">usability</a>.</p>
<p>The plug-in changes <code>#tos</code> to <code>http://example.com/perma-link/#tos</code> in your posts. If you&#8217;ve set <code>$sanitizeAllLinks = TRUE;</code> in the plugin-code, an on-the-page link clicked on the blog&#8217;s main page will open the post, positioning to the DOM-ID.</p>
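<p>Both rewrites can be sketched with two regular expressions. This is an illustration, not the plug-in&#8217;s actual code: <code>sanitizeFeedLinks()</code> is a made-up name, <code>$blogUri</code> and <code>$permalink</code> are assumed inputs, and a robust implementation would rather use an HTML parser than regular expressions.</p>
<p><code>&lt;?php<br />
// Illustrative only. $blogUri is the blog&#8217;s base URI, $permalink the post&#8217;s absolute URI.<br />
function sanitizeFeedLinks($html, $blogUri, $permalink) {<br />
    $blogUri = rtrim($blogUri, "/");<br />
    // href="/perma-link/" becomes href="http://example.com/perma-link/";<br />
    // protocol-relative links like href="//cdn.example.com/" stay untouched.<br />
    $html = preg_replace('~\b(href|src)=(["\'])/(?!/)~i', '${1}=${2}' . $blogUri . '/', $html);<br />
    // href="#tos" becomes href="http://example.com/perma-link/#tos"<br />
    $html = preg_replace('~\bhref=(["\'])#~i', 'href=${1}' . $permalink . '#', $html);<br />
    return $html;<br />
}<br />
</code></p>
<p>Called with <code>http://example.com/</code> as blog URI and <code>http://example.com/perma-link/</code> as permalink, it turns <code>/perma-link/</code> and <code>#tos</code> into the absolute URIs shown above.</p>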
<h3>Download <a href="http://sebastians-pamphlets.com/downloads/wordpress/feedLinkSanitizer.zip">feedLinkSanitizer</a></h3>
<p>I&#8217;m a launch-early kind of guy, so test it yourself. And: <i>Use at your own risk. No warranty expressed or implied is provided.</i></p>
<p>If you use another CMS, download the plug-in anyway and <del>steal</del> <ins>adapt</ins> its code.</p>
<p>Credits for previous work go to <a href="http://jon.thysell.us/software/rss-base/">Jon Thysell</a> and <a href="http://www.gerd-riesselmann.net/wordpress-plugins/no-more-relative-links-in-wordpress-rss-feeds">Gerd Riesselmann</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/feed-link-sanitizer/feed/</wfw:commentRss>
		</item>
		<item>
		<title>URI canonicalization with an X-Canonical-URI HTTP header</title>
		<link>http://sebastians-pamphlets.com/x-canonical-uri-http-header/</link>
		<comments>http://sebastians-pamphlets.com/x-canonical-uri-http-header/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 16:53:02 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/x-canonical-uri-http-header/</guid>
		<description><![CDATA[
Dear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I&#8217;m desperately in need of a good reason to praise your wisdom and whatnot. From this year&#8217;s x-mas wish list:
All search engines obey the X-Canonical-URI HTTP header
The rel=canonical link element is a great tool, at [...]]]></description>
			<content:encoded><![CDATA[
<p><img  src="http://sebastians-pamphlets.com/img/posts/x-canonial-uri-http-header-please.png" width="230" height="411" align="right" style="margin-left:5px;" alt="X-Canonical-URI HTTP Header" title="P L E A S E !!" />Dear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I&#8217;m desperately in need of a good reason to praise your wisdom and whatnot. From this year&#8217;s x-mas wish list:</p>
<h3>All search engines obey the X-Canonical-URI HTTP header</h3>
<p>The <a href="http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">rel=canonical link element</a> is a great tool, at least if <a href="http://www.audettemedia.com/blog/link-canonical-is-breaking-sites/">applied properly</a>, but sometimes it&#8217;s a royal pain in the ass.</p>
<p>Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there&#8217;s no way to get a hand on the source code to change them.</p>
<p>Also, lots of resources can&#8217;t be stuffed with HTML&#8217;s link elements, for example dynamically created PDFs, plain text files, or images.</p>
<p>It&#8217;s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren&#8217;t a doable option, for example when the destination URI is #5 in a redirect chain that can&#8217;t get shortened because the redirects are performed by a 3rd party that doesn&#8217;t cooperate.</p>
<p>This one, on the other hand, is elegant and scalable:</p>
<p><code>if (messedUp($_SERVER["REQUEST_URI"])) {</code><br />
<code>    <b>header("X-Canonical-URI: $canonicalUri");</b></code><br />
<code>}</code></p>
<p><a href="http://sebastians-pamphlets.com/x-canonical-uri-http-header/#comment-2089">Or</a>:<br />
<code>    <b>header("Link: &lt;http://example.com/canonical-uri/&gt;; rel=canonical");</b></code></p>
<p>Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.</p>
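<p>A minimal sketch of such a request handler follows. Everything in it is an assumption made for illustration: the <code>canonicalUriFor()</code> helper, the two example rules, the parameter names, and the <code>example.com</code> host. Whether a search engine honors the header it sends is outside the scope of the sketch.</p>
<p><code>&lt;?php<br />
// Hypothetical canonicalization rules; adapt them to the clutter your site produces.<br />
function canonicalUriFor($requestUri) {<br />
    $parts = parse_url($requestUri);<br />
    if ($parts === FALSE || !isset($parts["path"])) {<br />
        return NULL; // leave malformed requests alone<br />
    }<br />
    // Example rule 1: /foo/index.php is the same resource as /foo/<br />
    $path = preg_replace('~/index\.php$~', '/', $parts["path"]);<br />
    // Example rule 2: drop parameters that don&#8217;t change the content, sort the rest<br />
    $query = array();<br />
    if (isset($parts["query"])) {<br />
        parse_str($parts["query"], $query);<br />
        unset($query["sessionid"], $query["utm_source"]);<br />
        ksort($query);<br />
    }<br />
    $canonicalUri = "http://example.com" . $path;<br />
    if (count($query) &gt; 0) {<br />
        $canonicalUri .= "?" . http_build_query($query, "", "&amp;");<br />
    }<br />
    return $canonicalUri;<br />
}<br />
<br />
// Runs before any script gets invoked and before any static file gets served.<br />
if (isset($_SERVER["REQUEST_URI"])) {<br />
    $canonicalUri = canonicalUriFor($_SERVER["REQUEST_URI"]);<br />
    if ($canonicalUri !== NULL) {<br />
        header("Link: &lt;$canonicalUri&gt;; rel=canonical");<br />
    }<br />
}<br />
</code></p>
<p>Hooking it in via PHP&#8217;s <code>auto_prepend_file</code> setting or a rewrite rule keeps the legacy scripts untouched, which is the whole point.</p>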
<p>By the way, having all URI canonicalization routines in one piece of code is way more transparent, and much easier to manage, than a bazillion isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.</p>
<p id="no-crap-tweets-4-9-days"><img  src="http://sebastians-pamphlets.com/img/posts/sexy-santa-threat.png" width="150" height="128" align="left" style="margin-right:5px;" alt="red crab blackmailing search engines" title="Just Do It!" />Dear search engines, if you make that happen, I promise that I won&#8217;t tweet your products with a &#8220;#crap&#8221; hashtag for the whole rest of <span title="2009">this year</span>. Deal?</p>
<p>And yes, I know I&#8217;m somewhat late, two days before x-mas, but you&#8217;ve got smart developers, haven&#8217;t you? So please, go get your <span title="Sorry, couldn't resist with regard to recent crap you've thoughtlessly launched">&#8216;code monkeys&#8217;</span> to work and surprise me. Thanks.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/x-canonical-uri-http-header/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The anatomy of a deceptive Tweet spamming Google Real-Time Search</title>
		<link>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/</link>
		<comments>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 10:12:44 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Twitter]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/</guid>
		<description><![CDATA[
Minutes after the launch of Google&#8217;s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.
I&#8217;m sure Google&#8217;s WebSpam team will pull the plug sooner or later, but as of today Google&#8217;s real time search results are extremely vulnerable to questionable content.
The somewhat [...]]]></description>
			<content:encoded><![CDATA[
<p><img  src="http://sebastians-pamphlets.com/img/posts/spamming-google-real-time-search.png" width="250" height="345" align="right" style="margin-left:5px;" alt="Google real time search spammed and abused" title=""  />Minutes after the <a href="http://googleblog.blogspot.com/2009/12/relevance-meets-real-time-web.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">launch</a> of Google&#8217;s <a href="http://searchengineland.com/search-real-time-madness-31668">famous</a> Real Time Search, the Internet marketing community <a href="http://sphinn.com/story/135685">began</a> to <a href="http://outspokenmedia.com/seo/google-real-time-spam/">spam</a> the <a href="http://www.google.com/search?hl=en&#038;safe=off&#038;esrch=RTSearch&#038;tbo=1&#038;num=100&#038;q=spam&#038;tbs=rltm:1">scrolling SERPs</a>. Google gave birth to a <a href="http://www.seo-theory.com/2009/12/07/google-launches-a-new-spam-industry/">new spam industry</a>.</p>
<p>I&#8217;m sure Google&#8217;s <a href="http://friendfeed.com/dannysullivan/d973e438/real-time-spam-google-says-been-fighting-so-long">WebSpam</a> team will pull the plug sooner or later, but as of today Google&#8217;s real time search results are extremely vulnerable to questionable content.</p>
<p>The somewhat shady approach to making creative use of real time search that I&#8217;m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I&#8217;d dump the whole real-time thingy until the spam defense lines are rock solid.</p>
<p id="rtss-recipe"><strong>Here&#8217;s the recipe from Dr Evil&#8217;s WebSpam-Cook-Book:</strong></p>
<h3 id="rtss-ingredients">Ingredients</h3>
<ul>
<li>1 <a href="http://www.google.com/trends?q=spam+google">popular topic</a> that pulls lots of searches, but not so many that the results scroll down too fast.</li>
<li>1 <a href="http://www.google.com/products?q=spam+google&#038;hl=en&#038;aq=f">landing page</a> that makes the punter pull out the plastic in no time.</li>
<li>1 <a href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&#038;answer=93713">trusted authority page</a> totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.</li>
<li>1 <a href="http://goo.gl/">short</a> domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.</li>
</ul>
<h3 id="rtss-preparation">Preparation</h3>
<p>Create a new text file and name it <code>hot-topic.php</code> or so. Then code:<code><br />
&lt;?php<br />
$landingPageUri = "http://affiliate-program.com/?your-aff-id";<br />
$trustedPageUri = "http://google.com/something.py";<br />
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {<br />
   header("HTTP/1.1 307 Here you go today", TRUE, 307);<br />
   header("Location: $trustedPageUri");<br />
}<br />
else {<br />
   header("HTTP/1.1 301 Happy shopping", TRUE, 301);<br />
   header("Location: $landingPageUri");<br />
}<br />
exit;<br />
?&gt;</code></p>
<p>Provided you&#8217;re a savvy spammer, your crawler detection routine will be a little more <a href="http://fantomaster.com/fasvsspy01.html">complex</a>.</p>
<p>Save the file and upload it, then test the URI <code>http://youspamaw.ay/hot-topic.php</code> in your browser.</p>
<h3 id="rtss-serving">Serving</h3>
<ul>
<li>Log in to Twitter and submit lots of nicely crafted, not too keyword-stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don&#8217;t swear, and sail around phrases like &#8216;buy cheap viagra&#8217; with synonyms like &#8216;brighten up your girlfriend&#8217;s romantic moments&#8217;.</li>
<li>On their SERPs, Google will display the text from the trusted page&#8217;s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.</li>
<li>Just for entertainment, closely monitor Google&#8217;s real time SERPs, and your real-time sales stats as well.</li>
<li>Be happy and get rich by end of the week.</li>
</ul>
<p>Google removes links to untrusted destinations, that&#8217;s why you need to abuse authority pages. As long as you don&#8217;t launch f-bombs, Google&#8217;s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.</p>
<p>Hey <a href="http://twitter.com/GoogleWebspam">Google</a>, for the sake of our children, take that as a spam report!</p>
<hr />Copyright &copy; 2010 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span><div class="topsy_widget_data topsy_theme_light-green" style="float: right;margin-left: 0.75em;"><!-- { "url": "http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/", "style": "big", "title": "The anatomy of a deceptive Tweet spamming Google Real-Time Search" } --></div>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-spam-google-real-time-search-via-twitter/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Hard facts about URI spam</title>
		<link>http://sebastians-pamphlets.com/troubles-made-by-utm-variables-from-google-analytics/</link>
		<comments>http://sebastians-pamphlets.com/troubles-made-by-utm-variables-from-google-analytics/#comments</comments>
		<pubDate>Tue, 01 Dec 2009 20:00:33 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Duplicate Content]]></category>

		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Copy+Paste-Penalties]]></category>

		<category><![CDATA[AdSense]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/troubles-made-by-utm-variables-from-google-analytics/</guid>
		<description><![CDATA[
I stole this pamphlet&#8217;s title (and more) from Google&#8217;s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That&#8217;s the URI from the link above:
http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&#038;utm_medium=feed&#038;utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29
I&#8217;ve bolded the canonical URI; everything after the question mark is clutter added by Google.
When your Google [...]]]></description>
			<content:encoded><![CDATA[
<p>I stole this pamphlet&#8217;s title (and more) from Google&#8217;s post <a href="http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&#038;utm_medium=feed&#038;utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29">Hard facts about comment spam</a> for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That&#8217;s the URI from the link above:</p>
<p><code><b title="Canonical URI" style="color:black;">http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html</b><i title="Google's query string clutter" style="color:red;">?utm_source=feedburner&#038;utm_medium=feed<br />&#038;utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster<br />+Central+Blog%29</i></code></p>
<p><img src="http://sebastians-pamphlets.com/img/posts/ga-kraken.png" width="260" height="301" style="margin-left:5px;" align="right" alt="GA Kraken" title="Google Analytics fucks your canonical URIs" />I&#8217;ve bolded the canonical URI; everything after the question mark is <a href="http://analytics.blogspot.com/2009/11/integration-with-feedburner.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">clutter added by Google</a>.</p>
<p>When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, <a href="http://sebastians-pamphlets.com/troubles-made-by-utm-variables-from-google-analytics/#utm-opt-out">see below</a>).</p>
<h3 id="utm-bad">Why is it bad?</h3>
<p>FACT: <strong>Google&#8217;s method to track traffic from feeds to URIs creates new URIs.</strong> And lots of them. Depending on the number of possible values for each query string variable (<code>utm_source</code> <code>utm_medium</code> <code>utm_campaign</code> <code>utm_content</code> <code>utm_term</code>) the number of cluttered URIs pointing to the same piece of content can add up to dozens or more.</p>
<p>FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google&#8217;s search index is flooded with <a href="http://www.google.com/search?hl=en&#038;q=inurl:utm_source&#038;utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">28,900,000 cluttered URIs</a> mostly originating from copy+paste links. <a href="http://www.bing.com/search?q=inurl:utm_source">Bing</a> and <a href="http://search.yahoo.com/search?p=inurl:utm_source">Yahoo</a> haven&#8217;t indexed GA tracking parameters yet.</p>
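<p>To get a feeling for how quickly that adds up, here&#8217;s a minimal Python sketch. The value lists and the <code>example.com</code> URI are made up for illustration; only the variable names come from Google&#8217;s tracking scheme.</p>

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical values per tracking variable; real feeds will differ.
UTM_VALUES = {
    "utm_source": ["feedburner", "twitter", "newsletter"],
    "utm_medium": ["feed", "email"],
    "utm_campaign": ["Feed: blogspot/amDG", "autumn-promo"],
}

def cluttered_variants(canonical_uri, utm_values):
    """Return every URI variant produced by combining the given values."""
    names = list(utm_values)
    variants = []
    for combo in product(*(utm_values[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        variants.append(canonical_uri + "?" + query)
    return variants

variants = cluttered_variants("http://example.com/post.html", UTM_VALUES)
print(len(variants))  # 3 * 2 * 2 = 12
```

<p>Three variables with a handful of values each already yield a dozen distinct URIs for one piece of content, and every additional variable multiplies that count.</p>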
<p>That&#8217;s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. <a href="http://mattcutts.com/blog/">Matt Cutts</a> <a href="http://friendfeed.com/mattcutts/6309e560/graywolf-i-think-johnmu-suggestions-were-solid">said</a> &#8220;I don&#8217;t think utm will cause dupe issues&#8221; and points to <a href="http://johnmu.com/">John Müller</a>&#8217;s <a href="http://www.seroundtable.com/archives/021170.html">helpful advice</a> (<a href="http://www.cre8asiteforums.com/forums/index.php?showtopic=73804">methods</a> a site owner can apply to tidy up Google&#8217;s mess).</p>
<p>Maybe Google can handle this growing duplicate content chaos in their very own search index. Let&#8217;s forget that Google is the search engine that <a href="http://googlewebmastercentral.blogspot.com/2009/08/optimize-your-crawling-indexing.html?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">advocated</a> URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It&#8217;s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.</p>
<p>So far that&#8217;s just disappointing. To understand why it&#8217;s downright evil, let&#8217;s look at the implications from a technical point of view.</p>
<h3 id="utm-evil">Spamming URIs with <i>utm</i> tracking variables breaks lots of things</h3>
<p>Look at this URI: <code>http://www.<span title="This URI exists with another server name">example</span>.com/search.aspx<b style="color:red;">?</b>Query=musical+mobile<b style="color:red;">?</b>utm_source=Referral&#038;utm_medium=Internet&#038;utm_campaign=celebritybabies</code></p>
<p>Google added a query string to a query string. Two URI segment delimiters (<a href="http://www.w3.org/Addressing/URL/4_URI_Recommentations.html">&#8220;?&#8221;</a>) can cause all sorts of trouble at the landing page.</p>
<p>Some scripts will process only variables from Google&#8217;s query string, because they extract GET input from the URI&#8217;s last question mark to the fragment delimiter &#8220;#&#8221; or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names &#8230; the number of possible errors caused by amateurish extended query strings is infinite. Even if there&#8217;s only one &#8220;?&#8221; delimiter in the URI.</p>
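<p>Here&#8217;s a small Python sketch of two of those failure modes, using the example URI from above. The &#8220;naive&#8221; parser is a made-up stand-in for a script that reads its input from the last question mark; it is not taken from any particular product.</p>

```python
from urllib.parse import parse_qs, urlsplit

uri = ("http://www.example.com/search.aspx?Query=musical+mobile"
       "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies")

# A standards-aware parser splits at the FIRST "?" only, so the second "?"
# and the first tracking variable end up inside the value of "Query".
standard = parse_qs(urlsplit(uri).query)
print(standard["Query"])  # ['musical mobile?utm_source=Referral']

# A naive script that reads everything after the LAST "?" never sees "Query".
naive = parse_qs(uri.rsplit("?", 1)[1])
print(sorted(naive))  # ['utm_campaign', 'utm_medium', 'utm_source']
```

<p>Either way the landing page loses: the first script searches for the wrong phrase, the second one gets no search phrase at all.</p>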
<p>In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn&#8217;t even show a 5xx error.</p>
<p>Regardless of whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone&#8217;s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.</p>
<p>Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this <a href="http://sw-guide.de/wordpress/plugins/simple-trackback-validation/">plug-in</a>, the comparison can fail and the trackback gets deleted on arrival, without notice. If I&#8217;d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google&#8217;s UTM clutter.</p>
<p>Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn&#8217;t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, or rather now counterproductive with regard to GA reporting. Google&#8217;s UTM clutter has an impact on lots of tools that make sense <em>in addition</em> to Google Analytics. All broken.</p>
<p>What a glorious mess. Frankly, I&#8217;m somewhat puzzled. Google has hired tens of thousands of this planet&#8217;s brightest minds &#8211;I really mean that, literally!&#8211;, and they came out with half-assed crap like that? Un-fucking-believable.</p>
<h3 id="utm-opt-out">What can I do to avoid URI spam on my site?</h3>
<p><strong>Boycott Google&#8217;s poor man&#8217;s approach to link feed traffic data to Web analytics.</strong> Go to <a href="http://feedburner.google.com/?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris">Feedburner</a>. For each of your feeds click on &#8220;Configure stats&#8221; and uncheck &#8220;Track clicks as a traffic source in Google Analytics&#8221;. Done. Wait for a suitable solution.</p>
<p>If you really can&#8217;t live with traffic sources gathered from a somewhat <a href="http://sebastians-pamphlets.com/webkit-please-rescue-the-http_referer/">unreliable HTTP_REFERER</a>, and you&#8217;ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!</p>
<p>As a matter of fact, Google is responsible for this royal pain in the ass. Don&#8217;t fix Google&#8217;s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There&#8217;s absolutely no reason why a gazillion webmasters and developers should do Google&#8217;s job, <a href="http://sebastians-pamphlets.com/rip-rel-nofollow-funeral-party/">again and again</a>.</p>
<h3 id="utm-alternatives">What can Google do?</h3>
<p>Well, that&#8217;s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.</p>
<p>Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user&#8217;s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately. </p>
<p>Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.</p>
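<p>Roughly, such a redirect script could look like the Python sketch below. Everything in it is an assumption made for illustration: the token table, the in-memory click log, and the function name are hypothetical, and it says nothing about how Feedburner actually works internally.</p>

```python
import time

# Hypothetical mapping from opaque feed-item tokens to canonical URIs.
CANONICAL_URIS = {"a1b2": "http://example.com/2009/11/some-post.html"}
CLICK_LOG = []  # stand-in for whatever storage a real tracker would use

def track_and_redirect(token, client_ip, now=None):
    """Record the click, then answer with a 301 to the clean canonical URI."""
    target = CANONICAL_URIS.get(token)
    if target is None:
        return 404, {}
    CLICK_LOG.append({
        "token": token,
        "ip": client_ip,
        "timestamp": time.time() if now is None else now,
        "source": "feed",
    })
    # The tracking data stays on the server; the Location header is clean.
    return 301, {"Location": target}

status, headers = track_and_redirect("a1b2", "192.0.2.10", now=1259697633)
print(status, headers["Location"])
```

<p>The point of the design: the tracking data lives in the tracker&#8217;s storage, so the URI a user might copy and paste from the address bar is the canonical one.</p>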
<h3 id="utm-speak-out">Speak out!</h3>
<p>So, if you don&#8217;t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:</p>
<p><textarea readonly style="width:500px; height:55px; background:white; color:black; font-size:11pt;" wrap="virtual">?utm_source=sebastian&#038;utm_medium=pamphlet&#038;utm_campaign=thou+shalt+not+fuck+with+my+uris</textarea></p>
<p>I mean, nicely designed canonical URIs should be the search engineer&#8217;s porn, so perhaps somebody at Google will listen. Will ya?</p>
<p><b>Update:</b><a href="http://www.semmys.org/2010/search-tech-all-2010-nominees/"><img id="semmy2010" style="border:0;" align="right" src="http://www.semmys.org/dm/badges/10/LBnom.gif" alt="2010 SEMMY Nominee" /></a></p>
<p>I&#8217;ve just added a <a href="http://sebastians-pamphlets.com/stuff/utm-killer/">&#8220;UTM Killer&#8221; tool</a>, where you can enter a screwed URI and get a clean URI &#8212; all &#8216;utm_&#8217; crap and multiple &#8216;?&#8217; delimiters removed &#8212; in return. That&#8217;ll help when you copy URIs from your feedreader to use them in your blog posts.</p>
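<p>If you&#8217;d rather clean URIs in your own scripts, the idea can be sketched in a few lines of Python. This is not the code behind the tool, just an assumption of how such a cleanup might work; the function name <code>strip_utm</code> is made up.</p>

```python
from urllib.parse import urlsplit, urlunsplit

def strip_utm(uri):
    """Drop utm_* variables and treat stray extra "?" delimiters like "&"."""
    parts = urlsplit(uri)
    # Everything after the first "?" is the query; later "?" act as separators.
    pairs = parts.query.replace("?", "&").split("&")
    kept = [p for p in pairs if p and not p.lower().startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       "&".join(kept), parts.fragment))

print(strip_utm(
    "http://www.example.com/search.aspx?Query=musical+mobile"
    "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies"
))
# http://www.example.com/search.aspx?Query=musical+mobile
```

<p>One caveat: a script that legitimately expects an unencoded &#8220;?&#8221; inside a parameter value would be confused by this cleanup, so treat it as a starting point only.</p>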
<p>By the way, please <a href="http://www.semmys.org/category/search-tech/">vote up this pamphlet</a> so that I get the 2010 SEMMY Award. Thanks in advance!</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/troubles-made-by-utm-variables-from-google-analytics/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Derek Powazek outed himself big-mouthed and ignorant, and why that&#8217;s a pity</title>
		<link>http://sebastians-pamphlets.com/detox-your-web-development-team/</link>
		<comments>http://sebastians-pamphlets.com/detox-your-web-development-team/#comments</comments>
		<pubDate>Fri, 16 Oct 2009 18:30:16 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/detox-your-web-development-team/</guid>
		<description><![CDATA[
With childish attacks on his colleagues, Derek Powazek didn&#8217;t do professional Web development &#8211;as an industry&#8211; a favor. As a matter of fact, Derek Powazek insulted savvy Web developers, Web designers, even search engine staff, as well as usability experts and search engine specialists, who team up in countless projects helping small and large Web sites [...]]]></description>
			<content:encoded><![CDATA[
<p><img src="http://sebastians-pamphlets.com/img/posts/derek-powazek-1.png" width="200" height="270" align="right" style="margin-left:3px;" alt="Derek Powazek" title="Pathetic assclown" />With <a href="http://powazek.com/posts/2090" rel="nofollow crap">childish attacks</a> on his colleagues, Derek Powazek didn&#8217;t do professional Web development &#8211;as an industry&#8211; a favor. As a matter of fact, Derek Powazek <a href="http://powazek.com/posts/2101" rel="nofollow crap">insulted</a> savvy Web developers, Web designers, <a href="http://twitter.com/mattcutts/statuses/4840676463">even</a> search engine staff, as well as usability experts and search engine specialists, who team up in countless projects helping small and large Web sites succeed.</p>
<p>I seriously can&#8217;t understand how Derek Powazek &#8220;has survived 13 years in the web biz&#8221; (<a href="http://powazek.com/about/" rel="nofollow" title="scroll down to the footer">source</a>) without detailed knowledge of <em>how things get done in Web projects</em>. I mean, if a developer really has worked 13 years in the Web biz, he should know that the task of optimizing a Web site&#8217;s findability, crawlability, and accessibility for all user agents out there (SEO) is usually not performed by &#8220;spammers evildoers and opportunists&#8221;, but by highly professional experts who just master Web development better than the average designer, copy-writer, publisher, developer or marketing guy.</p>
<p>Boy, what an ego. Derek Powazek truly believes that if &#8220;[all SEO/SEM techniques are] not obvious to you, and you make websites, you need to get informed&#8221; (<a href="http://powazek.com/posts/2101" rel="nofollow crap">source</a>). That translates to &#8220;if you aren&#8217;t 100% perfect in all aspects of Web development and Internet marketing, don&#8217;t bother making Web sites &#8212; go get a life&#8221;. </p>
<p><img src="http://sebastians-pamphlets.com/img/posts/derek-powazek-2.png" width="150" height="200" align="right" style="margin-left:3px;" alt="Derek Powazek" title="Pathetic assclown" />Well, I consider very few folks capable of mastering everything in Web development and Internet marketing. Clearly, Derek Powazek is not a member of this elite. With one clueless, uninformed and way too offensive rant he has ruined his reputation in a single day. Shortly after his first thoughtless blog post libelling fellow developers and consultants, Google&#8217;s search result page for [<a href="http://www.google.com/search?q=Derek+Powazek&#038;num=100">Derek Powazek</a>] is flooded with <a href="http://workbench.cadenhead.org/news/3563/dear-derek-powazek-seo-legitimate">reasonable reactions</a> revealing that Derek Powazek&#8217;s pathetic calls for ego food are factually wrong.  </p>
<p>Of course calm and knowledgeable <a href="http://www.seobythesea.com/?p=2986">experts in the field</a> setting the record straight, like <a href="http://searchengineland.com/an-open-letter-to-derek-powazek-on-the-value-of-seo-27680">Danny</a><a href="http://searchengineland.com/seo-faq-thats-not-from-the-land-of-unicorns-27695"> Sullivan</a> (search results #1 and #4 for [Derek Powazek] today) and <a href="http://www.seobook.com/seo-scam" title="Peter's post on Aaron Wall's SEObook">Peter da Vanzo</a> (SERP position #9), can outrank a largely unknown guy like Derek Powazek at all major search engines. Now, for the rest of his presence on this planet, Derek Powazek has to live with search results that tell the world what kind of an &#8220;expert&#8221; he really is (<a href="http://www.ask-kalena.com/seo/dumbass-of-the-week-derek-powazek/" title="Dumbass of the week">example</a> <a href="http://managinggreatness.com/2009/10/14/why-derek-powazeks-posts-were-reprehensible/">&#8230;</a>).</p>
<p>He should have read <a href="http://www.blogger.com/profile/10553935900881505139">Susan Moskwa</a>&#8217;s <a href="http://googleblog.blogspot.com/2009/10/managing-your-reputation-through-search.html">very informative article about reputation management</a> on Google&#8217;s Blog a day earlier. Not that reputation management doesn&#8217;t count as an SEO skill &#8230; actually, that&#8217;s SEO basics (as well as <a href="http://snydeysense.com/2009/10/17/my-short-response-to-derek-powazek/">URI canonicalization</a>).</p>
<p>Dear Derek Powazek, guess what all the bright folks you&#8217;ve bashed so cleverly will do when you ask them to take down their responses to your uncalled-for dirty talk?</p>
<p></p>
<p>So what can we learn from this gratuitous debacle? <b>Do not piss in someone&#8217;s roses</b> when</p>
<ul>
<li>you suffer from an oversized ego,</li>
<li>you&#8217;ve not the slightest clue what you&#8217;re talking about,</li>
<li>you can&#8217;t make a point with proven facts, so you&#8217;ve to use false pretences and clueless assumptions,</li>
<li>you tend to insult people when you&#8217;re out of valid arguments,</li>
<li>willy whacking is not for you, because your dick is, well, somewhat undersized.</li>
</ul>
<p>Ok, it&#8217;s Friday evening, so I&#8217;m supposed to enjoy TGIF&#8217;s. Why the fuck am I wasting my valuable spare time writing this pamphlet? Here&#8217;s why: </p>
<p>Having worked in, led, and coached WebDev teams on crawlability and best practices with regard to search engine crawling and indexing for ages now, I was faced with brain amputated wannabe geniuses more than once. Such assclowns are able to shipwreck great projects. From my experience the one and only way to keep teams sane and productive is sacking troublemakers at the moment you realize they&#8217;re unconvinceable. This Powazek dude has perfectly proven that his ignorance is persistent, and that his anti-social attitude is irreversible. He&#8217;s the prime example of a guy I&#8217;d never hire (except if I&#8217;d work for my worst enemy). Go figure. </p>
<hr width="50px;" />
<p id="derek-powazek-lame-excuse">Update 2009-10-19: <a href="http://powazek.com/posts/2146" rel="crap nofollow">I consider this a lame excuse</a>. Actually, it&#8217;s even more pathetic than the malicious slamming of many good folks in his previous posts. If Derek Powazek really didn&#8217;t know what &#8220;SEO&#8221; means in the first place, his brain farts attacking something he didn&#8217;t understand at the time of publishing his rants are indefensible, provided he was anything south of sane then. <a href="http://searchengineland.com/thoughts-on-web-developers-seo-reputation-problems-28047">Danny Sullivan</a> doesn&#8217;t agree, and he&#8217;s right when he says that every industry has some black sheep, but as much as I dislike comment spammers, I dislike bullshit and baseness.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/detox-your-web-development-team/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Search engines should make shortened URIs somewhat persistent</title>
		<link>http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/</link>
		<comments>http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 17:29:26 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Social Web]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[URI shortening]]></category>

		<category><![CDATA[Twitter]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/</guid>
		<description><![CDATA[
URI shorteners are crap. Each and every shortened URI expresses a design flaw. All &#8211;or at least most&#8211; public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to &#8220;put traffic at risk&#8221;. Not to speak of link love (PageRank, Google [...]]]></description>
			<content:encoded><![CDATA[
<p><a href="http://tag.us.com/uri-shorteners-suck-ass.htm">URI shorteners are crap</a>. Each and every shortened URI expresses a design flaw. All &#8211;or at least most&#8211; public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to &#8220;put traffic at risk&#8221;. Not to speak of link love (PageRank, Google juice, link popularity) lost forever.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/se-rescue-short-url.png" width="250" height="222" align="right" alt="SEs could rescue tiny URLs" title="Dear search engines, please make our shortened URIs persistent!" style="margin-left:3px;" />Search engines could provide a way out of the <strong>sURL dilemma</strong> that Twitter &amp; Co created with their crappy, thoughtless and shortsighted software designs. Here&#8217;s how:</p>
<p>Most browsers support search queries in the address bar, as well as suggestions (aka search results) on DNS errors, and sometimes even on 404s or other HTTP response codes outside the 200/3xx range. That means browsers &#8220;ask a search engine&#8221; when an HTTP request fails.</p>
<p>When a <acronym title="Top Level Domain, that's .com/.net/.org...">TLD</acronym> is out of service, search engines could have crawled a 301 or meta refresh from a page formerly living on a <code>.yu</code> domain for example. They know the new address and can lead the user to this (working) URI.</p>
<p>The same goes for shortened URIs created ages ago by URI shortening services that died in the meantime. Search engines have transferred all the link juice from the shortened URI to the destination page already, so why not point users that request a dead <i>short URI</i> to the right destination?</p>
<p>Search engines have all the data required for rescuing out-of-service short URIs in their databases. Not de-indexing &#8220;outdated&#8221; URIs belonging to URI shorteners would be a minor tweak. At least Google has stored attributes and behavior of all links on the Web since the last century, and most probably other search engines are operated by data rats too.</p>
<p>URI shorteners can be identified by simple patterns. They gather tons of inbound links from foreign domains that get redirected (not always using a 301!) to URIs on other 3rd party domains. Of course that applies to some AdServers too, but rest assured search engines do know the differences.</p>
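<p>As a toy illustration of such a pattern, here&#8217;s a Python sketch that flags domains whose redirects are linked from many foreign domains and lead to many third-party domains. The input format, the thresholds, and the function name are all assumptions of mine; real search engines surely use far richer signals.</p>

```python
from collections import defaultdict

def looks_like_shortener(observations, min_sources=3, min_targets=3):
    """observations: (linking_domain, redirecting_domain, target_domain) tuples.

    Returns the redirecting domains that are linked from at least min_sources
    foreign domains and redirect to at least min_targets other domains.
    """
    sources = defaultdict(set)
    targets = defaultdict(set)
    for linking, redirecting, target in observations:
        if linking != redirecting:
            sources[redirecting].add(linking)
        if target != redirecting:
            targets[redirecting].add(target)
    return {domain for domain in sources
            if len(sources[domain]) >= min_sources
            and len(targets[domain]) >= min_targets}

# Made-up crawl data: "sho.rt" behaves like a shortener, "blog.example" not.
crawl = [
    ("a.example", "sho.rt", "news.example"),
    ("b.example", "sho.rt", "shop.example"),
    ("c.example", "sho.rt", "wiki.example"),
    ("a.example", "blog.example", "blog.example"),
    ("b.example", "blog.example", "blog.example"),
    ("c.example", "blog.example", "blog.example"),
]
print(looks_like_shortener(crawl))  # {'sho.rt'}
```

<p>An ad server would pass this naive test as well, which is exactly why the real thing needs more signals than link counts.</p>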
<p><strong>So why the heck haven&#8217;t Google, <strike>Yahoo/MSN</strike> Bing, and Ask offered such a service yet? I thought it&#8217;s all about users, but I might have misread something. Sigh.</strong></p>
<p><small>By the way, I&#8217;ve recorded search engine misbehavior with regard to shortened URIs that could arouse Jack The Ripper, but that&#8217;s a completely different story.</small></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/dear-search-engines-please-rescue-our-shortened-urls/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
