<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; Search Quality</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!</title>
		<link>http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/</link>
		<comments>http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/#comments</comments>
		<pubDate>Mon, 25 Feb 2008 11:08:34 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[URL removal]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/</guid>
		<description><![CDATA[Matt Cutts asks us How should Google handle NOINDEX? That&#8217;s a tough question worth thinking twice before you submit a comment to Matt&#8217;s post. Here is Matt&#8217;s question, all the background information you need, and my opinion.
What is NOINDEX?
Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/dear-google-please-respect-noindex.png" width="250" height="230" align="right" style="margin-left:4px;" alt="Dear Google, please respect NOINDEX" title="Dear Google, please respect NOINDEX, it means don't mention on SERPs!" />Matt Cutts asks us <a href="http://www.mattcutts.com/blog/google-noindex-behavior/">How should Google handle NOINDEX?</a> That&#8217;s a tough question worth thinking twice before you <a href="http://www.mattcutts.com/blog/google-noindex-behavior/#postcomment">submit a comment to Matt&#8217;s post</a>. Here is Matt&#8217;s question, all the background information you need, and my opinion.</p>
<h3>What is NOINDEX?</h3>
<p><a href="http://www.robotstxt.org/meta.html">Noindex</a> is an indexer directive defined in the <a href="http://sebastians-pamphlets.com/links/categories/?cat=crawler-directives">Robots Exclusion Protocol</a> (REP) from 1996 for use in <a href="http://sebastians-pamphlets.com/links/categories/?cat=robots-meta-tags">robots meta tags</a>. Putting a <b>NOINDEX</b> value in a page&#8217;s robots meta tag or <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tag</a> <b>tells search engines that they shall not index the page content</b>, but may follow links provided on the page.</p>
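<p>To make the mechanics concrete, here is a minimal Python sketch of how a crawler might collect the directives from a page&#8217;s robots meta tag and its X-Robots-Tag header. The names <code>RobotsMetaParser</code>, <code>page_directives</code> and <code>may_index</code> are made up for this illustration, and the sketch ignores crawler specific variants (a Googlebot meta tag, or a user agent prefix in the header).</p>

```python
from html.parser import HTMLParser


def split_directives(value):
    """Split a comma separated directive list into lowercase tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directive tokens from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() == "robots":
            self.directives |= split_directives(attributes.get("content", ""))


def page_directives(html, headers=None):
    """Merge directives from the robots meta tag and an X-Robots-Tag header."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            directives |= split_directives(value)
    return directives


def may_index(html, headers=None):
    """NOINDEX forbids indexing the content; it says nothing about following links."""
    return "noindex" not in page_directives(html, headers)
```

<p>For example, <code>may_index('&lt;meta name="robots" content="noindex, follow"&gt;')</code> evaluates to <code>False</code>, while the links on that page may still be followed.</p>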
<p>To <a href="http://sebastians-pamphlets.com/robots-exclusion-protocol-round-up-2008-01/">get a grip on NOINDEX&#8217;s role in the REP</a> please read my <a href="http://www.seomoz.org/blog/robots-exclusion-protocol-101">Robots Exclusion Protocol summary at SEOmoz</a>. Also, <a href="http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/">Google experiments with NOINDEX as crawler directive</a> in <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, more on that later.</p>
<h3>How major search engines treat NOINDEX</h3>
<p>Of course you could <a href="http://sebastians-pamphlets.com/links/categories/?definitions=TRUE">read a ton of my pamphlets</a> to extract this information, but <a href="http://www.mattcutts.com/blog/handling-noindex-meta-tags/">Matt&#8217;s summary</a> is still accurate and easier to digest:</p>
<blockquote><ul>[Matt Cutts on August 30, 2006]
<li>Google doesn’t show the page in any way.</li>
<li>Ask doesn’t show the page in any way.</li>
<li>MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.</li>
<li>Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.</li>
</ul>
<p>Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]</p>
</blockquote>
<p>Google&#8217;s experimental support of NOINDEX as a crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#existing-rep-tags">the DISALLOW behavior of NOINDEX in robots.txt is completely wrong</a>, and without any doubt not compliant with the Robots Exclusion Protocol.</p>
<h3>Matt&#8217;s question: How should Google handle NOINDEX in the future?</h3>
<p>To simplify <a href="http://www.mattcutts.com/blog/wp-content/plugins/democracy/democracy.php?dem_action=show_vote_screen&#038;dem_poll_id=6">Matt&#8217;s poll</a>, let&#8217;s assume he&#8217;s talking about NOINDEX as an <b>indexer directive</b>, regardless of where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).</p>
<blockquote><p>The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?</p>
</blockquote>
<p>Here are the arguments, or pros and cons, for each variant:</p>
<dl>
<dt>Google should completely drop a NOINDEX’ed page from their search results</dt>
<dd>
<p>Obviously that&#8217;s what most Webmasters would prefer:</p>
<blockquote><p>This is the behavior that we&#8217;ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way &#8212; in fact, one of the only ways &#8212; to completely remove all traces of a site from Google (another way is our <a href="http://www.google.com/webmasters/tools/removals">url removal tool</a>). That&#8217;s incredibly useful for webmasters.</p>
</blockquote>
<p><b>NOINDEX means don&#8217;t index</b>, search engines must respect such directives, even when the content isn&#8217;t <a href="http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/">password protected</a> or <a href="http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/">cloaked away</a> (redirected or hidden for crawlers but not for visitors). </p>
<p>The corner case that Google discovers a link and lists it on their SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn&#8217;t crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which is requested by search engines quite frequently. Ok, maybe Google&#8217;s <abbr title="Ms. Googlebot">BlitzCrawler&trade;</abbr> has to request robots.txt more often then.</p>
</dd>
<dt>Google should show a reference to NOINDEX&#8217;ed pages on their SERPs</dt>
<dd>
<p>Search quality and user experience are strong arguments:</p>
<blockquote><p>Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don&#8217;t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google&#8217;s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren&#8217;t showing up in Google because of the NOINDEX meta tag, that&#8217;s bad for users (and thus for Google).</p>
</blockquote>
<p>Search quality and searchers&#8217; user experience are also strong arguments for totally delisting NOINDEX&#8217;ed pages, because most Webmasters use this indexer directive to keep stuff that doesn&#8217;t provide value for searchers out of the search indexes. &lt;polemic&gt;I mean, how much weight do a few Korean sites carry when it comes to decisions that affect the whole Web?&lt;/polemic&gt;</p>
<p>If a Webmaster puts a NOINDEX directive by accident, that&#8217;s easy to spot in the site&#8217;s stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.</p>
<p>The reference to the URI removal tools is out of context, because these tools remove a URI only for a short period of time and all removal requests have to be resubmitted every few weeks. NOINDEX on the other hand is a way to keep a URI out of the index as long as this directive is provided. </p>
<p>I&#8217;d say the sole argument for listing references to NOINDEX&#8217;ed pages that counts is misleading navigational searches. Of course that does not mean that Google may ignore the NOINDEX directive to show &#8211;with a linked reference&#8211; that they know a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.</p>
</dd>
<dt>Something in between, Google should find a reasonable way to please both Webmasters and searchers</dt>
<dd>
<p>Quoting Matt again:</p>
<blockquote><p>The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don&#8217;t want to show up in Google). Users are most discouraged when they search for a well-known site and can&#8217;t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don&#8217;t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.</p>
</blockquote>
<p>Whether or not a site is popular must not impact a search engine&#8217;s respect for a Webmaster&#8217;s decision to keep search engines, and their users, out of her realm. That reads like &#8220;Hey, Google is popular, so we&#8217;ve the right to go to Mountain View to pillage the Googleplex, acquiring everything we can steal for the public domain&#8221;. Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google&#8217;s idea of (link) popularity should rule the Web. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web&#8217;s content wasn&#8217;t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I&#8217;m not bashing DMOZ here, but as a matter of fact it&#8217;s not comprehensive enough to serve as indicator for anything, especially not importance and popularity.</p>
<p>I strongly believe that there&#8217;s no such thing as a criterion suitable to mark out a two class Web.</p>
</dd>
</dl>
<h3>My take: Yes, No, Depends</h3>
<p>Google could enhance navigational queries &#8211;and even &#8220;I feel lucky&#8221; queries&#8211; that lead to a NOINDEX&#8217;ed page with a message like &#8220;The best matching result for this query was blocked by the site&#8221;. I wouldn&#8217;t mind if they mention the URI as long as it&#8217;s not linked.</p>
<p>In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we&#8217;ve more REP indexer directives that could play a role in this discussion.</p>
<p>NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site&#8217;s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX&#8217;ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.</p>
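<p>The rule suggested above can be sketched as a small decision function. This is only my reading of the proposal, not anything a search engine implements. The function name is hypothetical, and the treatment of NOINDEX&#8217;ed pages that are not main pages (no reference, as today) is an assumption, because the proposal only covers home pages.</p>

```python
# Directives that, combined with NOINDEX on a home page, would signal
# "do not even show a reference" under the proposal sketched above.
ELABORATING_DIRECTIVES = {"noodp", "noydir", "noarchive", "nosnippet"}


def may_show_reference(directives, is_home_page):
    """Decide whether a SERP may show a bare reference to a URI."""
    tokens = {directive.strip().lower() for directive in directives}
    if "noindex" not in tokens:
        # Not NOINDEX'ed at all: the page is indexed and listed as usual.
        return True
    if not is_home_page:
        # Assumption: deeper pages keep today's behavior, no reference.
        return False
    # Home page: a reference is allowed unless another directive elaborates NOINDEX.
    return not (tokens & ELABORATING_DIRECTIVES)
```

<p>With this reading, <code>may_show_reference(["NOINDEX"], True)</code> allows a reference, while <code>may_show_reference(["NOINDEX", "NOARCHIVE"], True)</code> does not.</p>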
<p>Of course that&#8217;s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it&#8217;s way better than ignoring NOINDEX in its current meaning. </p>
<h3>A fair solution: NOREFERENCE</h3>
<p>If I were making the decision at Google and couldn&#8217;t live with a <em>best matching search result blocked</em>&nbsp; message, I&#8217;d go for a new REP tag:</p>
<p>&#8220;NOINDEX, NOREFERENCE&#8221; in a robots meta tag &#8211;respectively Googlebot meta tag&#8211; or X-Robots-Tag forbids search engines to show a reference on their SERPs. In robots.txt this would look like <code><br />
<b>NOINDEX: /<br />
NOINDEX: /blog/<br />
NOINDEX: /members/<br />
&#8230;<br />
NOREFERENCE: /<br />
NOREFERENCE: /blog/<br />
NOREFERENCE: /members/<br />
&#8230;</b></code><br />
Search engines would crawl these URIs, and follow their links as long as there&#8217;s no NOFOLLOW directive either in robots.txt or a page specific instruction.</p>
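<p>A rough Python sketch of how a crawler could evaluate such a robots.txt follows. Keep in mind that NOREFERENCE is only my proposal and NOINDEX in robots.txt is experimental, so none of this is an existing standard. The function names are hypothetical, user agent sections are ignored, paths are matched as plain prefixes, and a NOREFERENCE line without a matching NOINDEX line is treated as having no effect, which is an assumption.</p>

```python
def parse_proposed_rules(robots_txt):
    """Collect path prefixes for the proposed NOINDEX and NOREFERENCE fields."""
    rules = {"noindex": [], "noreference": []}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, _, path = line.partition(":")
        field = field.strip().lower()
        path = path.strip()
        if field in rules and path:
            rules[field].append(path)
    return rules


def treatment(path, rules):
    """Return 'index', 'reference-only' or 'hidden' for a URL path."""

    def matches(field):
        return any(path.startswith(prefix) for prefix in rules[field])

    if not matches("noindex"):
        return "index"
    # NOINDEX alone still allows a reference; NOREFERENCE removes it too.
    return "hidden" if matches("noreference") else "reference-only"
```

<p>Crawling is not affected in this sketch: every path may still be fetched and its links followed, the rules only decide what may appear on the SERPs.</p>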
<p>NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but allows references on SERPs. Supporting this indexer directive both in robots.txt and on the page (respectively in the HTTP header for X-Robots-Tags) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like <code><b>NOINDEX=NOREFERENCE</b></code> for robots.txt could tell search engines how they have to treat NOINDEX statements on site level, or even on site area level.</p>
<p>Even more appealing would be <code><b>NOINDEX=REFERENCE</b></code>, because only the very few Webmasters that would like to see their NOINDEX&#8217;ed URIs on Google&#8217;s SERPs would have to add a directive to their robots.txt at all. Unfortunately, that&#8217;s not doable for Google unless they can convince three well known Korean sites to edit their robots.txt. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>&nbsp;</p>
<p>By the way, don&#8217;t miss out on my draft asking for <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">REP tag support in robots.txt</a>!</p>
<p>Anyway: <b>Dear Google, please don&#8217;t touch NOINDEX!</b> <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The hacker tool MSN-LiveSearch is responsible for brute force attacks</title>
		<link>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/</link>
		<comments>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 15:36:08 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Testing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/</guid>
		<description><![CDATA[A while ago I staged a public SEO contest, asking whether the 401 HTTP response code prevents search engine indexing or not. 
Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 [...]]]></description>
			<content:encoded><![CDATA[<p><img  src="http://sebastians-pamphlets.com/img/posts/401-private-property-keep-out.png" width="200" height="133" align="right" style="margin-left:4px;" alt="401 = Private Property, keep out!" title="401 = Private Property, keep out!" />A while ago I staged a public <a href="http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/">SEO contest</a>, asking whether the 401 HTTP response code prevents search engine indexing or not. </p>
<p>Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 HTTP response code that translates to a polite &#8220;Authorization Required&#8221;, meaning &#8220;Forbidden unless you provide valid authorization&#8221;. </p>
<p>Life experience and common sense tell search engines that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters/site owners do not point their users to protected content. </p>
<p>Also, that makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly &#8220;Enter user and password in order to access [title of the protected area]&#8221;, that resolves to a harsh error message because the searcher can&#8217;t provide such information, and usually can&#8217;t even sign up from the 401 error page<sup><a href="#401-error-document-footnote">1</a></sup>. </p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/evil-use-of-search-results.png" width="200" height="255" align="right" style="margin-left:4px;" alt="Evil use of search results" title="The evil variant of search results " />Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example <code>/members/</code>. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web. </p>
<p>Developing a script that searches for a string like <code>/members/</code> in URLs and then &#8220;tests&#8221; the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker&#8217;s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DOS-like server attacks.</p>
<p>Ok, let&#8217;s look at the test results. All search engines have cast their votes now. <b>Here are the winners:</b> </p>
<h3>Google <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Once my test was out, <a href="http://mattcutts.com/blog/">Matt Cutts</a> from <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40207">Google</a> researched the question and told me:</p>
<blockquote><p>My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won&#8217;t be indexed even as a reference, so .htacess password-protected directories shouldn&#8217;t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn&#8217;t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.</p>
</blockquote>
<p>Well, that&#8217;s exactly the expected behavior, and I wasn&#8217;t surprised that my test results confirm Matt&#8217;s statement. Thanks to Google&#8217;s BlitzIndexing&trade; Ms. Googlebot spotted the 401 so fast, that the URL never showed up on Google&#8217;s SERPs. Google reports the <a href="http://sebastians-pamphlets.com/porn/">protected URL</a> in my <a href="http://google.com/webmasters/tols/">Webmaster Console</a> account for this blog as not indexable.</p>
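<p>Matt&#8217;s statement boils down to a small decision rule, sketched here in Python. This is my paraphrase, not Google&#8217;s code. The names <code>Listing</code> and <code>listing_for</code> are invented, treating 403 like 401 is my reading of &#8220;401/Forbidden&#8221;, and dropping every other status code is a simplification.</p>

```python
from enum import Enum


class Listing(Enum):
    FULL = "full listing"
    REFERENCE = "URL reference only"
    NONE = "not listed"


def listing_for(status_code):
    """Map a crawl result to what may show up on the SERPs.

    status_code is None when the URL is known from links only
    and has not been crawled yet.
    """
    if status_code is None:
        # Discovered but not crawled: a bare URL reference can still appear.
        return Listing.REFERENCE
    if status_code in (401, 403):
        # Crawled and found protected: not indexed, not even as a reference.
        return Listing.NONE
    if status_code == 200:
        return Listing.FULL
    # Simplification: everything else (redirects, errors) is not listed here.
    return Listing.NONE
```

<p>The interesting case is the first one: only a timely crawl that actually sees the 401 closes the window in which a reference to the protected URL could be listed.</p>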
<h3>Yahoo <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Yahoo&#8217;s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that&#8217;s going to change if <a href="http://searchengineland.com/080201-064343.php">M$ buys Yahoo</a>. </p>
<h3>Ask <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Ask&#8217;s crawler isn&#8217;t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.</p>
<p><b style="font-size:110%;">And here is the ultimate loser:</b></p>
<h3>MSN LiveSearch <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </h3>
<p>Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker&#8217;s toolbox:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/msn-indexes-401-protected-urls.png" width="467" height="223" align="center" style="" alt="MSN LiveSearch indexes password protected URLs" title="MSN LiveSearch indexes password protected URLs" /></p>
<p>As if indexing references to password protected URLs weren&#8217;t crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolutely no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The <a href="http://sitemaps.org/">sitemap protocol</a>, which MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/msn-lists-unlinked-porn-sitemap-file-2008-01.png" width="460" height="54" align="center" style="" alt="MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)" title="MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)" /><br />
<img  src="http://sebastians-pamphlets.com/img/posts/msn-indexes-unlinked-porn-sitemap-file-2008-01.png" width="460" height="58" align="center" style="" alt="MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)" title="MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)" /></p>
<p>All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask have indexed the sitemap file (they never index submitted sitemaps that have no inbound links by the way) or its protected URL.</p>
<h3>Summary</h3>
<p><b style="font-size:110%;">All major search engines except MSN respect the 401 barrier.</b></p>
<p>Since <a href="http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/">MSN LiveSearch is well known for spamming</a>, it&#8217;s not a big surprise that they support hackers, scrapers and other content thieves. </p>
<p>Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can&#8217;t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn&#8217;t exist in most niches), and so on. Once pointed to such crap, they don&#8217;t even fix the simplest bugs in a timely manner. I mean, not complying with the HTTP 1.1 protocol from the last century is evidence of incapacity, and that&#8217;s just one example.</p>
<p>&nbsp;</p>
<p><b>Update Feb/06/2008:</b> Last night I received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, <a href="http://nathanbuggia.com/">Nathan Buggia</a>! I&#8217;m still curious how MSN Live Search will handle sitemap files in the future.</p>
<p>&nbsp;</p>
<hr width="128" color="silver" align="center" />
<p id="401-error-document-footnote"><sup>1</sup>&nbsp;<small>Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there&#8217;s no 401 error document. Please note that you shouldn&#8217;t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a <a href="http://sebastians-pamphlets.com/error401.html">static 401 error page</a> that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you <b>must not</b> use absolute URLs (http://&#8230; or https://&#8230;) in the ErrorDocument 401 directive, and that on the error page you <b>must</b> use absolute URLs for CSS, images, links and whatnot because relative URIs don&#8217;t work there!</small></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google removes the #6 penalty/filter/glitch</title>
		<link>http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/</link>
		<comments>http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/#comments</comments>
		<pubDate>Tue, 29 Jan 2008 06:50:51 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/</guid>
		<description><![CDATA[After the great #6 Penalty SEO Panel Google&#8217;s head of the webspam dept. Matt Cutts dug out a misbehaving algo and sent it back to the developers. Two hours ago he stated: 
When Barry asked me about &#8220;position 6&#8221; in late December, I said that I didn&#8217;t know of anything that would cause that. But [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/google-pos-six-penalty-removed.png" width="200" height="299" align="right" style="margin-left:4px;" alt="Google removed the position six penalty" title="Google takes back the #6 filter/glitch/penalty"  />After the great <a href="http://sebastians-pamphlets.com/the-fictive-numeric-google-penalty-seo-hero-panel/">#6 Penalty SEO Panel</a> Google&#8217;s head of the webspam dept. <a href="http://mattcutts.com/blog/">Matt Cutts</a> dug out a misbehaving algo and sent it back to the developers. Two hours ago he <a href="http://sphinn.com/story/24687#c29022">stated</a>: </p>
<blockquote><p>When Barry <a href="http://www.seroundtable.com/archives/015799.html">asked</a> me about &#8220;position 6&#8221; in late December, I <a href="http://www.seroundtable.com/archives/015799.html#comment-673474">said</a> that I didn&#8217;t know of anything that would cause that. But about a week or so after that, my attention was brought to something that could exhibit that behavior.</p>
<p>We&#8217;re in the process of changing the behavior; I think the change is live at some datacenters already and will be live at most data centers in the next few weeks.</p>
</blockquote>
<p>&nbsp;</p>
<p>So everything is fine now. Matt penalizes the position-six software glitch, and lost top positions will revert to their former rankings in a while. Well, not really. Nobody will compensate income losses, nor the time Webmasters spent on forums discussing a suspected penalty that actually was a bug or a weird side effect. However, kudos to Google for listening to concerns, tracking down and fixing the algo. And thanks for the update, Matt.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Do search engines index references to password protected smut?</title>
		<link>http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/</link>
		<comments>http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/#comments</comments>
		<pubDate>Wed, 16 Jan 2008 14:42:15 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Testing]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Fun]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/</guid>
		<description><![CDATA[Recently Matt Cutts said that Google doesn&#8217;t index password protected content. I wasn&#8217;t sure whether or not that goes for all search engines. I thought that they might index at least references to protected URLs, like they all do with other uncrawlable content that has strong inbound links.
Well, SEO tests are dull and boring, so [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/how-prudish-are-search-engines.png" width="150" height="521" align="right" style="margin-left:6px;" alt="how prudish are search engines" title="How prudish are search engines?"  />Recently <a href="http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/#does-google-index-protected-smut">Matt Cutts said</a> that Google doesn&#8217;t index password protected content. I wasn&#8217;t sure whether or not that goes for all search engines. I thought that they might index at least <a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=site:sebastians-pamphlets.com+inurl:/porn/">references to protected URLs</a>, like they all do with other uncrawlable content that has strong inbound links.</p>
<p>Well, SEO tests are dull and boring, so I thought I could have some fun with this one.</p>
<p>I&#8217;ve <a href="http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/#does-google-index-protected-smut">joked</a> that I should use someone&#8217;s favorite smut collection to test it. Unfortunately, nobody was willing to trade porn passwords for link love or so. I&#8217;m not a hacker, hence I&#8217;ve created my own tiny collection of password protected <a href="http://sebastians-pamphlets.com/porn/">SEO porn</a> (this link is not exactly considered safe at work) as a test case.</p>
<p>I was quite astonished that according to <a href="http://www.seomoz.org/ugc/kids-plug-your-ears-i-am-going-to-talk-about-seo-p-rn" title="SEOMOZ: Kids, Plug Your Ears! I Am Going to Talk About SEO Porn!">this post about SEO porn</a> next to nobody in the SEOsphere optimizes adult sites (of course that&#8217;s not true). From the comments I figured that some folks at least <strike>surf for SEO porn</strike> evaluate the optimization techniques applied by adult Webmasters.</p>
<p>Ok, let&#8217;s extend that. <b>Out yourself as an SEO porn savvy Internet marketer</b>. Leave your email addy in the comments (don&#8217;t forget to tell me why I should believe that you&#8217;re over 18), and I&#8217;ll email you the super secret password for my <a href="http://sebastians-pamphlets.com/porn/">SEO porn members area</a> (!SAW). Trust me, it&#8217;s worth it, and perfectly legit due to the strictly scientific character of this experiment. If you&#8217;re somewhat shy, use a funny pseudonym.</p>
<p>I&#8217;d very much appreciate a little help with linkage too. Feel free to link to <code><b title="The finest SEO porn on this planet!">http://sebastians-pamphlets.com/porn/</b></code> with an adequate anchor text of your choice, and of course without <a href="http://link-condom.com/">condom</a>. </p>
<p><a href="http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/#respond" style="font-size:120%; font-weight:bold; color:red;">Get the finest SEO porn available on this planet!</a></p>
<p><a href="http://sebastians-pamphlets.com/porn/">I&#8217;ve got the password, now let me in!</a></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/feed/</wfw:commentRss>
		</item>
		<item>
		<title>No more RSS feeds in Google&#8217;s search results</title>
		<link>http://sebastians-pamphlets.com/google-to-remove-feeds-from-web-search-results/</link>
		<comments>http://sebastians-pamphlets.com/google-to-remove-feeds-from-web-search-results/#comments</comments>
		<pubDate>Tue, 18 Dec 2007 08:34:44 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Blogging]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-to-remove-feeds-from-web-search-results/</guid>
		<description><![CDATA[Folks try all sorts of naughty things when by accident a blog&#8217;s feed outranks the HTML version of a post. That mostly happened to less popular blogs, or with very old posts and categorized feeds that contain ancient articles.
The problem seems to be that Google&#8217;s Web search doesn&#8217;t understand the XML structure of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/google-killing-rss-feeds.png" width="250" height="250" align="right" style="margin-left:4px;" alt="Google killing RSS feeds" title="Google terminates rss/atom-feeds"  />Folks try all sorts of naughty things when by accident a blog&#8217;s feed outranks the HTML version of a post. That mostly happened to less popular blogs, or with very old posts and categorized feeds that contain ancient articles.</p>
<p>The problem seems to be that Google&#8217;s Web search doesn&#8217;t understand the XML structure of feeds, so that a feed&#8217;s textual contents get indexed like stuff from text files. Due to &#8220;subscribe&#8221; buttons and other links, feeds can gather more PageRank than some HTML pages. Interestingly .xml is considered an unknown file type, and advanced search doesn&#8217;t provide a way to search within XML files.</p>
<p>Now that has changed<sup><a href="#danny-not-news-reminder">1</a></sup>. Googler Bogdan St&#259;nescu posts on the German Webmaster blog<sup><a href="#google-feed-removal-english-post">2</a></sup> <a href="http://googlewebmastercentral-de.blogspot.com/2007/12/wir-entfernen-feeds-aus-unseren.html"><b>We remove feeds from our search results</b></a>:</p>
<blockquote><p>As Webmasters many of you were probably worried that your RSS or Atom feeds could outrank the accompanying HTML pages in Google&#8217;s search results. The emergence of feeds in our search results could be a poor user experience:  </p>
<p>1. Feeds increase the probability that the user gets the same search result twice.</p>
<p>2. Users who click on the feed link on a SERP may miss out on valuable content, which is only available on the HTML page referenced in the XML file.</p>
<p>For these reasons, we have removed feeds from our Web search results - with the exception of podcasts (feeds with media files). </p>
<p>[&#8230;] We are aware that in addition to the podcasts out there some feeds exist that are not linked with an HTML page, and that is why it is not quite ideal to remove all feeds from the search results. We&#8217;re still open for feedback and suggestions for improvements to the handling of feeds. We look forward to your comments and questions in the <a href="http://groups.google.com/group/Google_Webmaster_Help-Indexing/topics">crawling, indexing and ranking section</a> of our <a href="http://groups.google.com/group/Google_Webmaster_Help">discussion forum for Webmasters</a>. [Translation mine]</p>
</blockquote>
<p>I&#8217;m not yet sure whether or not that&#8217;s ending in a ban of all/most XML documents. I hope they suppress RSS/Atom feeds only, and provide improved ways to search for and within other XML resources.</p>
<p>So what does that mean for blog SEO? Unless Google provides a procedure to prevent feeds from accumulating PageRank whilst allowing access for blog search crawlers that request feeds (I believe something like that is in the works), it&#8217;s still a good idea to nofollow all feed links, but there&#8217;s absolutely no reason to block them in robots.txt any more.</p>
<p>I think that&#8217;s a great move in the right direction, though only a preliminary solution. The XML structure of feeds isn&#8217;t that hard to parse, and there are only so many ways to extract the URL of the HTML page. Then, when a relevant feed lands in a raw result set, Google should display a link to the HTML version on the SERP. What do you think?</p>
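<p>To back up the claim that this isn&#8217;t hard, here is a minimal sketch of pulling the HTML page URLs out of a feed. It assumes plain RSS 2.0 items or Atom entries; RSS 1.0/RDF and other flavors are not handled. The function name and the sample feeds are made up for illustration, and this is of course not how Google does it.</p>

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def html_urls_from_feed(feed_xml):
    """Return the HTML page URLs that a feed's items/entries point to."""
    root = ET.fromstring(feed_xml)
    urls = []
    # RSS 2.0: each <item> carries the HTML page URL as text in <link>.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            urls.append(link.strip())
    # Atom: each <entry> has <link href="..."/>; a missing rel means "alternate".
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                urls.append(link.get("href"))
    return urls


# Hypothetical sample feeds, just enough markup to exercise both branches.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example</title>
<link>http://example.com/</link>
<item><title>Post</title><link>http://example.com/post/</link></item>
</channel></rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
<entry><title>Post</title>
<link rel="self" href="http://example.com/feed/1"/>
<link href="http://example.com/post/"/></entry>
</feed>"""

print(html_urls_from_feed(SAMPLE_RSS))
print(html_urls_from_feed(SAMPLE_ATOM))
```

<p>Both calls yield the URL of the HTML post, which is the link a SERP could show instead of the feed URL.</p>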
<hr width="128" />
<p id="danny-not-news-reminder"><sup>1</sup> <a href="http://daggle.com/">Danny</a> <a href="http://sphinn.com/story.php?id=19239#c22568">reminded me</a> that according to <a href="http://mattcutts.com/blog/">Matt Cutts</a> that&#8217;s been going on for a few months now. </p>
<p id="google-feed-removal-english-post"><sup>2</sup> 24 hours later Google published the <a href="http://googlewebmastercentral.blogspot.com/2007/12/taking-feeds-out-of-our-web-search.html">announcement</a> in English too. </p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-to-remove-feeds-from-web-search-results/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MSN spam to continue says the Live Search Blog</title>
		<link>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/</link>
		<comments>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/#comments</comments>
		<pubDate>Wed, 05 Dec 2007 08:58:46 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Spoofing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Cloaking]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/</guid>
		<description><![CDATA[It seems MSN/LiveSearch has tweaked their rogue bots and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and news.
Since August 2007 MSN runs a bogus bot faking a human visitor coming from a search results page, that follows their crawler. This spambot [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/msn-live-search-clueless-webspam-detection.png" width="250" height="352" style="margin-left:4px;" align="right" alt="MSN Live Search clueless webspam detection" title="MSN Live Search is totally clueless when it comes to spam detection"  />It seems MSN/LiveSearch has tweaked their <a href="http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/">rogue bots</a> and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and <a href="http://blogs.msdn.com/webmaster/archive/2007/12/04/live-search-and-cloaking-detection.aspx">news</a>.</p>
<p>Since August 2007 MSN runs a bogus bot faking a human visitor coming from a search results page, that follows their crawler. This spambot downloads everything from a page, that is images and other objects, external CSS/JS files, and ad blocks rendering even contextual advertising from Google and Yahoo. It fakes MSN SERP referrers diluting the search term stats with generic and unrelated keywords. Webmasters running non-adult sites wondered why a database tutorial suddenly ranks for [oral sex] and why MSN sends visitors searching for [<acronym title="Mothers I Like (to) Fuck">MILF</acronym> pix] to a teenager&#8217;s diary. Webmasters assumed that MSN is after deceitful cloaking, and laughed out loud because their webspam detection method was that primitive and easy to fool.</p>
<p>Now MSN admits <a href="http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/">all their sins</a> &#8211;except the launch of a porn affiliate program&#8211; and posted a <a href="http://blogs.msdn.com/webmaster/archive/2007/12/04/live-search-and-cloaking-detection.aspx">vague excuse on their Webmaster Blog</a> telling the world that they discovered the evil cloakers and their index is somewhat spam free now. <a href="http://www.seo-scoop.com/2007/12/04/msnlive-ponies-up-about-the-referrer-spam/">Donna has chatted with the MSN spam team about their spambot</a> and reports that blocking its IP addresses is a bad idea, even for sites that don&#8217;t cloak. <a href="http://www.vanessafoxnude.com/">Vanessa Fox</a> summarized MSN&#8217;s poor man&#8217;s cloaking detection at <a href="http://searchengineland.com/071204-150233.php">Search Engine Land</a>:</p>
<blockquote><p>And one has to wonder how effective methods like this really are. Those savvy enough to cloak may be able to cloak for this new cloaker detection bot as well.</p>
</blockquote>
<p>They say that they no longer spam sites that don&#8217;t cloak, but reverse this statement telling Donna</p>
<blockquote><p>we need to be able to identify the legitimate and illegitimate content</p>
</blockquote>
<p>and Vanessa </p>
<blockquote><p>sites that are cloaking may continue to see some amount of traffic from this bot. This tool crawls sites throughout the web &#8212; both those that cloak and those that don&#8217;t &#8212; but those not found to be cloaking won&#8217;t continue to see traffic.</p>
</blockquote>
<p>Here is an excerpt from yesterday&#8217;s referrer log of a site that does not cloak, and never did: <code><br />
http://search.live.com/results.aspx?q=webmaster&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=smart&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=search&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=progress&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=google&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=google&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=domain&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=database&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=content&#038;mrt=en-us&#038;FORM=LIVSOP<br />
http://search.live.com/results.aspx?q=business&#038;mrt=en-us&#038;FORM=LIVSOP</code><br />
Why can&#8217;t the MSN dudes tell the truth, not even when they apologize?</p>
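<p>If you want to see how much of this stuff pollutes your own keyword stats, a rough sketch like the following tallies the search terms from such referrers. It assumes that <code>FORM=LIVSOP</code> on a <code>search.live.com</code> URL marks the bot, which is only what the excerpt above suggests; a human visitor might carry the same parameter. The function name and sample data are mine.</p>

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def live_search_keywords(referrers):
    """Count the q= terms of Live Search referrers that carry FORM=LIVSOP."""
    counts = Counter()
    for ref in referrers:
        parts = urlsplit(ref)
        if parts.netloc.lower() != "search.live.com":
            continue
        query = parse_qs(parts.query)
        # Assumption: FORM=LIVSOP is the footprint seen in the log excerpt.
        if query.get("FORM") != ["LIVSOP"]:
            continue
        for keyword in query.get("q", []):
            counts[keyword] += 1
    return counts


# Sample referrers modeled on the excerpt, plus one unrelated referrer.
sample = [
    "http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP",
    "http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP",
    "http://search.live.com/results.aspx?q=webmaster&mrt=en-us&FORM=LIVSOP",
    "http://www.google.com/search?q=webmaster",
]
print(live_search_keywords(sample).most_common())
```

<p>Subtract these counts from your MSN keyword stats before you trust them.</p>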
<p>Another lie is &#8220;we obey robots.txt&#8221;. Of course the spambot doesn&#8217;t request it to bypass bot traps, but according to MSN it uses a copy served to the LiveSearch crawler &#8220;msnbot&#8221;:</p>
<blockquote><p>Yes, this robot does follow the robots.txt file. The reason you don’t see it download it, is that we use a fresh copy from our index. The tool does respect the robots.txt the same way that MSNBot does with a caveat; the tool behaves like a browser and some files that a crawler would ignore will be viewed just like real user would.</p>
</blockquote>
<p>In reality, it doesn&#8217;t help to block CSS/JS files or images in robots.txt, because MSN&#8217;s spambot will download them anyway. The long-winded statement above translates to &#8220;We promise to obey robots.txt, but if it fits our needs we&#8217;ll ignore it&#8221;. </p>
<p>Well, MSN is not the only search engine running <a href="http://www.webshoppehosting.com/weblog/?p=17">stealthy bots</a> to detect cloaking, but they aren&#8217;t clever enough to do it in a less abusive and detectable way. </p>
<p>Their insane spambot led all cloaking specialists out there to their not that obvious spam detection methods. They may have caught a few cloaking sites, but considering the short life cycle of Webspam on throwaway domains they shot themselves in both feet. What they really have achieved is that the cloaking scripts are MSN spam detection immune now. </p>
<p>Was it really necessary to annoy and defraud the whole Webmaster community and to burn huge amounts of bandwidth just to catch a few cloakers who launched new scripts on new throwaway domains hours after the first appearance of the MSN spam bot?</p>
<p>Can cosmetic changes with regard to their useless spam activities restore MSN&#8217;s lost reputation? I doubt it. They&#8217;ve admitted their miserable failure five months too late. Instead of dumping the spambot, they announce that they&#8217;ll spam away for the foreseeable future. How silly is that? I thought Microsoft is somewhat profit orientated, why do they burn their and our money with such amateurish projects?</p>
<p>Besides all this crap MSN has good news too. Microsoft Live Search told Search Engine Roundtable that <a href="http://www.seroundtable.com/archives/015534.html">they&#8217;ll spam our sites with keywords related to our content</a> from now on, at least they&#8217;ll try it. And they have a <a href="http://forums.microsoft.com/webmaster/ShowForum.aspx?ForumID=1984&#038;SiteID=79">forum</a> and a <a href="https://feedback.live.com/default.aspx?productkey=livesearchwebmastercenter&#038;mkt=en-us">contact form</a> to gather complaints. Crap on, so much bureaucratic effort to administer their ridiculous spam fighting funeral. They&#8217;d better build a search engine that actually sends human traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Microsoft funding bankrupt Live Search experiment with porn spam</title>
		<link>http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/</link>
		<comments>http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/#comments</comments>
		<pubDate>Fri, 16 Nov 2007 13:20:08 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Spam]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Crap]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/</guid>
		<description><![CDATA[If only this headline would be linkbait &#8230; of course it&#8217;s not sarcastic. 
Rumors are out that Microsoft will launch a porn affiliate program soon. The top secret code name for this project is &#8220;pornbucks&#8221;, but analysts say that it will be launched as &#8220;M$ SMUT CASH&#8221; next year or so. 
Since Microsoft just can&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>If only this headline would be linkbait &#8230; of course it&#8217;s not sarcastic. </p>
<p><img src="http://sebastians-pamphlets.com/img/posts/ms-cash-banner.png" width="220" height="331" border="0" align="right" style="margin-left:4px;" alt="M$ PORN CASH" title="LiveSearch launching a porn affiliate program soon?" />Rumors are out that Microsoft will launch a porn affiliate program soon. The top secret code name for this project is &#8220;pornbucks&#8221;, but analysts say that it will be launched as &#8220;M$ SMUT CASH&#8221; next year or so. </p>
<p>Since Microsoft just can&#8217;t ship anything in time, and the usual delays aren&#8217;t communicated internally, their search dept. began to promote <a href="http://webmaster.live.com/">it</a> to Webmasters this summer. </p>
<p>Surprisingly, Webmasters across the globe <a href="http://www.seo-scoop.com/2007/11/13/past-time-for-msn-to-pony-up-to-the-real-truth-about-referrer-spam/">weren&#8217;t that excited</a> to find promotional messages from Live Search in their log files, so a somewhat confused MSN dude posted a <a href="http://www.webmasterworld.com/msn_microsoft_search/3424476-2-30.htm#msg3442263">lame excuse</a> to a large Webmaster forum. </p>
<p>Meanwhile we found out that Microsoft Live Search does not only target the adult entertainment industry, they&#8217;re testing the waters with other money terms like travel or pharmaceutical products too. </p>
<p>Anytime soon the Live Search menu bar will be updated to something like this:<br />
<img src="http://sebastians-pamphlets.com/img/posts/live-search-spam-menu-bar.png" width="498" height="40" border="0" alt="Live Search Porn Spam Menu" title="LiveSearch: Porn, Viagra, Health Insurance, Debt Consolidation, More Spam ..." /></p>
<p><b>Here is the sad &#8211;but true&#8211; story of a search engine&#8217;s downfall.</b></p>
<p>A few months ago Microsoft Live Search discovered that <a href="http://sebastians-pamphlets.com/when-your-referrer-stats-turn-into-a-porn-tgp/">x-rated referrer spam</a> is a must-have technique in a sneaky smut peddler&#8217;s marketing toolbox. </p>
<p>Since August 2007 a bogus Web robot follows Microsoft&#8217;s search engine crawler &#8220;MSNbot&#8221; to spam the referrer logs of all Web sites out there with URLs pointing to <a href="http://pocketseo.com/msn/176">MSN search result pages featuring porn</a>. </p>
<p>Read your referrer logs and you&#8217;ll find spam from Microsoft too, but perhaps they peeve you with viagra spam, offer you unwanted but cheap payday loans, or try to enlarge your penis. Of course they know every trick in the book on spam, so check for harmless catchwords too. Here is an example URL: <code><br />
<b>http://search.live.com/results.aspx?q= <em style="color:red;">spammy-keyword</em> &#038;mrt=en-us&#038;FORM=LIVSOP</b></code></p>
<p>Microsoft&#8217;s spam bot not only <a href="http://ekstreme.com/thingsofsorts/blogging/yell-if-microsofts-livecom-spammed-you-too">leaves bogus URLs in log files</a>, hoping that Webmasters will click them on their referrer stats pages and maybe sign up for something like &#8220;M$ Porn Bucks&#8221; or so. It also <a href="http://smackdown.blogsblogsblogs.com/2007/11/13/microsoft-needs-to-quit-fucking-with-my-adsense-scripts/">downloads and renders even adverts powered by their rival Google</a>, lowering their CTR; obviously to make programs like AdSense less attractive in comparison with Microsoft&#8217;s own ads (sorry, no link love from here).  </p>
<p>Let&#8217;s look at Microsoft&#8217;s <a href="http://www.webmasterworld.com/msn_microsoft_search/3424476-2-30.htm#msg3442263">misleading statement</a>:</p>
<blockquote><p>The traffic you are seeing is part of a quality check we run on selected pages. While we work on addressing your conerns, we would request that you do not actively block the IP addreses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index.</p></blockquote>
<ul>
<li>That&#8217;s not traffic, <a href="http://ekstreme.com/thingsofsorts/blogging/yell-if-microsofts-livecom-spammed-you-too">that&#8217;s bot activity</a>: These hits come within seconds of being indexed by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it&#8217;s genuine) and within a few seconds, the very same page is requested with a live.com search result URL as referer by the MSN spam bot faking a human visitor.</li>
<li>If that&#8217;s really a quality check to detect cloaking, that&#8217;s more than just lame. The IP addresses don&#8217;t change, the bogus bot uses a static user agent name, and there are other footprints which allow every cloaking script out there to serve this sneaky bot the exact same spider fodder that MSNbot got seconds before. This flawed technique might catch poor man&#8217;s cloaking every once in a while, but it can&#8217;t fool savvy search marketers.</li>
<li>The FUD &#8220;could prevent your site from being included in the Live Search index&#8221; is laughable, because in most niches MSN search traffic is not existent.</li>
</ul>
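<p>The pattern described in the first bullet is easy to check in your own logs. The following is a minimal sketch, assuming you have already parsed each hit into a timestamp, path, user agent and referrer; the 10 second window, the function name and the sample hits are my assumptions, not measured values.</p>

```python
from datetime import datetime, timedelta


def find_follow_up_hits(hits, window_seconds=10):
    """Find Live Search 'visitors' that show up right after an msnbot fetch.

    hits: iterable of (timestamp, path, user_agent, referrer) tuples.
    """
    window = timedelta(seconds=window_seconds)
    last_crawl = {}  # path -> time of the most recent msnbot request
    suspects = []
    for ts, path, agent, referrer in sorted(hits):
        if "msnbot" in agent.lower():
            last_crawl[path] = ts
        elif referrer.startswith("http://search.live.com/"):
            crawled = last_crawl.get(path)
            if crawled is not None and ts - crawled <= window:
                suspects.append((ts, path, referrer))
    return suspects


# Hypothetical log entries: a crawl, a follow-up 4 seconds later,
# and a visit two hours later that should not be flagged.
t0 = datetime(2007, 11, 16, 13, 0, 0)
hits = [
    (t0, "/tutorial/", "msnbot/1.0 (+http://search.msn.com/msnbot.htm)", ""),
    (t0 + timedelta(seconds=4), "/tutorial/",
     "Mozilla/4.0 (compatible; MSIE 7.0)",
     "http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP"),
    (t0 + timedelta(hours=2), "/tutorial/",
     "Mozilla/4.0 (compatible; MSIE 7.0)",
     "http://search.live.com/results.aspx?q=database+tutorial"),
]
for ts, path, referrer in find_follow_up_hits(hits):
    print(ts, path, referrer)
```

<p>Only the request that lands seconds after the crawler is reported; a later visit with a Live Search referrer is left alone, because it could be one of the rare human visitors.</p>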
<p>All major search engines, including MSN, promise that they obey the robots exclusion standard. Obeying robots.txt is the holy grail of search engine crawling. A search engine that ignores robots.txt and other <a href="http://sebastians-pamphlets.com/links/categories/?cat=crawler-directives">normed crawler directives</a> cannot be trusted. The crappy MSN bot doesn&#8217;t even bother to read robots.txt, so there&#8217;s no chance to block it with standardized methods. Only <a href="http://www.reubenyau.com/live-search-referrer-spamming/">IP blocking</a> <a href="http://www.kichus.in/2007/11/14/msn-live-sending-referral-spams/">can keep it out</a>, but then it still seems to download ads from Google&#8217;s AdSense servers by executing the JavaScript code that the MSN crawler gathered before (not obeying <a href="http://sebastians-pamphlets.com/about-noindex-crawler-directives-in-robots-txt/">Google&#8217;s AdSense robots.txt</a> as well).</p>
<p>This unethical spam bot downloading all images, external CSS and JS files, and whatnot also burns bandwidth. That&#8217;s plain theft. </p>
<p>Since this method cannot detect (most) cloaking, and the so called &#8220;search quality control bot&#8221; doesn&#8217;t stop visiting sites which obviously do not cloak, it is a sneaky marketing tool. Whether or not Microsoft Live Search tries to promote cyberspace porn and on-line viagra shops plays no role. Even spamming with safe-at-work keywords is evil. Do these assclowns really believe that such unethical activities will increase the usage of their tiny and pretty unpopular search engine? Of course they do, otherwise they would have shut down the spam bot months ago. </p>
<p>Dear reader, please tell me: what do you think of a search engine that steals (bandwidth and AdSense revenue), lies, spams away, and is not clever enough to stop their criminal activities when they&#8217;re caught?</p>
<p>Recently a <a href="http://searchengineland.com/071115-085125.php">Live Search rep whined</a> in an <a href="http://www.seomoz.org/blog/an-interview-with-livecoms-eytan-seidman">interview</a> because so many robots.txt files out there block their crawler:</p>
<blockquote><p>One thing that we noticed for example while mining our logs is that there are still a fair number of sites that specifically only allow Googlebot and do not allow MSNBot.</p></blockquote>
<p>There&#8217;s a suitable answer, though. Update your robots.txt:<code style="font-size:18pt;"><br />
<b><br />
User-agent: MSNbot<br />
Disallow: /</b></code></p>
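<p>Before you upload that, you can check how a well-behaved crawler would read it. This is a small sketch using Python&#8217;s standard <code>urllib.robotparser</code>; the helper name and the example URL are mine. It only shows what a compliant parser makes of the file, and says nothing about bots that never request robots.txt.</p>

```python
from urllib.robotparser import RobotFileParser

# The two directives from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: MSNbot
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("msnbot", "http://example.com/any-page/"))     # False
print(is_allowed("Googlebot", "http://example.com/any-page/"))  # True
```

<p>The user agent match is case-insensitive in this parser, so &#8220;MSNbot&#8221; in the file covers &#8220;msnbot&#8221; in the request, while other crawlers stay unaffected.</p>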
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/microsoft-live-search-the-downfall-of-a-tiny-search-engine/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Act out your sophisticated affiliate link paranoia</title>
		<link>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/</link>
		<comments>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#comments</comments>
		<pubDate>Tue, 13 Nov 2007 07:09:30 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[E-Commerce]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/</guid>
		<description><![CDATA[My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google&#8217;s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:
Links purchased for advertising should be designated [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/paranoid-affiliate-link.png" width="250" height="231" border="0" align="right" style="margin-left:4px;" alt="GOOD: paranoid affiliate link" title="Paranoid on affiliate links" />My recent posts on <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">managing affiliate links</a> and <a href="http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/">nofollow cloaking</a> <a href="http://sebastians-pamphlets.com/text-link-broker-woes-smart-paid-links-sniffers-fromgoogle/">paid links</a> led to so many reactions from my readers that I thought explaining possible protection levels could make sense. <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66736">Google&#8217;s request to condomize affiliate links</a> is a bit, well, thin when it comes to technical tips and tricks:<br />
<blockquote>Links purchased for advertising should be designated as such. This can be done in several ways, such as:<br />
    * Adding a rel=&#8221;nofollow&#8221; attribute to the &lt;a&gt; tag<br />
    * Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file</p></blockquote>
<p> Also, Google doesn&#8217;t define <a href="http://sebastians-pamphlets.com/links/categories/?cat=paid-links">paid links</a> that clearly, so try this <a href="http://www.stonetemple.com/blog/?p=196">paid link definition</a> instead before you read on. <b>Here is my linking guide for the paranoid affiliate marketer.</b></p>
<p><a href="http://www.google.com/support/webmasters/bin/answer.py?answer=76465">Google recommends hiding of any content provided by affiliate programs from their crawlers</a>. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant&#8217;s data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program&#8217;s sales tools page count as duplicate content (snippet) in its worst occurrence.</p>
<p>Pasting code copied from a merchant&#8217;s site into a page&#8217;s or template&#8217;s HTML is not exactly a smart way to put ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you&#8217;re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts. </p>
<p>There are tons of such scripts out there, so I don&#8217;t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so when you&#8217;ve a WordPress blog you can adapt them with ease. </p>
<h3>Cover your ass with a linking policy</h3>
<p>Googlers as well as hired guns do review Web sites for violations of Google&#8217;s guidelines, also competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) <a href="http://sebastians-pamphlets.com/links/full-disclosure/">full disclosure of your linking attitude</a> can help to pass a human review by search engine staff. By the way, having a <a href="http://sebastians-pamphlets.com/about/policies/#commenting">policy for dofollowed blog comments</a> is also a good idea.</p>
<p>Since crawler directives like <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">link condoms</a> are for search engines (only), and those pay attention to your source code and hints addressing search engines like <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, you should leave a note <a href="http://sebastians-pamphlets.com/robots.txt" rel="nofollow nocontent">there</a> too, look into the source of this page for an example. <a onclick="showContent('sample-code-disclosure'); this.style.display = 'none'; return false;">View sample HTML comment.</a> <b id="sample-code-disclosure" style="display:none;">Sample HTML comment: <code>&lt;&#33;--</code>This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.<code>--&gt;</code></b> </p>
<h3>Block crawlers from your propaganda scripts</h3>
<p>Put all your stuff related to advertising (scripts, images, movies&#8230;) in a subdirectory and disallow search engine crawling in your <a href="http://www.smart-it-consulting.com/article.htm?node=140&#038;page=46">/robots.txt</a> file: <code><br />
User-agent: *<br />
Disallow: /propaganda/ </code><br />
Of course you&#8217;ll use an innocuous name like &#8220;gnisitrevda&#8221; for this folder, which lacks a default document and can&#8217;t get browsed because you&#8217;ve a <code><br />
Options -Indexes </code><br />
statement in your .htaccess file. (Watch out, Google knows what &#8220;gnisitrevda&#8221; means, so be creative or cryptic.)</p>
<p>Crawlers sent out by major search engines do respect robots.txt, hence it&#8217;s guaranteed that regular spiders don&#8217;t fetch anything from this folder. As long as you don&#8217;t cheat too much, you&#8217;re not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn&#8217;t protect you from surfing search engine staff, but I won&#8217;t tell you things you&#8217;d better hide from Matt&#8217;s gang.</p>
<h3>Detect search engine crawlers</h3>
<p>Basically there are three common methods to detect requests by search engine crawlers.
<ol>
<li>Testing the user agent name (HTTP_USER_AGENT) for strings like &#8220;Googlebot&#8221;, &#8220;Slurp&#8221;, &#8220;MSNbot&#8221; or so which identify crawlers. That&#8217;s easy to spoof, for example <a href="http://sebastians-pamphlets.com/referrer-spoofing-with-prefbar-341/">PrefBar for FireFox</a> lets you choose from a list of user agents.</li>
<li>Checking the user agent name, and only when it indicates a crawler, verifying the requestor&#8217;s IP address with a reverse lookup, respectively against a cache of verified crawler IP addresses and host names.</li>
<li>Maintaining a list of all search engine crawler IP addresses known to man,  checking the requestor&#8217;s IP (REMOTE_ADDR) against this list. (That alone isn&#8217;t bullet-proof, but I&#8217;m not going to write a tutorial on industrial-strength <strike>cloaking</strike> IP delivery, I leave that to the real <a href="http://fantomaster.com/fantomNews">experts</a>.)</li>
</ol>
<p>For our purposes we use method 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won&#8217;t activate your account without testing your links. When crawlers try to follow affiliate links on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have impact on the merchants&#8217; search engine rankings, and that&#8217;s evil in Google&#8217;s eyes.  </p>
<p>We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name, if the user agent name identifies a major search engine&#8217;s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor&#8217;s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does a primitive caching in a flat file, so that once a crawler was verified on its very first content request, it can be detected from this cache to avoid pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn&#8217;t match a crawler name. </p>
<p><b id="grab-php-code-check-crawler"><a onclick="showContent('php-code-check-crawler'); return false;">View</a>|<a onclick="hideContent('php-code-check-crawler'); return false;">hide</a> PHP code.</b> (If you&#8217;ve disabled JavaScript you can&#8217;t grab the PHP source code!)<br />
<code id="php-code-check-crawler" style="display:none;"><b><br />
// file system path to crawler IP log, scripts etc.,<br />
// without trailing slash:<br />
$includePath   = $_SERVER[&quot;DOCUMENT_ROOT&quot;] . &quot;/propaganda&quot;;<br />
// edit &quot;propaganda&quot; and CHMOD 777 the directory !<br />
// file names:<br />
$crawlerIps  = $includePath .&quot;/crawler-ip-addresses.txt&quot;;<br />
// misc. stuff:<br />
$timestamp     = date(&quot;Y-m-d H:i:s&quot;);<br />
$ipAddy        = $_SERVER[&quot;REMOTE_ADDR&quot;];<br />
$referrer      = $_SERVER[&quot;HTTP_REFERER&quot;];<br />
$userAgent     = $_SERVER[&quot;HTTP_USER_AGENT&quot;];<br />
$requestUri    = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$queryString   = $_SERVER[&quot;QUERY_STRING&quot;];<br />
$isCrawler     = FALSE;<br />
$crawlerServer = &quot;&quot;;<br />
$delimiter     = &quot;|&quot;;<br />
$idString      = &quot;&quot;;<br />
if (empty($includePath)) {<br />
   $includePath = $_SERVER[&quot;DOCUMENT_ROOT&quot;] . &quot;/propaganda&quot;; // CHMOD 777<br />
}<br />
// Write a file to disk<br />
if (!function_exists(&quot;writeLocalFile&quot;)) {<br />
function writeLocalFile ($file, $content) {<br />
   if (!is_writable($file)) {<br />
      $lok = @chmod ( $file, 0777 );<br />
   }<br />
   // file_put_contents() not avail in PHP 4.3x<br />
   $fp = @fopen(&quot;$file&quot;,&quot;w+&quot;);<br />
   if ($fp) {<br />
       $lOk = @fwrite($fp, $content, strlen($content));<br />
       @fclose($fp);<br />
       // make sure file may get overwritten or removed later on<br />
       $lok = @chmod ( $file, 0777 );<br />
       return TRUE;<br />
   } // endif $fp<br />
   return FALSE;<br />
} // end function writeLocalFile<br />
}<br />
if (!function_exists(&quot;checkCrawlerUA&quot;)) {<br />
function checkCrawlerUA () {<br />
    GLOBAL $userAgent;<br />
    GLOBAL $crawlerServer;<br />
    $crawlerServer = &quot;&quot;;<br />
    $crawlers  = array(&quot;Googlebot&quot;,&quot;Mediapartners&quot;,&quot;Slurp&quot;,&quot;MSNbot&quot;,&quot;Ask&quot;,&quot;Teoma&quot;);<br />
    foreach ($crawlers as $crawler) {<br />
        if (stristr($userAgent,$crawler)) {<br />
            if (stristr($crawler,&quot;Googlebot&quot;) ||<br />
                stristr($crawler,&quot;Mediapartners&quot;)) {<br />
                $crawlerServer = &quot;.googlebot.com&quot;;<br />
            } // Google<br />
            if (stristr($crawler,&quot;Slurp&quot;)) {<br />
                $crawlerServer = &quot;.crawl.yahoo.net&quot;;<br />
            } // Yahoo<br />
            if (stristr($crawler,&quot;MSNbot&quot;)) {<br />
                $crawlerServer = &quot;.search.live.com&quot;;<br />
            } // MSN/Live<br />
            if (stristr($crawler,&quot;Ask&quot;) ||<br />
                stristr($crawler,&quot;Teoma&quot;)) {<br />
                $crawlerServer = &quot;.ask.com&quot;;<br />
            } // Ask<br />
        }<br />
    } // foreach crawlers<br />
    if (!empty($crawlerServer)) return TRUE;<br />
    return FALSE;<br />
} // end function checkCrawlerUA<br />
}<br />
if (!function_exists(&quot;checkCrawlerIP&quot;)) {<br />
function checkCrawlerIP ($idString) {<br />
    GLOBAL $ipAddy;<br />
    GLOBAL $crawlerIps;<br />
    GLOBAL $delimiter;<br />
    GLOBAL $timestamp;<br />
    GLOBAL $userAgent;<br />
    GLOBAL $crawlerServer;<br />
    $isCrawler = checkCrawlerUA();<br />
    if ($isCrawler === FALSE)  return FALSE;<br />
    if (empty($crawlerServer)) return FALSE;<br />
//<br />
// DEBUG: $crawlerServer = &quot;.national-net.com&quot;;<br />
// Use your ISPs host name for testing with a spoofed user agent name<br />
//<br />
    $crawlerIpsContent = @file_get_contents($crawlerIps);<br />
    if (!empty($crawlerIpsContent)) {<br />
        if (stristr($crawlerIpsContent, &quot;\n$ipAddy$delimiter&quot;)) {<br />
            return TRUE;<br />
        }<br />
    }<br />
    $crawlerHost = @gethostbyaddr($ipAddy);<br />
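    // the host name must END with the crawler domain, a match elsewhere in the name is spoofable<br />
    if (strtolower(substr($crawlerHost, -strlen($crawlerServer))) != strtolower($crawlerServer)) {<br />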
        return FALSE;<br />
    }<br />
    if (&quot;$crawlerHost&quot; == &quot;$ipAddy&quot;) {<br />
        return FALSE;<br />
    }<br />
    $ipAddyRev = @gethostbyname($crawlerHost);<br />
    if (&quot;$ipAddyRev&quot; != &quot;$ipAddy&quot;) {<br />
        return FALSE;<br />
    }<br />
    $crawlerIpsContent .= &quot;\n&quot; .$ipAddy .$delimiter<br />
                          .$timestamp   .$delimiter<br />
                          .$crawlerHost .$delimiter<br />
                          .$idString    .$delimiter<br />
                          .$userAgent   .$delimiter;<br />
    $lOk = writeLocalFile ($crawlerIps, $crawlerIpsContent);<br />
    return TRUE;<br />
} // end function checkCrawlerIP<br />
}<br />
</b></code><br />
Grab and implement the PHP source, then you can code statements like <code><br />
$isSpider = checkCrawlerUA ();<br />
...<br />
if ($isSpider) {<br />
    $relAttribute = &quot; rel=\&quot;nofollow\&quot; &quot;;<br />
}<br />
...<br />
$affLink = &quot;&lt;a href=\&quot;$affUrl\&quot; $relAttribute&gt;call for action&lt;/a&gt;&quot;;<br />
</code><br />
or <code><br />
$isSpider = checkCrawlerIP ($sponsorUrl);<br />
...<br />
if ($isSpider) {<br />
    // don't redirect to the sponsor, return a 403 or 410 instead<br />
}</code><br />
More on that later.</p>
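<p>For readers who want to test the verification logic without hitting DNS, here is a minimal sketch of the same forward-confirmed reverse lookup. The helper names <code>hostEndsWith()</code> and <code>verifyCrawlerHost()</code> are hypothetical and not part of the script above, and the injectable resolver parameters are an assumption made purely for testability. It also anchors the domain check at the end of the host name.</p>

```php
<?php
// Hypothetical helper, not part of the original script: TRUE when $host
// ends with the expected crawler domain suffix, e.g. ".googlebot.com".
function hostEndsWith($host, $suffix) {
    $host   = strtolower(rtrim((string) $host, "."));
    $suffix = strtolower($suffix);
    if ($suffix === "" || strlen($host) <= strlen($suffix)) {
        return FALSE;
    }
    return substr($host, -strlen($suffix)) === $suffix;
}

// Forward-confirmed reverse DNS. Both resolvers are parameters so the logic
// can be exercised offline; they default to PHP's gethostbyaddr() and
// gethostbyname().
function verifyCrawlerHost($ip, $suffix, $reverse = "gethostbyaddr", $forward = "gethostbyname") {
    $host = call_user_func($reverse, $ip);
    if (empty($host) || $host === $ip) {
        return FALSE; // lookup failed or no PTR record
    }
    if (!hostEndsWith($host, $suffix)) {
        return FALSE; // host name is not inside the crawler's domain
    }
    // the host name must resolve back to the very same IP address
    return call_user_func($forward, $host) === $ip;
}
```

<p>Called as <code>verifyCrawlerHost($ipAddy, $crawlerServer)</code> it performs real lookups, so the flat file cache from above still makes sense in front of it.</p>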
<h3>Don&#8217;t deliver your advertising to search engine crawlers</h3>
<p>It&#8217;s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense&#8217;s script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can&#8217;t guarantee indexing, because Google doesn&#8217;t index shitloads of templated pages which hide duplicate content as well as ads from crawling, without carrying a single piece of somewhat compelling content.</p>
<p>Here is how you could output a totally uncrawlable banner ad: <code><br />
...<br />
$isSpider = checkCrawlerIP ($_SERVER[&quot;PHP_SELF&quot;]);<br />
...<br />
print &quot;&lt;div class=\&quot;css-class-sidebar robots-nocontent\&quot;&gt;&quot;;<br />
// output RSS buttons or so<br />
if (!$isSpider) {<br />
    print &quot;&lt;script type=\&quot;text/javascript\&quot; src=\&quot;http://sebastians-pamphlets.com/propaganda/output.js.php? adName=seobook&#038;adServed=banner\&quot;&gt;&lt;/script&gt;&quot;;<br />
    ...<br />
}<br />
...<br />
print &quot;&lt;/div&gt;\n&quot;;<br />
...</code><br />
Let&#8217;s look at the code above. First we detect crawlers &#8220;without doubt&#8221; (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-&#8217;.crawl.yahoo.net&#8217; host but another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I&#8217;ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler-UA).</p>
<p>Currently the <em>robots-nocontent</em>&nbsp; class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn&#8217;t conflict with class names used with your CSS, because each X/HTML element can have an unlimited list of space delimited class names. Like Google&#8217;s section targeting that&#8217;s a <a href="http://sebastians-pamphlets.com/yahoo-search-going-to-torture-webmasters/">crappy crawler directive</a>, though. However, it doesn&#8217;t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), &#8220;buy now&#8221;/&#8221;view basket&#8221; links or references to TOS pages and alike, templated text like terms of delivery (but not the street address provided for local search) &#8230; and of course ads.</p>
<p>Ads aren&#8217;t outputted when a crawler requests a page. Of course that&#8217;s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren&#8217;t relevant for indexing that&#8217;s not <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66355">deceitful cloaking</a> in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster&#8217;s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers. </p>
<p>Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren&#8217;t penalized. Since Web robots can&#8217;t pull out the plastic, there&#8217;s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Example from Google</a>: &#8220;Allow search bots to crawl your sites without session IDs or arguments that track their path through the site&#8221; is a call for <a href="http://www.smart-it-consulting.com/article.htm?node=148&#038;page=103">search engine friendly URL cloaking</a>.) </p>
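<p>To illustrate the URL cloaking mentioned above, here is a minimal sketch that strips tracking variables from a URL. The function name <code>stripTrackingVars()</code> and the variable names in the usage note are hypothetical, and it assumes PHP 5 or later because of <code>http_build_query()</code>.</p>

```php
<?php
// Hypothetical helper: removes the listed query string variables from a URL
// and rebuilds it from its remaining parts.
function stripTrackingVars($url, $trackingVars) {
    $parts = parse_url($url);
    if ($parts === FALSE || empty($parts["query"])) {
        return $url; // nothing to strip
    }
    parse_str($parts["query"], $vars);
    foreach ($trackingVars as $name) {
        unset($vars[$name]);
    }
    $clean = "";
    if (isset($parts["scheme"])) $clean .= $parts["scheme"] . "://";
    if (isset($parts["host"]))   $clean .= $parts["host"];
    if (isset($parts["port"]))   $clean .= ":" . $parts["port"];
    if (isset($parts["path"]))   $clean .= $parts["path"];
    if (!empty($vars)) {
        $clean .= "?" . http_build_query($vars);
    }
    if (isset($parts["fragment"])) $clean .= "#" . $parts["fragment"];
    return $clean;
}
```

<p>One possible use, as an assumption rather than a recipe: when a verified crawler requests a URL that differs from <code>stripTrackingVars($requestUri, array(&quot;PHPSESSID&quot;, &quot;ref&quot;))</code>, answer with a 301 redirect to the cleaned URL.</p>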
<h3>Is hiding ads from crawlers &#8220;safe with Google&#8221; or not?</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/uncloaked-affiliate-link.png" width="200" height="188" border="0" align="right" style="margin-left:4px;" alt="BAD: uncloaked affiliate link" title="Uncloaked affiliate link" />Cloaking ads away is a double edged sword from a search engine&#8217;s perspective. Way too strictly interpreted that&#8217;s against the cloaking rule which states &#8220;don&#8217;t show crawlers other content than humans&#8221;, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google&#8217;s section targeting or Yahoo&#8217;s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don&#8217;t count them as on-the-page contents.</p>
<p>Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don&#8217;t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify <acronym title="Made For AdSense/Advertising">MFA</acronym> pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings. </p>
<p>Since in theory the contents of JavaScript ads aren&#8217;t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can&#8217;t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe calculates even ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can&#8217;t argue seriously that hiding such tiny signals &#8211;which they use for the sole purposes of possible downranks&#8211; is against their guidelines.</p>
<p>For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn&#8217;t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.</p>
<p><b>Dear search engines, if it does not count, then you cannot demand to keep it crawlable.</b> Well, a few super mega white hat <acronym title="Dougie ...">trolls</acronym> might disagree, and depending on the implementation on individual sites maybe hiding ads isn&#8217;t totally riskless in any case, so decide yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don&#8217;t try to hide the fact that I run ads on this blog.</p>
<p>Usually I don&#8217;t argue with fair vs. unfair, because we talk about <strike>war</strike> business here, which means that anything goes. However, Google does everything to talk the whole Internet into <strike>obfuscating</strike> disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> (Of course that&#8217;s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won&#8217;t care about bad press. So please don&#8217;t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)</p>
<h3>Outputting ads with JavaScript, preferably in iFrames</h3>
<p>Delivering adverts with JavaScript does not mean that one can&#8217;t use server sided scripting to adjust them dynamically. With content management systems it&#8217;s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn&#8217;t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval&#8217;ing the PHP code).</p>
<p>Let&#8217;s see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:<br />
<script type="text/javascript" src="http://sebastians-pamphlets.com/ads/output.js.php?adName=seobook&#038;adServed=banner"></script><br />
And here is the JS call of the PHP script: <code><br />
&lt;script type=&quot;text/javascript&quot; src=&quot;http://sebastians-pamphlets.com/propaganda/output.js.php? adName=seobook&#038;adServed=banner&quot;&gt;&lt;/script&gt;</code></p>
<p>The PHP script <code>/propaganda/output.js.php</code> evaluates the query string to pull the requested ad&#8217;s components. In case it&#8217;s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor&#8217;s location and whatnot, but that&#8217;s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad&#8217;s links. Once the HTML code is ready, it outputs a JavaScript statement: <code><br />
document.write(&#39;&lt;a href=&quot;http://sebastians-pamphlets.com/propaganda/router.php? adName=seobook&#038;adServed=banner&quot; title=&quot;DOWNLOAD THE BOOK ON SEO!&quot;&gt;&lt;img src=&quot;http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif&quot; width=&quot;468&quot; height=&quot;60&quot; border=&quot;0&quot; alt=&quot;The only current book on SEO&quot; title=&quot;The only current book on SEO&quot;  /&gt;&lt;/a&gt;&#39;); </code> which the browser executes within the <code>script</code> tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag. </p>
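<p>The quoting in such a statement is easy to get wrong by hand. As a minimal sketch, assuming PHP 5.2 or later for <code>json_encode()</code>, a hypothetical helper <code>buildDocumentWrite()</code> could produce the JavaScript statement from any HTML fragment. This is not the code running in <code>output.js.php</code>, just one way to do the escaping.</p>

```php
<?php
// Hypothetical helper: wraps an HTML fragment in a document.write() call.
// json_encode() returns a double-quoted JavaScript string literal and
// escapes double quotes, backslashes, line breaks and forward slashes,
// so a closing tag like </a> becomes <\/a> inside the literal.
function buildDocumentWrite($html) {
    return "document.write(" . json_encode($html) . ");";
}
```

<p>Because the generated literal is double-quoted, single quotes inside the ad&#8217;s HTML don&#8217;t need special treatment here.</p>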
<p>Matt Cutts <a href="http://www.stonetemple.com/articles/interview-matt-cutts.shtml">said</a> that <a href="http://www.mattcutts.com/blog/bot-obedience-herding-googlebot/#comment-45561">JavaScript links don&#8217;t prevent Googlebot from crawling</a>, but that <a href="http://www.seomoz.org/blog/the-paid-links-debate-rages-on-ses-san-jose-2007">those links</a> <a href="http://www.mattcutts.com/blog/how-to-report-paid-links/#comment-101482">don&#8217;t count for rankings</a> (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can&#8217;t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it&#8217;s fetchable by crawlers, so I wouldn&#8217;t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That&#8217;s why I use rel-nofollow &#8211;on crawler requests&#8211; even in JS ads.</p>
<p>Change your user agent name to Googlebot or so, install <a href="http://www.mattcutts.com/blog/seeing-nofollow-links/">Matt&#8217;s show nofollow hack</a> or something similar, and you&#8217;ll see that the affiliate-URL gets nofollow&#8217;ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can&#8217;t offend my readers with nofollow&#8217;ed links (although I don&#8217;t care about crawler spoofing, actually that&#8217;s a good procedure to let advertisers check out my linking attitude).</p>
<p>We look at the affiliate URL from the code above later on, first let&#8217;s discuss other ways to make ads more search engine friendly. Search engines don&#8217;t count pages displayed in iFrames as on-page contents, especially not when the iFrame&#8217;s content is hosted on another domain. Here is an example straight from the horse&#8217;s mouth: <code><br />
&lt;iframe name=&quot;google_ads_frame&quot; src=&quot;http://pagead2.googlesyndication.com/pagead/ads? very-long-and-ugly-query-string&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; vspace=&quot;0&quot; hspace=&quot;0&quot; allowtransparency=&quot;true&quot; frameborder=&quot;0&quot; height=&quot;90&quot; scrolling=&quot;no&quot; width=&quot;728&quot;&gt;&lt;/iframe&gt;</code> In a noframes tag we could put a static ad for surfers using browsers which don&#8217;t support frames/iFrames. </p>
<p>If for some reason you don&#8217;t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to anybody, and just browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael&#8217;s <a href="http://www.bad-neighborhood.com/htmlhashing.htm">Full Text/HTML Obfuscator Tool</a> (hat tip to <a href="http://www.seo-scoop.com/2007/09/13/new-tool-to-hide-stuff/">Donna</a>).</p>
<h3>Always redirect to affiliate URLs</h3>
<p>There&#8217;s absolutely no point in using ugly affiliate URLs on your pages. Actually, that&#8217;s the last thing you want to do for various reasons.
<ul>
<li>For example, affiliate URLs as well as source codes can change, and you don&#8217;t want to edit tons of pages if that happens.</li>
<li>When an affiliate program doesn&#8217;t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you&#8217;d replace outdated ads completely with one mouse click or so.</li>
<li>Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own &#8211;often confusing&#8211; layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don&#8217;t track your outgoing traffic, some sponsors will cheat and you can&#8217;t prove your gut feelings.</li>
<li>Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don&#8217;t smell like affiliate URLs.</li>
<li><b>&#8230;</b></li>
</ul>
<p>When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really <b>manage your affiliate links</b>. There are many more good reasons for utilizing ad-servers, for example smart search engines which might think that your advertising is overwhelming.</p>
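<p>What &#8220;managing&#8221; can mean in practice: a minimal sketch of a lookup table that maps an ad name to its sponsor URL and falls back to another ad once a promotion has expired. The array layout, the field names and <code>resolveSponsorUrl()</code> are all hypothetical, a real ad server will store more than that.</p>

```php
<?php
// Hypothetical ad registry lookup. Each entry may carry an "expires" date
// (Y-m-d) and the name of a "fallback" ad to use once it has expired.
function resolveSponsorUrl($ads, $adName, $today) {
    $seen = array(); // guards against circular fallbacks
    while (isset($ads[$adName]) && !isset($seen[$adName])) {
        $seen[$adName] = TRUE;
        $ad = $ads[$adName];
        // Y-m-d dates sort correctly as plain strings
        if (empty($ad["expires"]) || strcmp($ad["expires"], $today) >= 0) {
            return $ad["url"];
        }
        $adName = isset($ad["fallback"]) ? $ad["fallback"] : "";
    }
    return NULL; // unknown ad, or nothing left to fall back to
}
```

<p>With a table like this, replacing a dead affiliate program is one edit in one place instead of tons of edited pages.</p>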
<p>Affiliate links provide great footprints. Unique URL parts respectively <b>query string variable names</b> gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The <b>values</b> identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index respectively delivering minus points to the query engine. Also, that doesn&#8217;t mean that affiliate URLs are the one and only trackable footmark Google relies on. But that&#8217;s one trackable footprint you can avoid to some degree. </p>
<p>If the redirect-script&#8217;s location is on the same server (in fact it&#8217;s not thanks to symlinks) and not named &#8220;adserver&#8221; or so, chances are that a heuristic check won&#8217;t identify the link&#8217;s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren&#8217;t crawlable with parameters, or similar &#8220;legit&#8221; stuff. However, you can&#8217;t fool smart algos forever, but if you&#8217;ve a good reason to hide ads every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine&#8217;s point of view, and users might agree on this).</p>
<p>Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because it doesn&#8217;t benefit you (<a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">usually</a>). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant&#8217;s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect, if they by accident found an affiliate link without link condom.</p>
<h3>The behavior of an adserver URL masking an affiliate link</h3>
<p>Let&#8217;s look at the redirect-script&#8217;s URL from my code example above:<br />
<a href="http://sebastians-pamphlets.com/ads/router.php?adName=seobook&#038;adServed=banner">/propaganda/router.php?adName=seobook&#038;adServed=banner</a><br />
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert. </p>
<p>Now that we&#8217;ve covered the statistical requirements, router.php calls the checkCrawlerIP() function setting $isSpider to TRUE only when both the user agent as well as the host name of the requestor&#8217;s IP address identify a search engine crawler, and a forward DNS lookup of that host name returns the requestor&#8217;s IP addy.</p>
<p>If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor&#8217;s landing page: <code><br />
$sponsorUrl      = &quot;http://www.seobook.com/262.html&quot;;<br />
$requestProtocol = $_SERVER[&quot;SERVER_PROTOCOL&quot;];<br />
$protocolArr     = explode(&quot;/&quot;,$requestProtocol);<br />
$protocolName    = trim($protocolArr[0]);<br />
$protocolVersion = trim($protocolArr[1]);<br />
if (stristr($protocolName,&quot;HTTP&quot;)<br />
    &#038;&#038; strtolower($protocolVersion) > &quot;1.0&quot; ) {<br />
    $httpStatusCode = 307;<br />
}<br />
else {<br />
    $httpStatusCode = 302;<br />
}<br />
$httpStatusLine = &quot;$requestProtocol $httpStatusCode Temporary Redirect&quot;;<br />
@header($httpStatusLine, TRUE, $httpStatusCode);<br />
@header(&quot;Location: $sponsorUrl&quot;);<br />
exit;</code><br />
A 307 redirect avoids caching issues, because user agents must not cache a 307 response unless a Cache-Control or Expires header explicitly allows it. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. In practice some user agents remember the locations provided by 302 redirects, so when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can&#8217;t use the 301 response code, because 301 redirects are cacheable by default. (<a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/">More information on HTTP redirects</a>.)</p>
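<p>If you don&#8217;t want to rely on the user agent&#8217;s default caching behavior, you can state your intent explicitly. The sketch below is an assumption, not part of the original router.php: the helper name buildRedirectHeaders() is made up, and it simply adds Cache-Control and Expires headers to the same 307/302 decision: <code><br />
&lt;?php<br />
// Hypothetical helper: returns the header lines for a temporary,<br />
// explicitly uncacheable redirect (307 for HTTP/1.1 and up, else 302).<br />
function buildRedirectHeaders ($serverProtocol, $targetUrl) {<br />
    $parts   = explode(&quot;/&quot;, $serverProtocol);<br />
    $name    = strtoupper(trim($parts[0]));<br />
    $version = isset($parts[1]) ? trim($parts[1]) : &quot;1.0&quot;;<br />
    $isHttp11 = ($name == &quot;HTTP&quot; &#038;&#038; version_compare($version, &quot;1.1&quot;, &quot;&gt;=&quot;));<br />
    $status  = $isHttp11 ? 307 : 302;<br />
    $reason  = $isHttp11 ? &quot;Temporary Redirect&quot; : &quot;Moved Temporarily&quot;;<br />
    return array(<br />
        &quot;$serverProtocol $status $reason&quot;,<br />
        // HTTP/1.1 user agents honor Cache-Control, older ones the expired date<br />
        &quot;Cache-Control: no-store, no-cache, must-revalidate&quot;,<br />
        &quot;Expires: Thu, 01 Jan 1970 00:00:00 GMT&quot;,<br />
        &quot;Location: $targetUrl&quot;<br />
    );<br />
}<br />
// Usage inside the redirect script:<br />
// foreach (buildRedirectHeaders($_SERVER[&quot;SERVER_PROTOCOL&quot;], $sponsorUrl) as $line) header($line);<br />
// exit;<br />
?&gt;</code><br />
Returning the header lines instead of sending them right away keeps the decision testable without a Web server.</p>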
<p>If the requestor is a major search engine&#8217;s crawler, we perform the most brutal bounce back known to man: <code><br />
if ($isSpider) {<br />
    @header(&quot;HTTP/1.1 403 Sorry Crawlers Not Allowed&quot;, TRUE, 403);<br />
    @header(&quot;X-Robots-Tag: nofollow,noindex,noarchive&quot;);<br />
    exit;<br />
}</code><br />
The 403 response code translates to &#8220;kiss my ass and get the fuck outta here&#8221;. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn&#8217;t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine &#8220;forget this URL, don&#8217;t request it again&#8221;. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code protect you from URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.</p>
<p>If you don&#8217;t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they otherwise would count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don&#8217;t need to live in fear of penalties.</p>
<h3 id="recap-hide-afflinks">Recap</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/prospering-affiliate-link.png" width="200" height="200" border="0" align="right" style="margin-left:4px;" alt="NICE: prospering affiliate link" title="Prospering affiliate link" />Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.</p>
<p>Don&#8217;t deliver different (editorial) content to users and crawlers, but also don&#8217;t serve ads to crawlers. They just don&#8217;t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards that understand Ajax and are authorized to fill out and submit Web forms.</p>
<p>Your ads look plain ugly with dotted borders in firebrick, hence don&#8217;t apply rel=&quot;nofollow&quot; to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.</p>
<p>No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers. </p>
<p>Do not allow search engine crawlers to follow your affiliate links, paid links, or other votes disliked as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google&#8217;s job is not exactly funny.</p>
<p>I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you&#8217;re not paranoid, that does not mean Google will not chase you down. You really don&#8217;t need to be a so-called black hat to displease Google. Not knowing or not understanding <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Google&#8217;s 12 commandments</a> doesn&#8217;t prevent you from being spanked for sins you&#8217;ve never heard of. If you&#8217;re keen on Google&#8217;s nicely targeted traffic, better play by Google&#8217;s rules, at least on crawler requests.</p>
<p>Feel free to contribute your tips and tricks in the comments.</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Internet marketing is one big popularity contest, and that&#8217;s not a good thing</title>
		<link>http://sebastians-pamphlets.com/internet-marketing-became-one-big-popularity-contest/</link>
		<comments>http://sebastians-pamphlets.com/internet-marketing-became-one-big-popularity-contest/#comments</comments>
		<pubDate>Fri, 09 Nov 2007 07:07:01 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Internet Marketing]]></category>

		<category><![CDATA[Folks]]></category>

		<category><![CDATA[Search Quality]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/internet-marketing-became-one-big-popularity-contest/</guid>
		<description><![CDATA[This is a guest post by Tanner Christensen. 
What are you doing to make Internet marketing a better industry to be a part of? As it sits now: Internet marketing is one big popularity contest, and that&#8217;s not a good thing. Internet marketers are making it nearly impossible for the average person to find valuable [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/social-media-optimization.png" width="300" height="131" align="right" style="margin-left:4px;" alt="SMO - Social Media Optimization" title="Internet Marketing Popularity Contest"  /><i>This is a guest post by <a href="http://www.TannerSite.com">Tanner Christensen</a>.</i> </p>
<p>What are you doing to make Internet marketing a better industry to be a part of? As it sits now: Internet marketing is one big popularity contest, and that&#8217;s not a good thing. Internet marketers are making it nearly impossible for the average person to find valuable content.</p>
<p>The real online content providers - the websites that deserve all of your attention - are becoming harder and harder to discover because of Internet marketers like us. Though Internet marketers - both you and I - can&#8217;t really be blamed, our job is all about getting attention. The more attention we get for our website(s), the more popular our website(s) become, the more money we can make.</p>
<p>But because of the recent surge of interest in Internet marketing and search engine optimization, websites that focus on providing content - rather than getting attention - are being ignored. And because these content-focused websites are being cast into the shadows of attention-focused websites, they too are jumping on the Internet marketing popularity contest bandwagon.</p>
<p>Even though every webmaster and his or her mother is jumping on the bandwagon, it&#8217;s not accurate to say that Internet marketers are making <b>all</b> less-important, less-helpful, and less-useful websites more popular than really helpful websites, but there is definitely the possibility of real news and information being masked by attention-seeking content.</p>
<p>So what do we do? What do Internet marketers and search engine optimizers do to make sure that the Internet popularity contest doesn&#8217;t become a contest of lies and attention-seeking tactics; but rather a contest of quality, helpful, interesting, important, groundbreaking content?</p>
<p>The first step is to become a part of the online community. I&#8217;m not talking about the <a href="http://www.sphinn.com">Internet marketing community</a> - it&#8217;s biased in a lot of ways. I&#8217;m talking about the <a href="http://www.newsvine.com">real online communities</a>. Doing so will help create a universal feeling of online morals; or what&#8217;s good information and what is bad information.</p>
<p>And discovering where the real helpful and important websites are online will help Internet marketers such as ourselves learn where the websites we work with really should be ranked.</p>
<p>Sure, there are still those people who don&#8217;t care about quality of content and only care about the almighty dollar sign. But poor content will eventually catch up with them, when websites that really deserve attention in the online popularity contest are lost in the fold and the dollar sign loses its value.</p>
<p><i>Tanner is a Web specialist and designer who writes helpful, inspiring, and creative internet-related articles. A while ago I contributed an article to his blog <a href="http://internethunger.blogspot.com">Internet Hunger</a>: <a href="http://internethunger.blogspot.com/2007/09/generate-more-attention-that-is-more.html">The anatomy of a debunking post</a>. I think &#8220;can aggressive SMO tactics push crap over the long haul&#8221; would be an interesting and related discussion. I mean, search engines evolve too, not only in Web search, so kinda fair rankings of well linked crap as well as good stuff not on the SM radar might be possible to some extent.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/internet-marketing-became-one-big-popularity-contest/feed/</wfw:commentRss>
		</item>
		<item>
		<title>A pragmatic defence against Google&#8217;s anti paid links campaign</title>
		<link>http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/</link>
		<comments>http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/#comments</comments>
		<pubDate>Fri, 26 Oct 2007 14:39:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/</guid>
		<description><![CDATA[Google&#8217;s recent shot across the bows of a gazillion sites handling paid links, advertising, or internal cross links not compliant to Google&#8217;s imagination of a natural link is a call for action. Google&#8217;s message is clear: &#8220;condomize your commercial links or suffer&#8221; (from deducted toolbar PageRank, links without the ability to pass real PageRank and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://sebastians-pamphlets.com/google-pagerank-deductions-october-2007/">Google&#8217;s recent shot across the bows</a> of a gazillion sites handling <a href="http://sebastians-pamphlets.com/links/categories/?cat=paid-links">paid links</a>, <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">advertising</a>, or <a href="http://sebastians-pamphlets.com/links/categories/?cat=risky-linkage">internal cross links</a> not compliant to <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66736">Google&#8217;s imagination of a natural link</a> is a call for action. Google&#8217;s message is clear: &#8220;<a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">condomize</a> your commercial links or suffer&#8221; (from deducted toolbar PageRank, links without the ability to pass real PageRank and relevancy signals, or perhaps even penalties).</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/paid-links-evil-versus-good.png" width="250" height="116" align="right" style="margin-left:4px;" alt="Paid links: good versus evil" title="Paid links: Google versus Web" />Of course that&#8217;s somewhat evil, because applying <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">nofollow values</a> to all sorts of links is not exactly a natural thing to do; visitors don&#8217;t care about invisible link attributes and sometimes they&#8217;re even pissed when they get redirected to a URL not displayed in their status bar. Also, this requirement forces Webmasters to invest enormous efforts in code maintenance for the sole purpose of satisfying search engines. The argument &#8220;if Google doesn&#8217;t like these links, then they can discount them in their system, without bothering us&#8221; has its merits, but unfortunately that&#8217;s not the way Google&#8217;s cookie crumbles for various reasons. Hence let&#8217;s develop a pragmatic procedure to handle those links.</p>
<h3>The problem</h3>
<p>Google thinks that uncondomized paid links as well as commercial links to sponsors or affiliated entities aren&#8217;t natural, because the terms &#8220;sponsor|pay for review|advertising|my other site|sign-up|&#8230;&#8221; and &#8220;editorial vote&#8221; are not compatible in the sense of <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35769">Google&#8217;s guidelines</a>. This view of the Web&#8217;s linkage is pretty black vs. white.</p>
<p>Either you link out because a sponsor bought ads, or you don&#8217;t sell ads and link out for free because you honestly think your visitors will like a page. Links to sponsors without condom are black, links to sites you like and which you don&#8217;t label &#8220;sponsor&#8221; are white. </p>
<p>There&#8217;s nothing in between; gray areas like links to hand-picked sponsors on a page with a gazillion links count as black. Google doesn&#8217;t care whether or not your clean links actually pass a reasonable amount of PageRank to link destinations which buy ad space too; the sole possibility that those links <em>could</em> influence search results is enough to qualify you as sort of a link seller. </p>
<p>The same goes for paid reviews on blogs and whatnot, see for example <a href="http://andybeard.eu/2007/10/penalty-confirmed-but-i-dont-sell-pagerank.html">Andy&#8217;s problem</a> with his honest reviews which Google classifies as paid links, and of course all sorts of traffic deals, <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">affiliate links</a>, banner ads and stuff like that. </p>
<p>You don&#8217;t even need to label a clean link as advert or sponsored. If the link destination matches a domain in Google&#8217;s database of on-line advertisers, link buyers, e-commerce sites / merchants etcetera, or Google figures out that you link too much to affiliated sites or other sites you own or control, then your toolbar PageRank is toast and most probably your outgoing links will be penalized. Possibly these penalties have impact on your internal links too, which results in less PageRank landing on subsidiary pages. Less PageRank gathered by your landing pages means less crawling, less ranking, fewer SERP referrers, less revenue.</p>
<h3>The solution</h3>
<p>You&#8217;re absolutely right when you say that such search engine nitpicking should not force you to throw nofollow crap on your links like confetti. From your and my point of view condomizing links is wrong, but sometimes it&#8217;s better to pragmatically comply with such policies in order to stay in the game.</p>
<p>Although uncrawlable redirect scripts have advantages in some cases, the simplest procedure to condomize a link is the <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">rel-nofollow</a> <a href="http://sebastians-pamphlets.com/links/categories/?cat=microformats">microformat</a>. Here is an example of a googlified affiliate link:<code><br />
&lt;a href="http://sponsor.com/?affID=1" rel="nofollow"&gt;Sponsor&lt;/a&gt;</code></p>
<h3>Why serve your visitors search engine crawler directives?</h3>
<p>Complying with Google&#8217;s laws does not mean that you must deliver <a href="http://sebastians-pamphlets.com/links/categories/?cat=crawler-directives">crawler directives</a> like rel=&quot;nofollow&quot; to your visitors. Since Google is concerned about search engine rankings influenced by uncondomized links with commercial intent, serving crawler directives to crawlers and clean links to users is perfectly in line with Google&#8217;s goals. Actually, initiatives like the <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tag</a> make clear that hiding crawler directives from users is fine with Google. To underline that, here is a quote from <a href="http://www.mattcutts.com/blog/hidden-links/">Matt Cutts</a>:</p>
<blockquote><p>[&#8230;] If you want to sell a link, <b>you should at least provide machine-readable disclosure</b> for paid links by making your link in a way that doesn’t affect search engines. [&#8230;]</p>
<p>The other best practice I’d advise is to provide human readable disclosure that a link/review/article is paid. You could put a badge on your site to disclose that some links, posts, or reviews are paid, but including the disclosure on a per-post level would better. Even something as simple as &#8220;This is a paid review&#8221; fulfills the human-readable aspect of disclosing a paid article. [&#8230;]</p>
<p><b>Google’s quality guidelines are more concerned with the machine-readable aspect of disclosing paid links/posts</b> [&#8230;]</p>
<p>To make sure that you’re in good shape, go with both human-readable disclosure and machine-readable disclosure, using any of the methods [uncrawlable redirects, rel-nofollow] I mentioned above.<br />
[emphasis mine]</p></blockquote>
<p>Since Google devalues paid links anyway, search engine friendly cloaking of rel-nofollow for Googlebot is a non-issue with advertisers, as long as this fact is disclosed. I bet most link buyers look at the magic green pixels anyway, but that&#8217;s their problem.</p>
<h3>How to cloak rel-nofollow for search engine crawlers</h3>
<p>I&#8217;ll discuss a PHP/Apache example, but this method is easily adaptable to other server-side scripting languages like ASP. If you&#8217;ve a static site and PHP is available on your (*ix) host, you need to tell Apache that you&#8217;re using PHP in .html (.htm) files. Put this statement in your root&#8217;s .htaccess file: <code><br />
AddType application/x-httpd-php .html .htm</code></p>
<p>Next create a plain text file, insert the code below, and upload it as &#8220;funct_nofollow.php&#8221; or so to your server&#8217;s root directory (or a subdirectory, but then you need to change some code below). <code><br />
&lt;?php<br />
// Builds the REL attribute of an A element. Crawler directives are<br />
// added only when the requestor identifies itself as a crawler.<br />
function makeRelAttribute ($linkClass) {<br />
    $relValue                   = &quot;&quot;;<br />
    $numargs = func_num_args();<br />
    // optional 2nd input parameter: $relValue<br />
    if ($numargs >= 2) {<br />
        $relValue = func_get_arg(1) .&quot; &quot;;<br />
    }<br />
    // referrer and user agent are optional request headers<br />
    $referrer                   = isset($_SERVER[&quot;HTTP_REFERER&quot;]) ? $_SERVER[&quot;HTTP_REFERER&quot;] : &quot;&quot;;<br />
    $refUrl                     = parse_url($referrer);<br />
    $refHost                    = isset($refUrl[&quot;host&quot;]) ? $refUrl[&quot;host&quot;] : &quot;&quot;;<br />
    $isSerpReferrer             = FALSE;<br />
    if (stristr($refHost, &quot;google.&quot;) ||<br />
        stristr($refHost, &quot;yahoo.&quot;))<br />
        $isSerpReferrer         = TRUE;<br />
    $userAgent                  = isset($_SERVER[&quot;HTTP_USER_AGENT&quot;]) ? $_SERVER[&quot;HTTP_USER_AGENT&quot;] : &quot;&quot;;<br />
    $isCrawler                  = FALSE;<br />
    if (stristr($userAgent, &quot;Googlebot&quot;) ||<br />
        stristr($userAgent, &quot;Slurp&quot;))<br />
        $isCrawler              = TRUE;<br />
    // append the crawler directives matching the link class<br />
    if ($isCrawler  <b>/*</b>|| $isSerpReferrer<b>*/</b> ) {<br />
        if (&quot;$linkClass&quot; == &quot;ad&quot;)   $relValue .= &quot;advertising nofollow&quot;;<br />
        if (&quot;$linkClass&quot; == &quot;paid&quot;) $relValue .= &quot;sponsored nofollow&quot;;<br />
        if (&quot;$linkClass&quot; == &quot;own&quot;)  $relValue .= &quot;affiliated nofollow&quot;;<br />
        if (&quot;$linkClass&quot; == &quot;vote&quot;) $relValue .= &quot;editorial dofollow&quot;;<br />
    }<br />
    // no REL attribute at all when there is nothing to say<br />
    $relValue = trim($relValue);<br />
    if ($relValue == &quot;&quot;)<br />
        return &quot;&quot;;<br />
    return &quot; rel=\&quot;&quot; .$relValue .&quot;\&quot; &quot;;<br />
} // end function makeRelAttribute<br />
?&gt;   </code></p>
<p>Next put the code below in a PHP file you&#8217;ve included in all scripts, for example header.php. If you&#8217;ve static pages, then insert the code at the very top. <code><br />
&lt;?php<br />
@include($_SERVER[&quot;DOCUMENT_ROOT&quot;] .&quot;/funct_nofollow.php&quot;);<br />
?&gt;   </code><br />
Do not paste the function <code>makeRelAttribute</code> itself! If you spread code this way you&#8217;ve to edit tons of files when you need to change the functionality later on.</p>
<p>Now you can use the function <code>makeRelAttribute($linkClass,$relValue)</code> within the scripts or HTML pages. The function has an input parameter $linkClass and knows the (self-explanatory) values &#8220;ad&#8221;, &#8220;paid&#8221;, &#8220;own&#8221; and &#8220;vote&#8221;. The second (optional) input parameter is a value for the <a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=90#a-rel">A element&#8217;s REL attribute</a> itself. If you provide it, it gets appended, or, if <code>makeRelAttribute</code> doesn&#8217;t detect a spider, it creates a REL attribute with this value. Examples below. You can add more user agents, or serve rel-nofollow to visitors coming from SERPs by enabling the <code>|| $isSerpReferrer</code> condition (remove the bold <code>/*</code>&amp;<code>*/</code>).</p>
<p>When you code a hyperlink, just add the function to the A tag. Here is a PHP example: <code><br />
print &quot;&lt;a href=\&quot;http://google.com/\&quot;&quot; .makeRelAttribute(&quot;ad&quot;) .&quot;&gt;Google&lt;/a&gt;&quot;; </code><br />
will output<br />
<code>&lt;a href=&quot;http://google.com/&quot; rel=&quot;advertising nofollow&quot; &gt;Google&lt;/a&gt;</code><br />
when the user agent is Googlebot, and<br />
<code>&lt;a href=&quot;http://google.com/&quot;&gt;Google&lt;/a&gt;</code><br />
to a browser.</p>
<p>If you can&#8217;t write nice PHP code, for example because you&#8217;ve to follow crappy guidelines and worst practices with a WordPress blog, then you can mix <span style="color:blue;">HTML</span> and <span style="color:green;">PHP</span> tags: <code><br />
<span style="color:blue;">&lt;a href=&quot;http://search.yahoo.com/&quot;</span><span style="color:green;">&lt;?php print makeRelAttribute(&quot;paid&quot;); ?&gt;</span><span style="color:blue;">&gt;Yahoo&lt;/a&gt;</span>   </code></p>
<p>Please note that this method is not safe with search engines or unfriendly competitors when you want to cloak for other purposes. Also, the link condoms are served to crawlers only, that means search engine staff reviewing your site with a non-crawler user agent name won&#8217;t spot the nofollow&#8217;ed links unless they check the engine&#8217;s cached page copy. An HTML comment in HEAD like &#8220;This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.&#8221; as well as a similar comment line in robots.txt would certainly help to pass reviews by humans.</p>
<h3>A Google-friendly way to handle paid links, affiliate links, and cross linking</h3>
<p>Load this page with different user agents and referrers. You can do this for example with a Firefox extension like <a href="http://sebastians-pamphlets.com/referrer-spoofing-with-prefbar-341/">PrefBar</a>. For testing purposes you can use these user agent names: <code><br />
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)<br />
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) </code><br />
and these SERP referrer URLs: <code><br />
http://google.com/search?q=viagra<br />
http://search.yahoo.com/search?p=viagra&#038;ei=utf-8&#038;iscqry=&#038;fr=sfp </code><br />
Just enter these values in PrefBar&#8217;s user agent respectively referrer spoofing options (click &#8220;Customize&#8221; on the toolbar, select &#8220;User Agent&#8221; / &#8220;Referrerspoof&#8221;, click &#8220;Edit&#8221;, add a new item, label it, then insert the strings above). Here is the code above in action:</p>
<table style="margin-bottom:30px;">
<tr>
<td valign="top"><b>Referrer URL:</b></td>
<td valign="top"></td>
</tr>
<tr>
<td valign="top"><b>User Agent Name:</b></td>
<td valign="top">CCBot/1.0 (+http://www.commoncrawl.org/bot.html)</td>
</tr>
<tr>
<td valign="top"><b>Ad</b> makeRelAttribute(&quot;ad&quot;): </td>
<td valign="top"><a href="http://google.com/">Google</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>Paid</b> makeRelAttribute(&quot;paid&quot;): </td>
<td valign="top"><a href="http://search.yahoo.com/"  >Yahoo</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>Own</b> makeRelAttribute(&quot;own&quot;): </td>
<td valign="top"><a href="http://sebastians-pamphlets.com/"  >Sebastian&#8217;s Pamphlets</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>Vote</b> makeRelAttribute(&quot;vote&quot;): </td>
<td valign="top"><a href="http://link-condom.com/"  >The Link Condom</a> <code></code></td>
</tr>
<tr>
<td valign="top"><b>External</b> makeRelAttribute(&quot;&quot;, &quot;external&quot;): </td>
<td valign="top"><a href="http://w3.org/"  rel="external"  >W3C</a> <code> rel="external" </code></td>
</tr>
<tr>
<td valign="top"><b>Without parameters</b> makeRelAttribute(&quot;&quot;): </td>
<td valign="top"><a href="http://sphinn.com/"  >Sphinn</a> <code></code></td>
</tr>
</table>
<p>When you change your browser&#8217;s user agent to a crawler name, or fake a SERP referrer, the REL value will appear in the right column.</p>
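<p>If you prefer the command line over a browser extension, the same test can be scripted. The sketch below is not part of the original tutorial: the helper name makeSpoofContext() and the test URL are made up for this example, and it assumes that PHP&#8217;s HTTP stream wrapper is available (allow_url_fopen enabled): <code><br />
&lt;?php<br />
// Hypothetical helper: builds a stream context that sends a spoofed<br />
// user agent and, optionally, a spoofed referrer.<br />
function makeSpoofContext ($userAgent, $referrer = &quot;&quot;) {<br />
    $http = array(&quot;method&quot; =&gt; &quot;GET&quot;, &quot;user_agent&quot; =&gt; $userAgent);<br />
    if ($referrer != &quot;&quot;)<br />
        $http[&quot;header&quot;] = &quot;Referer: $referrer\r\n&quot;;<br />
    return stream_context_create(array(&quot;http&quot; =&gt; $http));<br />
}<br />
// Usage (example.com is a placeholder for your own test page):<br />
// $ctx = makeSpoofContext(&quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;);<br />
// print file_get_contents(&quot;http://example.com/test-page.html&quot;, FALSE, $ctx);<br />
?&gt;</code><br />
Fetched with a crawler&#8217;s user agent name, the page source should contain the REL values; fetched with a browser&#8217;s user agent string, it should not.</p>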
<p>When you&#8217;ve developed a better solution, or when you&#8217;ve a nofollow-cloaking tutorial for other programming languages or platforms, please let me know in the comments. Thanks in advance!</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/a-pragmatic-defense-against-googles-anti-paid-links-campaign/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
