<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; Google</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Still not yet speechless, just swamped</title>
		<link>http://sebastians-pamphlets.com/still-not-yet-speechless-just-swamped/</link>
		<comments>http://sebastians-pamphlets.com/still-not-yet-speechless-just-swamped/#comments</comments>
		<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Technorati]]></category>

		<category><![CDATA[Blogging]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/still-not-yet-speechless-just-swamped/</guid>
		<description><![CDATA[Long time no blogging &#8230; sorry folks. I&#8217;m swamped in a huge project that has nothing to do with SEO, and not much with webmastering at all. I&#8217;m dealing with complex backend systems and all my script outputs go to a closed user group, so I can&#8217;t even blog a new SEO finding or insight [...]]]></description>
			<content:encoded><![CDATA[<p><img  src="http://sebastians-pamphlets.com/img/blog/swamped.png" width="250" height="165" border="0" align="right" alt="Sebastian swamped" />Long time no blogging &#8230; sorry folks. I&#8217;m swamped in a huge project that has nothing to do with SEO, and not much with webmastering at all. I&#8217;m dealing with complex backend systems and all my script outputs go to a closed user group, so I can&#8217;t even blog a new SEO finding or insight every now and then. Ok, except experiences like &#8220;Google Maps Premier: &#8216;<a href="http://www.google.com/enterprise/maps/">organizations need more</a>&#8216; &#8230; well &#8230; contact to a salesman within days, not months or years &#8230; and of course prices on the Web site&#8221;. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>However, it&#8217;s an awesome experience to optimize business processes that are considered extremely painful in most companies out there. Time recording, payroll accounting, reimbursing of traveling expenses, project controlling and invoicing of time and material in complex service projects is a nightmare that requires handling of shitloads of paper, importing timesheets from spreadsheets, emails and whatnot, &#8230; usually. No longer. Compiling data from cellphones, PDAs, blackberries, iPhones, HTML forms, somewhat intelligent time clocks and so on in near real time is a smarter way to build the data pool necessary for accounting and invoicing, and allows fully automated creation of travel expense reports, payslips, project reports and invoices with a few mouse clicks in your browser. If you&#8217;re interested, drop me a line and I&#8217;ll link you to the startup company I&#8217;m working for.</p>
<p>Oh well, I&#8217;ve got a long list of topics I wanted to blog, but there&#8217;s no time left because I consider my <a href="http://sebastians-pamphlets.com/seos-home-alone-googles-nightmare/">cute monsters</a> more important than blogging and such stuff. For example, I was going to write a pamphlet about Technorati&#8217;s spam algos (do not ping too many of your worst enemy&#8217;s URLs too often because that&#8217;ll ban her/his blog), <a href="http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html">Google&#8217;s misunderstanding of the Robots Exclusion Protocol (REP)</a> (crawler directives like &#8220;disallow&#8221; in robots.txt do <b>not</b> forbid search engine indexing - the opposite is true), or smart ways to deal with <a href="http://www.mattcutts.com/blog/dont-end-your-urls-with-exe/">unindexable URIs that contain .exe files</a> when you&#8217;re using tools like <a href="http://www.progress.com/openedge/products/openedge_products/version9_datasheets/webspeed_workshop/index.ssp">Progress WebSpeed</a> on Windows boxes with their default settings (hint: <a href="http://httpd.apache.org/docs/1.3/mod/mod_alias.html">Apache&#8217;s script alias</a> ends your pain). Unfortunately, none of these posts will be written (soon). Anywayz, I&#8217;ll try to update you more often, but I can&#8217;t promise anything like that in the near future. Please don&#8217;t unsubscribe, I&#8217;ll come back to SEO topics. As for the comments, I&#8217;m still deleting all &#8220;thanks&#8221; and &#8220;great post&#8221; stuff linked to unusual URIs I&#8217;m not familiar with. <a href="http://sebastians-pamphlets.com/about/policies/">As usual</a>.</p>
<p><b>All the best!<br />
Sebastian</b></p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/still-not-yet-speechless-just-swamped/feed/</wfw:commentRss>
		</item>
		<item>
		<title>You can&#8217;t escape from Google-Jail when &#8230;</title>
		<link>http://sebastians-pamphlets.com/stuck-in-google-jail/</link>
		<comments>http://sebastians-pamphlets.com/stuck-in-google-jail/#comments</comments>
		<pubDate>Wed, 27 Feb 2008 11:50:18 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Reciprocal Links]]></category>

		<category><![CDATA[Webspam]]></category>

		<category><![CDATA[Risky Linkage]]></category>

		<category><![CDATA[Spam Report]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/stuck-in-google-jail/</guid>
		<description><![CDATA[&#8230; you&#8217;ve boosted your business Web site&#8217;s rankings with shitloads of crappy links. The 11th SEO commandment: Don&#8217;t promote your white hat sites with black hat link building methods! It may work for a while, but once you find your butt in Google-jail, there&#8217;s no way out. Not even a reconsideration request can help because [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/spammers-stuck-in-google-jail.png" width="200" height="301" align="right" style="margin-left:4px;" alt="spammers stuck in google jail" title="Stuck in Google-Jail" />&#8230; you&#8217;ve boosted your business Web site&#8217;s rankings with shitloads of crappy links. <abbr title="Don't halloo till you're out of the wood!">The 11th SEO commandment</abbr>: <b>Don&#8217;t promote your white hat sites with black hat link building methods!</b> It may work for a while, but once you find your butt in Google-jail, there&#8217;s no way out. Not even a reconsideration request can help because you can&#8217;t provide its prerequisites.</p>
<p>When you&#8217;re caught eventually &#8211;penalized for tons of stinky links&#8211; and have to file a reinclusion request, Google wants you to remove all the shady links you&#8217;ve spread on the Web before they lift your penalty. Here is an <a href="http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/59e796b85327a4b2/">example</a>, well documented in a Google Groups thread started by a penalized site owner with official statements from <a href="http://mattcutts.com/blog/">Matt Cutts</a> and <a href="http://johnmu.com/">John Müller</a> from Google.</p>
<p>The <a id="a-farawayfurniture" href="http://www.farawayfurniture.co.uk/" rel="nofollow until unpenalized">site in question</a>, a small family business from the UK, has used more or less every tactic from a lazy link builder&#8217;s textbook to create 40,000+ inbound links. Sponsored WordPress themes, paid links, comment spam, artificial link exchanges and whatnot. <script type="text/javascript">handle=document.getElementById("a-farawayfurniture"); handle.rel="client-sided dofollow"; </script></p>
<p>Most sites that carry these links, for example porn galleries, Web designers, US city guides, obscure oriental blogs, job boards, or cat masturbation guides, are in no way related to the penalized site, which deals with <em>modern teak garden furniture and home furniture sets</em>. (Don&#8217;t get me wrong. Of course not every link has to be topically related. Every link from a trusted page can pass PageRank, and can improve crawling, indexing, and so on.) </p>
<p>Google has absolutely no problem with unrelated links, unless a site&#8217;s link profile consists of way too many spammy and/or unrelated links. That does not mean that spreading a gazillion low-life links pointing to a competitor will get this site penalized or even banned. Negative SEO is not that simple. For an innocent site Google just ignores spammy inbound links, but most probably flags it for further investigations, both manually as well as algorithmically.</p>
<p>If on the other hand Google finds evidence that a site is actively involved in link monkey business of any kind, that&#8217;s a completely different story. Such evidence could be massively linking out to spammy places, hosting reciprocal links pages or FFA directories, unskillful (manual|automated) comment spam, signature links and mentions at places that trade links, textual contents made for (paid) link campaigns when reused too often, buying links from trackable services, (link request emails forwarded via) paid-link/spam reports, and so on. </p>
<p>Below is the &#8220;how to file a successful reconsideration request when your sins include link spam&#8221; from Googlers.</p>
<p><a href="http://groups.google.com/group/Google_Webmaster_Help-Indexing/msg/32db3e8e1fbf54e8">Matt Cutts</a>:</p>
<blockquote><p>The recommendation from your SEO guy led you directly into a pretty high-risk area; I doubt you really want pages like <a rel="nofollow" onclick="return false;" href="http://www.fuckingfilthy.com/filthy-hardcore/amateur-black-teen-sex-freak-is-a-closet-lesbian/"><script type="text/javascript">document.write("http:// www.fuckingfilthy.com/ filthy-hardcore/ amateur-black-teen-sex-freak-is-a-closet-lesbian/");</script> (<b>NSAW</b>)</a> having sponsored links to your furniture site anyway. It&#8217;s definitely possible to extricate your site, but I would make an effort to contact the sites with your sponsored links and request that they remove the links, and then do a reconsideration request. Maybe in the text of your reconsideration request, I&#8217;d include a pointer to this thread as well.</p>
</blockquote>
<p><a href="http://groups.google.com/group/Google_Webmaster_Help-Indexing/msg/6ac5fb93035e9735">John Müller</a>:</p>
<blockquote><p>You may want to consider what you can do to help clean up similar [=spammy] links on other people&#8217;s sites. Blogs and newspaper sites such as <a href="http://media.www.dailypennsylvanian.com/media/storage/paper882/news/2002/09/30/Opinion/Kelly.Lynch.And.Dennis.Tupper.A.Focus.On.Integrity-2156926.shtml">http://media.www.dailypennsylvanian.com</a> sometimes receive short comments such as &#8220;dont agree&#8221;, apparently only for a link back to a site. These comments often use keywords from that site instead of a user name, perhaps &#8220;tree bench&#8221; for a furniture site or &#8220;sexy shoes&#8221; for a footwear site. If this kind of behavior might have taken place for your site, you may want to work on rectifying it and include some information on it in your reconsideration request. Given your situation, the person considering your reconsideration request might be curious about links like that.</p>
</blockquote>
<p>Translation: <b>We&#8217;ll ignore your weekly reconsideration requests unless you&#8217;ve removed all artificial links pointing to your site</b>. You&#8217;re stuck in Google&#8217;s dungeon because they&#8217;ve thrown away the keys.</p>
<p>I&#8217;d guess that for a site that has filed a reinclusion request stating the site was involved in some sort of link monkey business, Google applies a stricter policy than to a site that was attacked by negative SEO methods. When you&#8217;re caught red-handed, a lame excuse like &#8220;I didn&#8217;t create those links&#8221; is not a tactic I could recommend, because Googlers hate it when an applicant lies in a reinclusion request.</p>
<p>Once caught and penalized, the &#8220;since when do inbound links count as negative votes&#8221; argument doesn&#8217;t apply. It&#8217;s quite clear that removing the traces (admitted as well as not admitted shady links) is a prerequisite for a penalty lift, even though Google has already discounted these links. That&#8217;s the same as with penalized doorway pages: redirecting doorways to legit landing pages doesn&#8217;t count; Google wants to see a 410-Gone HTTP response code (or at least a 404) before they un-penalize a site.</p>
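<p>To illustrate the 410-versus-redirect point for doorway pages, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the paths, the two URL sets, and the function names. It is not how any particular site or server does it, just one way to answer with 410 Gone for pages that were removed on purpose instead of redirecting them.</p>
<pre>
```python
from http import HTTPStatus

# Hypothetical URL sets, made up for this example.
REMOVED_DOORWAYS = {"/cheap-teak-benches.html", "/garden-furniture-deals.html"}
LIVE_PAGES = {"/", "/garden-furniture/"}


def status_for(path):
    """Pick the response status for a requested path."""
    if path in REMOVED_DOORWAYS:
        return HTTPStatus.GONE        # 410: removed on purpose, for good
    if path in LIVE_PAGES:
        return HTTPStatus.OK          # 200: regular content
    return HTTPStatus.NOT_FOUND       # 404: unknown path


def app(environ, start_response):
    """Minimal WSGI app that answers with the status chosen above."""
    status = status_for(environ.get("PATH_INFO", "/"))
    start_response(f"{status.value} {status.phrase}", [("Content-Type", "text/plain")])
    return [status.phrase.encode("ascii")]
```
</pre>
<p>For a local test, the app could be served with <code>wsgiref.simple_server.make_server</code> from the Python standard library. On a real site the same decision would more likely live in the Web server&#8217;s configuration.</p>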
<p>I doubt that&#8217;s common knowledge to folks who promote their white hat sites with black hat methods. Getting links wiped out at places that didn&#8217;t check the intention of inserted links in the first place is a royal PITA, in other words, it&#8217;s impossible to get all shady links removed once you find your butt in Google-jail. That&#8217;s extremely uncomfortable for site owners who fell for questionable forum advice or hired a promotional service (no, I don&#8217;t call such assclowns SEOs) applying shady marketing methods without a clear and written warning that those are extremely risky, fully explained and signed by the client.</p>
<p>Maybe in some cases Google will un-penalize a great site although not all link spam was wiped out. However, the costs and efforts of preparing a successful reconsideration request are immense, not to speak of the massive loss of traffic and income.</p>
<p>As <a href="http://cartoonbarry.com/">Barry</a> mentioned, the thread linked above might be interesting for folks keen on an official confirmation that <a href="http://www.seroundtable.com/archives/016342.html">Google -60 penalties</a> exist. I&#8217;d say such SERP penalties (aka <a href="http://googlewebmastercentral.blogspot.com/2007/03/update-on-spam-reporting.html">red &amp; yellow cards</a>) aren&#8217;t exactly new, and it doesn&#8217;t matter to which position a site penalized for guideline violations gets downranked. When I&#8217;ve lost a top spot for gaming Google, that&#8217;s kismet. I&#8217;m not interested in figuring out that 20k spammy links get me a -30 penalty, 40k shady links result in a -60 penalty, and 100k unnatural links qualify me for the famous -950 bashing (the numbers are made up of course). If I spammed, I&#8217;d just move on because I&#8217;d have already launched enough other projects to compensate for the losses.</p>
<p>PS: While I was typing, Barry Schwartz posted his <a href="http://www.seroundtable.com/archives/016380.html">Google-Jail story at SE Roundtable</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/stuck-in-google-jail/feed/</wfw:commentRss>
		</item>
		<item>
		<title>@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!</title>
		<link>http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/</link>
		<comments>http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/#comments</comments>
		<pubDate>Mon, 25 Feb 2008 11:08:34 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[URL removal]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/</guid>
		<description><![CDATA[Matt Cutts asks us How should Google handle NOINDEX? That&#8217;s a tough question worth thinking twice before you submit a comment to Matt&#8217;s post. Here is Matt&#8217;s question, all the background information you need, and my opinion.
What is NOINDEX?
Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/dear-google-please-respect-noindex.png" width="250" height="230" align="right" style="margin-left:4px;" alt="Dear Google, please respect NOINDEX" title="Dear Google, please respect NOINDEX, it means don't mention on SERPs!" />Matt Cutts asks us <a href="http://www.mattcutts.com/blog/google-noindex-behavior/">How should Google handle NOINDEX?</a> That&#8217;s a tough question worth thinking twice before you <a href="http://www.mattcutts.com/blog/google-noindex-behavior/#postcomment">submit a comment to Matt&#8217;s post</a>. Here is Matt&#8217;s question, all the background information you need, and my opinion.</p>
<h3>What is NOINDEX?</h3>
<p><a href="http://www.robotstxt.org/meta.html">Noindex</a> is an indexer directive defined in the <a href="http://sebastians-pamphlets.com/links/categories/?cat=crawler-directives">Robots Exclusion Protocol</a> (REP) from 1996 for use in <a href="http://sebastians-pamphlets.com/links/categories/?cat=robots-meta-tags">robots meta tags</a>. Putting a <b>NOINDEX</b> value in a page&#8217;s robots meta tag or <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tag</a> <b>tells search engines that they shall not index the page content</b>, but may follow links provided on the page.</p>
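<p>A minimal sketch of that rule in Python, not any search engine&#8217;s actual code. The function names are made up for illustration, and it assumes the directive values have already been extracted from the robots meta tag&#8217;s <code>content</code> attribute and from the <code>X-Robots-Tag</code> header, without a user-agent prefix. It treats NONE as shorthand for NOINDEX plus NOFOLLOW.</p>
<pre>
```python
def parse_directives(value):
    """Split a comma-separated directive string into lowercase tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


def may_index(meta_content="", x_robots_tag=""):
    """Return False when either source says NOINDEX (or NONE, its shorthand)."""
    directives = parse_directives(meta_content) | parse_directives(x_robots_tag)
    return directives.isdisjoint({"noindex", "none"})


def may_follow(meta_content="", x_robots_tag=""):
    """NOINDEX alone does not stop link following; NOFOLLOW or NONE does."""
    directives = parse_directives(meta_content) | parse_directives(x_robots_tag)
    return directives.isdisjoint({"nofollow", "none"})
```
</pre>
<p>With a value of <code>noindex</code> alone, <code>may_index</code> returns False while <code>may_follow</code> stays True, which mirrors the &#8220;don&#8217;t index, but links may be followed&#8221; reading above.</p>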
<p>To <a href="http://sebastians-pamphlets.com/robots-exclusion-protocol-round-up-2008-01/">get a grip on NOINDEX&#8217;s role in the REP</a> please read my <a href="http://www.seomoz.org/blog/robots-exclusion-protocol-101">Robots Exclusion Protocol summary at SEOmoz</a>. Also, <a href="http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/">Google experiments with NOINDEX as crawler directive</a> in <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, more on that later.</p>
<h3>How major search engines treat NOINDEX</h3>
<p>Of course you could <a href="http://sebastians-pamphlets.com/links/categories/?definitions=TRUE">read a ton of my pamphlets</a> to extract this information, but <a href="http://www.mattcutts.com/blog/handling-noindex-meta-tags/">Matt&#8217;s summary</a> is still accurate and easier to digest:</p>
<blockquote><ul>[Matt Cutts on August 30, 2006]
<li>Google doesn’t show the page in any way.</li>
<li>Ask doesn’t show the page in any way.</li>
<li>MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.</li>
<li>Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.</li>
</ul>
<p>Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]</p>
</blockquote>
<p>Google&#8217;s experimental support of NOINDEX as a crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#existing-rep-tags">the DISALLOW behavior of NOINDEX in robots.txt is completely wrong</a>, and without any doubt in no way compliant with the Robots Exclusion Protocol.</p>
<h3>Matt&#8217;s question: How should Google handle NOINDEX in the future?</h3>
<p>To simplify <a href="http://www.mattcutts.com/blog/wp-content/plugins/democracy/democracy.php?dem_action=show_vote_screen&#038;dem_poll_id=6">Matt&#8217;s poll</a>, let&#8217;s assume he&#8217;s talking about NOINDEX as an <b>indexer directive</b>, regardless of where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).</p>
<blockquote><p>The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?</p>
</blockquote>
<p>Here are the arguments, or pros and cons, for each variant:</p>
<dl>
<dt>Google should completely drop a NOINDEX’ed page from their search results</dt>
<dd>
<p>Obviously that&#8217;s what most Webmasters would prefer:</p>
<blockquote><p>This is the behavior that we&#8217;ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way &#8212; in fact, one of the only ways &#8212; to completely remove all traces of a site from Google (another way is our <a href="http://www.google.com/webmasters/tools/removals">url removal tool</a>). That&#8217;s incredibly useful for webmasters.</p>
</blockquote>
<p><b>NOINDEX means don&#8217;t index</b>, search engines must respect such directives, even when the content isn&#8217;t <a href="http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/">password protected</a> or <a href="http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/">cloaked away</a> (redirected or hidden for crawlers but not for visitors). </p>
<p>The corner case that Google discovers a link and lists it on their SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn&#8217;t crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which is requested by search engines quite frequently. Ok, maybe Google&#8217;s <abbr title="Ms. Googlebot">BlitzCrawler&trade;</abbr> has to request robots.txt more often then.</p>
</dd>
<dt>Google should show a reference to NOINDEX&#8217;ed pages on their SERPs</dt>
<dd>
<p>Search quality and user experience are strong arguments:</p>
<blockquote><p>Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don&#8217;t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google&#8217;s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren&#8217;t showing up in Google because of the NOINDEX meta tag, that&#8217;s bad for users (and thus for Google).</p>
</blockquote>
<p>Search quality and searchers&#8217; user experience are also strong arguments for totally delisting NOINDEX&#8217;ed pages, because most Webmasters use this indexer directive to keep stuff that doesn&#8217;t provide value for searchers out of the search indexes. &lt;polemic&gt;I mean, how much weight do a few Korean sites carry when it comes to decisions that affect the whole Web?&lt;/polemic&gt;</p>
<p>If a Webmaster puts a NOINDEX directive by accident, that&#8217;s easy to spot in the site&#8217;s stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.</p>
<p>The reference to the URI removal tools is out of context, because these tools remove a URI only for a short period of time and all removal requests have to be resubmitted every few weeks. NOINDEX on the other hand is a way to keep a URI out of the index as long as this indexer directive is provided. </p>
<p>I&#8217;d say the sole argument for listing references to NOINDEX&#8217;ed pages that counts is misleading navigational searches. Of course that does not mean that Google may ignore the NOINDEX directive to show &#8211;with a linked reference&#8211; that they know a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.</p>
</dd>
<dt>Something in between, Google should find a reasonable way to please both Webmasters and searchers</dt>
<dd>
<p>Quoting Matt again:</p>
<blockquote><p>The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don&#8217;t want to show up in Google). Users are most discouraged when they search for a well-known site and can&#8217;t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don&#8217;t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.</p>
</blockquote>
<p>Whether or not a site is popular must not impact a search engine&#8217;s respect for a Webmaster&#8217;s decision to keep search engines, and their users, out of her realm. That reads like &#8220;Hey, Google is popular, so we&#8217;ve the right to go to Mountain View to pillage the Googleplex, acquiring everything we can steal for the public domain&#8221;. Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google&#8217;s idea of (link) popularity should rule the Web. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web&#8217;s content wasn&#8217;t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I&#8217;m not bashing DMOZ here, but as a matter of fact it&#8217;s not comprehensive enough to serve as an indicator for anything, especially not importance and popularity.</p>
<p>I strongly believe that there&#8217;s no such thing as a criterion suitable to mark out a two class Web.</p>
</dd>
</dl>
<h3>My take: Yes, No, Depends</h3>
<p>Google could enhance navigational queries &#8211;and even &#8220;I&#8217;m Feeling Lucky&#8221; queries&#8211; that lead to a NOINDEX&#8217;ed page with a message like &#8220;The best matching result for this query was blocked by the site&#8221;. I wouldn&#8217;t mind if they mention the URI as long as it&#8217;s not linked.</p>
<p>In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we&#8217;ve more REP indexer directives that could play a role in this discussion.</p>
<p>NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site&#8217;s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX&#8217;ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.</p>
<p>Of course that&#8217;s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it&#8217;s way better than ignoring NOINDEX in its current meaning. </p>
<h3>A fair solution: NOREFERENCE</h3>
<p>If I made the decision at Google and couldn&#8217;t live with a <em>best matching search result blocked</em> message, I&#8217;d go for a new REP tag:</p>
<p>&#8220;NOINDEX, NOREFERENCE&#8221; in a robots meta tag &#8211;respectively Googlebot meta tag&#8211; or X-Robots-Tag forbids search engines to show a reference on their SERPs. In robots.txt this would look like <code><br />
<b>NOINDEX: /<br />
NOINDEX: /blog/<br />
NOINDEX: /members/<br />
&#8230;<br />
NOREFERENCE: /<br />
NOREFERENCE: /blog/<br />
NOREFERENCE: /members/<br />
&#8230;</b></code><br />
Search engines would crawl these URIs, and follow their links as long as there&#8217;s no NOFOLLOW directive either in robots.txt or a page specific instruction.</p>
<p>NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but would allow references on SERPs. Supporting this indexer directive both in robots.txt and on the page (respectively in the HTTP header for X-Robots-Tags) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like <code><b>NOINDEX=NOREFERENCE</b></code> for robots.txt could tell search engines how they have to treat NOINDEX statements on site level, or even on site area level.</p>
<p>Even more appealing would be <code><b>NOINDEX=REFERENCE</b></code>, because only the very few Webmasters that would like to see their NOINDEX&#8217;ed URIs on Google&#8217;s SERPs would have to add a directive to their robots.txt at all. Unfortunately, that&#8217;s not doable for Google unless they can convince three well known Korean sites to edit their robots.txt. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
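<p>To make the proposed semantics a bit more tangible, here is a rough Python sketch of how a crawler could read the NOINDEX and NOREFERENCE lines from the robots.txt example above. Keep in mind that this syntax is only my proposal, not part of any standard and not something a search engine is known to support. The function names, the simple prefix matching, and the choice to ignore a NOREFERENCE that comes without a matching NOINDEX are assumptions made for the example. The <code>NOINDEX=...</code> variants are not covered.</p>
<pre>
```python
def parse_proposed_directives(robots_txt):
    """Collect path prefixes from hypothetical NOINDEX / NOREFERENCE lines."""
    rules = {"noindex": [], "noreference": []}
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and whitespace
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in rules and value.strip():
            rules[key].append(value.strip())
    return rules


def serp_treatment(path, rules):
    """Decide how a path may appear on SERPs under the proposed semantics."""
    noindex = any(path.startswith(prefix) for prefix in rules["noindex"])
    noreference = any(path.startswith(prefix) for prefix in rules["noreference"])
    if not noindex:
        return "index"            # no NOINDEX: treated like any other page
    if noreference:
        return "hide"             # NOINDEX plus NOREFERENCE: no trace on SERPs
    return "reference-only"       # NOINDEX alone: a bare reference is allowed
```
</pre>
<p>Fed with NOINDEX lines for <code>/blog/</code> and <code>/members/</code> plus a NOREFERENCE line for <code>/members/</code> only, the sketch hides everything under <code>/members/</code>, allows a bare reference for blog URIs, and treats the rest of the site as indexable.</p>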
<p>&nbsp;</p>
<p>By the way, don&#8217;t miss out on my draft asking for <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">REP tag support in robots.txt</a>!</p>
<p>Anyway: <b>Dear Google, please don&#8217;t touch NOINDEX!</b> <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/give-google-your-feedback-on-noindex/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Nofollow still means don&#8217;t follow, and how to instruct Google to crawl nofollow&#8217;ed links nevertheless</title>
		<link>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/</link>
		<comments>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/#comments</comments>
		<pubDate>Sat, 23 Feb 2008 14:51:14 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Paid Links]]></category>

		<category><![CDATA[Testing]]></category>

		<category><![CDATA[Anchor Text]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/</guid>
		<description><![CDATA[What was meant as a quick test of rel-nofollow once again (inspired by Michelle&#8217;s post stating that nofollow&#8217;ed comment author links result in rankings), led to some interesting observations:

Google uses sneaky JavaScript links (that mask nofollow&#8217;ed static links) for discovery crawling, and indexes the link destinations even though there&#8217;s no hard-coded link on any [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/painting-nofollow-dofollow.png" width="250" height="220" align="right" alt="painting a nofollow'ed link dofollow" style="margin-left:4px;" title="How to paint a nofollow'ed link dofollow" />What was meant as a quick test of <a href="http://sebastians-pamphlets.com/links/categories/&amp;cat=nofollow">rel-nofollow</a> once again (inspired by <a href="http://www.michellemacphearson.com/do-nofollow-links-count-redux/">Michelle&#8217;s post</a> stating that nofollow&#8217;ed comment author links result in rankings), led to some interesting observations:</p>
<ul>
<li>Google uses sneaky JavaScript links (that mask nofollow&#8217;ed static links) for discovery crawling, and indexes the link destinations even though there&#8217;s no hard-coded link to them on any page on the whole Web.</li>
<li>Google doesn&#8217;t crawl URIs that it finds only in nofollow&#8217;ed links.</li>
<li>Google most probably doesn&#8217;t use anchor text that is output client-side in rankings for the page that carries the JavaScript link.</li>
<li>Google most probably doesn&#8217;t pass anchor text of JavaScript links to the link destination.</li>
<li>Google doesn&#8217;t pass anchor text of (hard coded) nofollow&#8217;ed links to the link destination.</li>
</ul>
<p>As for my inspiration, I guess not all links in Michelle&#8217;s test were truly nofollow&#8217;ed. However, she&#8217;s spot on stating that condomized author links aren&#8217;t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don&#8217;t pay too much attention to REL attributes when you spread your links.</p>
<p>As for my quick test explained below, please consider it an inspiration too. It&#8217;s not a full blown SEO test, because I&#8217;ve checked one single scenario for a short period of time. However, looking only at the results from the first 24 hours after uploading the test makes it fairly certain that the test isn&#8217;t influenced by external noise, for example scraped links and such stuff.</p>
<p>On 2008-02-22 06:20:00 I put a new nofollow&#8217;ed link onto my sidebar: <a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-2-a" rel="nofollow"><span id="repstuff-something-2-b">Zilchish Crap</span></a> <script type="text/javascript"> handle=document.getElementById("repstuff-something-2-b"); handle.firstChild.data="Nillified, Nil"; handle=document.getElementById("repstuff-something-2-a"); handle.href="http://sebastians-pamphlets.com/repstuff/something.php?nil=js1"; handle.rel="dofollow"; </script><code><small><br />
&lt;a href=&quot;http://sebastians-pamphlets.com/repstuff/something.php&quot; id=&quot;repstuff-something-a&quot; rel=&quot;nofollow&quot;&gt;&lt;span id=&quot;repstuff-something-b&quot;&gt;Zilchish Crap&lt;/span&gt;&lt;/a&gt;<br />
&lt;script type=&quot;text/javascript&quot;&gt;<br />
handle=document.getElementById(&#39;repstuff-something-b&#39;);<br />
handle.firstChild.data=&#39;Nillified, Nil&#39;;<br />
handle=document.getElementById(&#39;repstuff-something-a&#39;);<br />
handle.href=&#39;http://sebastians-pamphlets.com/repstuff/something.php?nil=js1&#39;;<br />
handle.rel=&#39;dofollow&#39;;<br />
&lt;/script&gt; </small></code><br />
(The JavaScript code changes the link&#8217;s HREF, REL and anchor text.)</p>
<p>The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow&#8217;ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.</p>
<h3>Google crawls URIs extracted from somewhat sneaky JavaScript code</h3>
<p>About 27 minutes later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element&#8217;s HREF: <code><br />
66.249.72.5 	2008-02-22 06:47:07 	200-OK 	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 	/repstuff/something.php?nil=js1</code></p>
<p>Roughly three hours after this visit Googlebot fetched a URI provided only in JS code on the test page: <code><small><br />
handle=document.getElementById(&#39;a1&#39;);<br />
handle.href=&#39;http://sebastians-pamphlets.com/repstuff/something.php?nil=js2&#39;;<br />
handle.rel=&#39;dofollow&#39;; </small></code><br />
From the log: <code><br />
66.249.72.5 	2008-02-22 09:37:11 	200-OK 	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 	/repstuff/something.php?nil=js2</code></p>
<p>So far Google has ignored the hidden JavaScript link to <code>/repstuff/something.php?nil=js3</code> on the test page. Its code doesn&#8217;t change a static link, so that makes sense in the context of repeated statements like &#8220;Google ignores JavaScript links / treats them like nofollow&#8217;ed links&#8221; by Google reps.</p>
<p class="excursus">Of course the JS code above is easy to analyze, but don&#8217;t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!</p>
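<p>If you want to repeat the test, here is a minimal sketch of how log lines like the ones above could be filtered. It assumes tab-separated fields in the order shown (IP, timestamp, status, user agent, request URI). The names <code>parseLogLine</code> and <code>crawledVariants</code> are made up for this illustration, and matching on the user agent string alone doesn&#8217;t prove that a request really came from Google.</p>

```javascript
// Hypothetical helpers, not part of the original test setup.
// Assumes tab-separated log lines: IP, timestamp, status, user agent, URI.
function parseLogLine(line) {
    const fields = line.split("\t").map((field) => field.trim());
    if (fields.length < 5) {
        return null; // not a line in the assumed format
    }
    const [ip, timestamp, status, userAgent, uri] = fields;
    return { ip, timestamp, status, userAgent, uri };
}

// Returns the ?nil=... values requested by user agents claiming to be Googlebot.
function crawledVariants(lines) {
    const variants = [];
    for (const line of lines) {
        const entry = parseLogLine(line);
        if (entry === null || !/Googlebot/i.test(entry.userAgent)) {
            continue;
        }
        const match = /[?&]nil=([^&]+)/.exec(entry.uri);
        if (match !== null && !variants.includes(match[1])) {
            variants.push(match[1]);
        }
    }
    return variants;
}

const sampleLog = [
    "66.249.72.5 \t2008-02-22 06:47:07 \t200-OK \tMozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) \t/repstuff/something.php?nil=js1",
    "66.249.72.5 \t2008-02-22 09:37:11 \t200-OK \tMozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) \t/repstuff/something.php?nil=js2",
];
console.log(crawledVariants(sampleLog));
```

<p>Fed with the two log lines quoted above, the sketch reports the variants js1 and js2, that is, the URIs that exist only in JavaScript code.</p>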
<h3>Google indexes pages that have only JavaScript links pointing to them</h3>
<p>The next day I checked the search index, and the <a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=zilchish%7Cnillyfiable+site%3Asebastians-pamphlets.com">results</a> are interesting:</p>
<p><img src="http://sebastians-pamphlets.com/img/google/nofollow-zilchish-nullifable-google-serp-24h.png" width="498" height="421" alt="rel-nofollow-test search results" title="Google indexes JS manipulated anchor text and content referenced only in JS links" /></p>
<p>The first search result is the content of the URI with the query string parameter <code>?nil=js1</code>, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI <code>/repstuff/something.php</code> without query string. There&#8217;s not a single real link to this URI elsewhere.</p>
<p>The second search result is a post URI where Google recognized the hard coded anchor text &#8220;zilchish crap&#8221;, but not the JS code that overwrites it with &#8220;Nillified, Nil&#8221;. With the SERP-URI parameter &#8220;&amp;filter=0&#8221; Google shows more posts that are findable with the search term [zilchish]. (Hey <a href="http://mattcutts.com/blog/">Matt</a> and <a href="http://brianwhite.org/">Brian</a>, here&#8217;s room for improvement!)</p>
<h3>Google doesn&#8217;t pass anchor text of nofollow&#8217;ed links to the link destination</h3>
<p>A search for [<a href="http://www.google.com/search?q=zilchish+site:sebastians-pamphlets.com&#038;num=100&#038;hl=en&#038;filter=0&#038;safe=off">zilchish site:sebastians-pamphlets.com</a>] doesn&#8217;t show the test page, which doesn&#8217;t carry this term. In other words, so far the anchor text &#8220;zilchish crap&#8221; of the nofollow&#8217;ed sidebar link hasn&#8217;t impacted the test page&#8217;s rankings. </p>
<h3>Google doesn&#8217;t treat anchor text of JavaScript links as textual content</h3>
<p>A search for [<a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=nillified+site%3Asebastians-pamphlets.com">nillified site:sebastians-pamphlets.com</a>] doesn&#8217;t show any URIs that have &#8220;Nillified, Nil&#8221; as client-side anchor text on the sidebar, just the test page:</p>
<p><img src="http://sebastians-pamphlets.com/img/google/nofollow-nillified-google-serp-24h.png" width="498" height="277" alt="rel-nofollow-test search results" title="Google indexes content from JS manipulated URIs" /></p>
<h3>Results, conclusions, speculation</h3>
<p>This test wasn&#8217;t intended to evaluate whether JS outputted anchor text gets passed to the link destination or not. Unfortunately &#8220;nil&#8221; and &#8220;nillified&#8221; appear both in the JS anchor text as well as on the page, so that&#8217;s for another post. However, it seems the JS anchor text isn&#8217;t indexed for the pages carrying the JS code, at least they don&#8217;t appear in search results for the JS anchor text, so most likely it will not be assigned to the link destination&#8217;s relevancy for &#8220;nil&#8221; or &#8220;nillified&#8221; as well. </p>
<p>Maybe Google&#8217;s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows. </p>
<p>However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I&#8217;m quite confident that condomized links still pass neither reputation nor topical relevance. My test page is unfindable for the nofollow&#8217;ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.</p>
<h3>How to safely strip a <a href="http://link-condom.com/">link condom</a></h3>
<p><b>And what&#8217;s the actual &#8220;news&#8221;?</b> Well, say you&#8217;ve links that you must condomize because they&#8217;re paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow&#8217;ed link server-side, and change it to a clean link with JavaScript. Google told us for ages that JS links don&#8217;t count, so that&#8217;s perfectly in line with Google&#8217;s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don&#8217;t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client-side version of the link. Isn&#8217;t that neat?</p>
<h3>Link condoms <strike>with juicy taste</strike> faking good karma</h3>
<p>Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they <i style="border: medium dotted firebrick; color:navy; background:pink;">look plain ugly</i> otherwise.</p>
<p>Here is how you can do this for a complete Web page. <a href="http://example.com/" rel="nofollow example" title="Nofollow'ed and unclickable link example, use 'view source' to check it out" onclick="return false;">This link is nofollow&#8217;ed</a>. The JavaScript code below changed its REL value to &#8220;dofollow&#8221;. When you put this code <em>at the bottom of your pages</em>, it will un-condomize all your nofollow&#8217;ed links. <code><br />
&lt;script type=&quot;text/javascript&quot;&gt;<br />
    if (document.getElementsByTagName) {<br />
        var aElements = document.getElementsByTagName(&quot;a&quot;);<br />
        for (var i=0; i&lt;aElements.length; i++) {<br />
            var relvalue = aElements[i].rel.toUpperCase();<br />
            if (relvalue.match(&quot;NOFOLLOW&quot;) !== null) {<br />
                aElements[i].rel = &quot;dofollow&quot;;<br />
            }<br />
        }<br />
    }<br />
&lt;/script&gt;   </code></p>
<p><script type="text/javascript">
    if (document.getElementsByTagName) {
        var aelements = document.getElementsByTagName("a");
        for (var i=0; i<aelements.length; i++) {
            var relvalue = aelements[i].rel.toUpperCase();
            if (relvalue.match("NOFOLLOW") !== null) {
                aelements[i].rel = "dofollow";
            }
        }
    }
</script></p>
<p>(You&#8217;ll still find condomized links on this page. That&#8217;s because the JavaScript routine above changes only links placed above it.)</p>
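<p>A slightly more careful variant, as a sketch only: the routine above overwrites the whole REL value, so a link with <code>rel="nofollow example"</code> loses its other tokens too. The helper <code>stripNofollow()</code> below (a made-up name, not from any library) removes just the nofollow token and leaves the rest alone. It works on plain strings, so you can try it outside a browser; the browser part assumes the script is placed after the links it should change.</p>

```javascript
// Hypothetical helper: removes only the "nofollow" token from a REL value.
function stripNofollow(relValue) {
    return relValue
        .split(/\s+/)
        .filter((token) => token !== "" && token.toLowerCase() !== "nofollow")
        .join(" ");
}

// In a browser (assumption: this runs after the links it should change):
if (typeof document !== "undefined") {
    const links = document.getElementsByTagName("a");
    for (let i = 0; i < links.length; i++) {
        links[i].rel = stripNofollow(links[i].rel);
    }
}

console.log(stripNofollow("nofollow example"));  // other tokens survive
console.log(stripNofollow("external NoFollow")); // match is case-insensitive
console.log(stripNofollow("tag"));               // untouched
```

<p>Unlike the original routine this doesn&#8217;t write a &#8220;dofollow&#8221; value, which isn&#8217;t a defined REL value anyway; for CSS that highlights nofollow&#8217;ed links the effect should be the same.</p>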
<p>When you add JavaScript routines like that to your pages, you&#8217;ll increase their page loading time. IOW you slow them down. Also, you should add a note to your <a href="http://sebastians-pamphlets.com/links/full-disclosure/">linking policy</a> to avoid confused advertisers who chase toolbar PageRank.</p>
<p><b>Updates:</b> Obviously Google distrusts me, how come? Four days after the link discovery the <abbr title="Googlebot coming from another IP">search quality archangel</abbr> requested the nofollow&#8217;ed URI &#8211;without query string&#8211; possibly to check whether I serve different stuff to bots and people. As if I&#8217;d cloak, laughable. (Or an assclown linked the URI without condom.)<br />
Day five: Google&#8217;s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren&#8217;t interested in client-sided links at all?</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-dynamically-change-nofollow-to-dofollow/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Save bandwidth costs: Dynamic pages can support If-Modified-Since too</title>
		<link>http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/</link>
		<comments>http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/#comments</comments>
		<pubDate>Tue, 19 Feb 2008 19:35:30 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/</guid>
		<description><![CDATA[When search engine crawlers burn way too much of your bandwidth, this post is for you. Crawlers sent out by major search engines (Google, Yahoo and MSN/Live Search) support conditional GETs, which means they don&#8217;t fetch your pages again if those didn&#8217;t change since the last crawl.
Of course they must fetch your stuff over and over [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/conditional-http-get-request.png" width="250" height="250" align="right" style="margin-left:4px;" alt="Conditional HTTP GET requests make Webmasters and Crawlers happy" title="Webmasters and crawlers LOVE conditional GET requests!" />When search engine crawlers burn way too much of your bandwidth, this post is for you. Crawlers sent out by major search engines (<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40203">Google</a>, <a href="http://www.ysearchblog.com/archives/000078.html">Yahoo</a> and <a href="http://sebastians-pamphlets.com/live-search-announces-msnbot-1-1/">MSN/Live Search</a>) support conditional GETs, which means they don&#8217;t fetch your pages again if those didn&#8217;t change since the last crawl.</p>
<p>Of course they must fetch your stuff over and over again for this comparison, if your Web server doesn&#8217;t play nice with Web robots, as well as with other user agents that <em>can</em>&nbsp; cache your pages and other Web objects like images. The <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html">protocol</a> your Web server and the requestors use to handle caching is quite simple, but its implementation can become tricky. <b>Here is how it works:</b></p>
<dl>
<dt><big>1</big>st request Feb/10/2008 12:00:00</dt>
<dd>
<p>Googlebot requests /some-page.php from your server. Since Google has just discovered your page, there are no unusual request headers, just a plain GET.</p>
<p>You create the page from a database record which was modified on Feb/09/2008 10:00:00. Your server sends Googlebot the full page (5k) with an <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.29">HTTP header</a> <code><br />
Date: Sun, 10 Feb 2008 12:00:00 GMT<br />
<b>Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT</b></code><br />
(let&#8217;s assume your server is located in Greenwich, UK), the HTTP response code is <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.1">200</a> (OK).</p>
<p>Bandwidth used: 5 kilobytes for the page contents plus less than 500 bytes for the HTTP header.</p>
</dd>
<dt><big>2</big>nd request Feb/17/2008 12:00:00</dt>
<dd>
<p>Googlebot found interesting links pointing to your page, so it requests /some-page.php again to check for updates. Since Google already knows the resource, Googlebot requests it with an additional <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25">HTTP header</a> <code><br />
<b>If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT</b></code><br />
where the date and time is taken from the Last-Modified header you&#8217;ve sent in your response to the previous request.</p>
<p>You didn&#8217;t change the page&#8217;s record in the database, hence there&#8217;s no need to send the full page again. Your Web server sends Googlebot just an <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5">HTTP header</a> <code><br />
Date: Sun, 17 Feb 2008 12:00:00 GMT<br />
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT</code><br />
The HTTP response code is <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5">304</a> (Not Modified). (Your Web server can suppress the Last-Modified header, because the requestor has this timestamp already.)</p>
<p>Bandwidth used: Less than 500 bytes for the HTTP header.</p>
</dd>
<dt><big>3</big>rd request Feb/24/2008 12:00:00</dt>
<dd>
<p>Googlebot can&#8217;t resist recrawling /some-page.php, again using the <code><br />
<b>If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT</b></code><br />
header.</p>
<p>You&#8217;ve updated the database on Feb/23/2008 09:00:00 adding a few paragraphs to the article, thus you send Googlebot the full page (now 7k) with this HTTP header <code><br />
Date: Sun, 24 Feb 2008 12:00:00 GMT<br />
<b>Last-Modified: Sat, 23 Feb 2008 09:00:00 GMT</b></code><br />
and an HTTP response code <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.2.1">200</a> (OK).</p>
<p>Bandwidth used: 7 kilobytes for the page contents plus less than 500 bytes for the HTTP header.</p>
</dd>
<dt><big>F</big>urther requests</dt>
<dd>
<p>Provided you don&#8217;t change the contents again, all further chats between Googlebot and your Web server regarding /some-page.php will burn less than 500 bytes of your bandwidth each. Say Googlebot requests this page weekly, that&#8217;s 52 times 7k, or roughly 360k of saved bandwidth annually for this single page. You do the math for your whole site. Even with a medium-sized Web site you most likely want to implement proper caching, right? </p>
</dd>
</dl>
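<p>The three requests above boil down to one comparison. Here is a minimal sketch of that decision in plain JavaScript, independent of any Web server. <code>answerConditionalGet()</code> is a made-up name, and treating an unparsable If-Modified-Since date like a plain GET is an assumption of this sketch, not a rule taken from the walkthrough.</p>

```javascript
// Hypothetical sketch of the server-side decision for a conditional GET.
// ifModifiedSince: value of the If-Modified-Since request header, or null.
// lastModifiedMs:  the page's last modification as Unix time in milliseconds.
function answerConditionalGet(ifModifiedSince, lastModifiedMs) {
    const headers = { "Last-Modified": new Date(lastModifiedMs).toUTCString() };
    const sinceMs = ifModifiedSince === null ? NaN : Date.parse(ifModifiedSince);
    // HTTP dates have a resolution of one second, so compare whole seconds.
    const unchanged = !Number.isNaN(sinceMs) &&
        Math.floor(lastModifiedMs / 1000) <= Math.floor(sinceMs / 1000);
    return unchanged
        ? { status: 304, headers, sendBody: false }
        : { status: 200, headers, sendBody: true };
}

const firstEdit = Date.UTC(2008, 1, 9, 10, 0, 0);   // Feb/09/2008 10:00:00 GMT
const secondEdit = Date.UTC(2008, 1, 23, 9, 0, 0);  // Feb/23/2008 09:00:00 GMT
const since = "Sat, 09 Feb 2008 10:00:00 GMT";

console.log(answerConditionalGet(null, firstEdit).status);   // 1st request
console.log(answerConditionalGet(since, firstEdit).status);  // 2nd request
console.log(answerConditionalGet(since, secondEdit).status); // 3rd request
```

<p>With the dates from the walkthrough this yields 200, 304 and 200. Only the first and the third response carry the page contents.</p>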
<p>Not only do Webmasters love conditional GET requests that save bandwidth costs and processing time, search engines aren&#8217;t keen on useless data transfers either. So let&#8217;s see how you could respond efficiently to conditional GET requests from search engines. Apache handles caching of static files (e.g. .txt or .html files you upload with FTP) differently from dynamic contents (script outputs with or without a query string in the URI).</p>
<h3>Static files</h3>
<p>Fortunately, Apache comes with native support of the Last-Modified / If-Modified-Since / Not-Modified functionality. That means that crawlers and your Web server don&#8217;t produce too much network traffic when a requested <em>static file</em>&nbsp; didn&#8217;t change since the last crawl.</p>
<p>You can <a href="http://www.microsoft.com/search/Tools/">test your Web server&#8217;s conditional GET support</a> with your robots.txt, or, <a href="http://sebastians-pamphlets.com/cloak-the-hell-out-of-your-robots-txt/">if even your robots.txt is a script</a>, create a tiny HTML page with a text editor and upload it via FTP. Another neat tool to check HTTP headers is the <a href="http://livehttpheaders.mozdev.org/">Live Headers Extension for Firefox</a> (bear in mind that testing crawler behavior with Web browsers is error-prone by design).</p>
<p>If your second request of an unchanged static file results in a 200 HTTP response code, instead of a 304, call your hosting service. If it works and you&#8217;ve only static pages, then <a href="http://del.icio.us/post?url=http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/&amp;title=Save%20bandwidth%20costs:%20Dynamic%20pages%20can%20support%20If-Modified-Since%20too">bookmark this article</a> and move on.</p>
<h3>Dynamic contents</h3>
<p>Everything you output with server-side scripts is dynamic content by definition, regardless of whether the URI has a query string or not. Even if you just read and print out a static file &#8211;that never changes&#8211; with PHP, Apache doesn&#8217;t add the Last-Modified header, and without it crawlers have nothing to send back in an If-Modified-Since header on further requests.</p>
<p><b>With dynamic content you can&#8217;t rely on Apache&#8217;s caching support, you must do it yourself.</b></p>
<p>The first step is figuring out where your CMS or eCommerce software hides the timestamps telling you the date and time of a page&#8217;s last modification. Usually a script pulls its stuff from different database tables, hence a page contains more than one area, or block, of dynamic contents. Every block might have a different last-modified timestamp, but not every block is important enough to serve as the page&#8217;s determinant last-modified date. The same goes for templates. Most template tweaks shouldn&#8217;t trigger a full blown recrawl, but some do, for example a new address or phone number if such information is present on every page.</p>
<p>For example a blog has posts, pages, comments, categories and other data sources that can change the sidebar&#8217;s contents quite frequently. On a page that outputs a single post or page, the last-modified date is determined by the post or its last comment, whichever is more recent. The main page&#8217;s last-modified date is the modified-timestamp of the most recent post, and the same goes for its paginated continuations. A category page&#8217;s last-modified date is determined by the category&#8217;s most recent post, and so on.</p>
<p>New posts can change outgoing links of older posts when you use plugins that list related posts and stuff like that. There are many more reasons why search engines should crawl older posts at least monthly or so. You might need a routine that bumps a blog page&#8217;s last-modified timestamp, for example when that date is more than 30 days or so in the past. Also, in some cases it could make sense to have a routine that can reset all timestamps reported as last-modified date for particular site areas, or even the whole site.</p>
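<p>A sketch of the rules from the last two paragraphs, in plain JavaScript rather than the PHP used further down: the last-modified date of a single post page is the later of the post&#8217;s own timestamp and its last comment, and a stale timestamp gets bumped so that crawlers refetch the page once in a while. The function name, the field names and the 30 days default are illustrative assumptions. Advancing the date in whole 30-day steps is just one possible choice; it keeps the reported date stable between bumps instead of changing it on every request.</p>

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical sketch: effective last-modified time (ms) of a single post page.
// post.modifiedMs:    last edit of the post
// post.lastCommentMs: most recent comment, or null if there are no comments
function pageLastModified(post, nowMs, maxAgeDays = 30) {
    let lastModified = post.modifiedMs;
    if (post.lastCommentMs !== null && post.lastCommentMs > lastModified) {
        lastModified = post.lastCommentMs;
    }
    // Assumption: advance stale timestamps in whole maxAgeDays steps, so the
    // reported date changes once per period instead of on every request.
    const periodMs = maxAgeDays * DAY_MS;
    const periodsPassed = Math.floor((nowMs - lastModified) / periodMs);
    if (periodsPassed > 0) {
        lastModified += periodsPassed * periodMs;
    }
    return lastModified;
}

const now = Date.UTC(2008, 1, 19);
const freshPost = { modifiedMs: Date.UTC(2008, 1, 1), lastCommentMs: Date.UTC(2008, 1, 10) };
const stalePost = { modifiedMs: Date.UTC(2007, 10, 1), lastCommentMs: null };

console.log(new Date(pageLastModified(freshPost, now)).toUTCString());
console.log(new Date(pageLastModified(stalePost, now)).toUTCString());
```

<p>The fresh post reports the date of its last comment. The stale post, last edited 110 days before the assumed &#8220;now&#8221;, reports a date 90 days after its last edit, so a crawler sees a change roughly once a month.</p>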
<p class="excursus">If your software doesn&#8217;t populate last-modified attributes on changes of all entities, then take the opportunity to consider database triggers, stored procedures, or changes to your data access layer. Bear in mind that not all changes of a record must trigger a crawler cache reset. For example a table storing textual contents like articles or product descriptions usually has a number of attributes that don&#8217;t affect crawling, thus it should have an attribute <em>last updated</em>&nbsp; that&#8217;s changeable in the UI and serves as last-modified date in your crawler cache control (instead of the timestamp that&#8217;s changed automatically even on minor updates of attributes which are meaningless for HTML outputs).</p>
<h3>Handling Last-Modified, If-Modified-Since, and Not-Modified HTTP headers with PHP/Apache</h3>
<p>Below I provide example PHP code I&#8217;ve thrown together after midnight in a sleepless night, doped with painkillers. It doesn&#8217;t run on a production system, but it should get you started. Adapt it to your needs and make sure you test your stuff intensively. As always, my stuff comes <em>as is</em>&nbsp; without any guarantees. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>First <a onclick="showContent('php-functions-conditional-gets'); return false;">grab a couple helpers</a> and put them in an include file you&#8217;ve available in all scripts. Since we deal with HTTP headers, you must not output anything before the logic that deals with conditional search engine requests, not even a single white space character, HTML <code>DOCTYPE</code> declaration &#8230;<br />
<b id="grab-php-functions-conditional-gets"><a onclick="showContent('php-functions-conditional-gets'); return false;">View</a>|<a onclick="hideContent('php-functions-conditional-gets'); return false;">hide</a> PHP code.</b> (If you&#8217;ve disabled JavaScript you can&#8217;t grab the PHP source code!)</p>
<p title="PHP functions you need to deal with conditional GET requests"><code id="php-functions-conditional-gets" style="display:none;"><b>&lt;?php<br />
&nbsp;<br />
function unixTimestamp2HttpDate ($timestamp) {<br />
    // converts a Unix timestamp to an HTTP date<br />
    $httpDate = @gmdate(&quot;D, d M Y H:i:s&quot;, $timestamp);<br />
    if ($httpDate === FALSE) {<br />
        return FALSE; // invalid input<br />
    }<br />
    return $httpDate .&quot; GMT&quot;;<br />
} // end function unixTimestamp2HttpDate<br />
&nbsp;<br />
function unixTimestamp2MySqlDatetime ($timestamp) {<br />
    // converts a Unix timestamp to a MySQL datetime<br />
    $mySqlDatetime = @date(&quot;Y-m-d H:i:s&quot;, $timestamp);<br />
    return $mySqlDatetime;<br />
} // end function unixTimestamp2MySqlDatetime<br />
&nbsp;<br />
function date2UnixTimestamp ($date) {<br />
    // converts any US English date format to a Unix timestamp<br />
    $timestamp = @strtotime($date);<br />
    if ($timestamp == -1 || $timestamp === FALSE) {<br />
        return FALSE;<br />
    }<br />
    return $timestamp;<br />
} // end function date2UnixTimestamp<br />
&nbsp;<br />
function makeLastModifiedTimestamp ($timestamp) {<br />
    // returns a quite safe last-modified-timestamp or the current time<br />
    $hour = 60 * 60;<br />
    // consider unsynchronized clocks, or<br />
    // local time given as GMT and crap like that:<br />
    $maxDeviant = 13 * $hour;<br />
    $now = @strtotime(&quot;now&quot;);<br />
    $returnTs = $timestamp + $maxDeviant;<br />
    if (intval($returnTs) &gt; intval($now)) {<br />
        return ($now - 5);<br />
    }<br />
    return $returnTs;<br />
} // end function makeLastModifiedTimestamp<br />
&nbsp;<br />
function getHttpRequestHeaders () {<br />
    // returns all HTTP request headers or FALSE<br />
    $headersArr = array();<br />
    if (function_exists(&quot;apache_request_headers&quot;)) {<br />
        $headersArr = apache_request_headers();<br />
    }<br />
    else {<br />
        if (function_exists(&quot;getallheaders&quot;)) {<br />
           $headersArr = getallheaders();<br />
        }<br />
    }<br />
    if (count($headersArr) &gt; 0)  {<br />
        return $headersArr;<br />
    }<br />
    return FALSE;<br />
} // end function getHttpRequestHeaders<br />
&nbsp;<br />
function getIfModifiedSince () {<br />
    // returns If-Modified-Since as Unix Timestamp or FALSE<br />
    GLOBAL $_SERVER;<br />
    $headersArr = getHttpRequestHeaders();<br />
    if ($headersArr !== FALSE &#038;&#038; count($headersArr) &gt; 0)  {<br />
        foreach ($headersArr as $header =&gt; $value) {<br />
            if (strcasecmp($header, &quot;If-Modified-Since&quot;) == 0) {<br />
                $ifModSinceDate      = explode(&quot;;&quot;, $value);<br />
                $ifModifiedSinceDate = $ifModSinceDate[0];<br />
            }<br />
        }<br />
    }<br />
    if (!isset($ifModifiedSinceDate) &#038;&#038;<br />
         isset($_SERVER[&quot;HTTP_IF_MODIFIED_SINCE&quot;]) &#038;&#038;<br />
        !empty($_SERVER[&quot;HTTP_IF_MODIFIED_SINCE&quot;]) ) {<br />
        $ifModifiedSinceDate = $_SERVER[&quot;HTTP_IF_MODIFIED_SINCE&quot;];<br />
    }<br />
    if (!isset($ifModifiedSinceDate)) {<br />
        return FALSE;<br />
    }<br />
    return date2UnixTimestamp($ifModifiedSinceDate);<br />
} // end function getIfModifiedSince ()<br />
&nbsp;<br />
?&gt;   </b></code></p>
<p>In general, all user agents should support conditional GET requests, not only search engine crawlers. If you allow long lasting caching, which is fine with search engines that don&#8217;t need to crawl your latest Twitter message from your blog&#8217;s sidebar, you could leave your visitors with somewhat outdated pages if you serve them 304-Not-Modified responses too.</p>
<p>It might be a good idea to limit 304 responses to conditional GET requests from crawlers, when you don&#8217;t implement way shorter caching cycles for other user agents. The latter includes <a href="http://sebastians-pamphlets.com/referrer-spoofing-with-prefbar-341/">folks that spoof their user agent name</a> as well as scrapers trying to steal your stuff masked as a legit spider. To verify legit search engine crawlers that (should) support conditional GET requests (from <a href="http://www.google.com/support/webmasters/bin/topic.py?topic=8843" title="Google supports conditional GETs">Google</a>, <a href="http://help.yahoo.com/l/us/yahoo/search/webcrawler/" title="Yahoo supports conditional GETs">Yahoo</a>, <a href="http://blogs.msdn.com/webmaster/archive/2008/02/12/announcing-crawling-improvements-for-live-search.aspx" title="Live Search supports conditional GETs">MSN</a> and <a href="http://about.ask.com/en/docs/about/webmasters.shtml" title="Ask doesn't promise support of conditional GETs">Ask</a>) you can <a href="http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#grab-php-code-check-crawler">grab my crawler detection routines here</a>. Include them as well, then you can code stuff like that:</p>
<p><code style="margin-left:-0.5em;">$isSpiderUA    = checkCrawlerUA ();<br />
$isLegitSpider = checkCrawlerIP (__FILE__);<br />
if ($isSpiderUA &#038;&#038; !$isLegitSpider) {<br />
    @header(&quot;X-Note: Thou shalt not spoof&quot;, TRUE, 403);<br />
    exit;<br />
    // make sure your 403-Forbidden ErrorDocument directive in<br />
    // .htaccess points to a page that explains the issue!<br />
}<br />
if ($isLegitSpider) {<br />
    // insert your code dealing with conditional GET requests<br />
} </code></p>
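<p>If you can&#8217;t use the routines linked above, here is a minimal sketch of what a check like checkCrawlerIP() might do. It is an assumption, not the linked code: it resolves the requestor&#8217;s IP to a host name, checks the host name&#8217;s domain, and then resolves the host name back to confirm that it points to the same IP. The function name isVerifiedCrawlerIp(), the domain list and the two resolver parameters (there only to allow testing without DNS) are hypothetical.</p>
<p><code>// hypothetical helper: reverse DNS lookup plus forward confirmation<br />
function isVerifiedCrawlerIp($ip, $domains, $reverse = 'gethostbyaddr',<br />
                             $forward = 'gethostbynamel') {<br />
    $host = call_user_func($reverse, $ip);<br />
    // gethostbyaddr() returns the unmodified IP (or FALSE) on failure<br />
    if ($host === FALSE || $host === $ip) {<br />
        return FALSE;<br />
    }<br />
    $host = strtolower($host);<br />
    $domainOk = FALSE;<br />
    foreach ($domains as $domain) {<br />
        // the host name must end with &quot;.domain&quot;, e.g. &quot;.googlebot.com&quot;<br />
        $suffix = '.' . strtolower($domain);<br />
        if (substr($host, -strlen($suffix)) === $suffix) {<br />
            $domainOk = TRUE;<br />
            break;<br />
        }<br />
    }<br />
    if (!$domainOk) {<br />
        return FALSE;<br />
    }<br />
    // forward confirmation: the host name must resolve back to the IP<br />
    $ips = call_user_func($forward, $host);<br />
    return is_array($ips) &#038;&#038; in_array($ip, $ips, TRUE);<br />
} </code></p>
<p>With that you could set <code>$isLegitSpider = isVerifiedCrawlerIp($_SERVER['REMOTE_ADDR'], array('googlebot.com'));</code> and add the crawl domains of the other engines after looking them up in their Webmaster documentation. Note that gethostbynamel() returns IPv4 addresses only.</p>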
<p>Now that you&#8217;re sure that the requestor is a legit crawler from a major search engine, look at the HTTP request header it has submitted to your Web server.</p>
<p><code style="margin-left:-0.5em;">// lookup the HTTP request header for a possible conditional GET<br />
<b>$ifModifiedSinceTimestamp = getIfModifiedSince();</b><br />
// if the request is not conditional, don&#8217;t send a 304<br />
<b>$canSend304 = FALSE;<br />
if ($ifModifiedSinceTimestamp !== FALSE) {<br />
    $canSend304 = TRUE;</b><br />
    // Tells the requestor that you&#8217;ve recognized the conditional GET<br />
    $echoRequestHeader = &quot;X-Requested-If-modified-since: &quot;<br />
                         .unixTimestamp2HttpDate($ifModifiedSinceTimestamp);<br />
    @header($echoRequestHeader, TRUE);<br />
<b>}</b> </code></p>
<p>You don&#8217;t need to echo the If-Modified-Since HTTP-date in the response header, but this custom header makes testing easier.</p>
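<p>The two helper functions used above aren&#8217;t listed in this post. In case you need to roll your own, here is a minimal sketch, not necessarily identical to the routines I use. It assumes that PHP exposes the request header as $_SERVER['HTTP_IF_MODIFIED_SINCE']; the optional $server parameter is a hypothetical addition that makes the function testable without a real request.</p>
<p><code>// sketch: returns the If-Modified-Since date as Unix timestamp,<br />
// or FALSE if the request isn't conditional (or the date is garbage)<br />
function getIfModifiedSince($server = NULL) {<br />
    if ($server === NULL) {<br />
        $server = $_SERVER;<br />
    }<br />
    if (empty($server['HTTP_IF_MODIFIED_SINCE'])) {<br />
        return FALSE;<br />
    }<br />
    // some old clients append &quot;; length=...&quot; to the date<br />
    $parts = explode(';', $server['HTTP_IF_MODIFIED_SINCE']);<br />
    return strtotime(trim($parts[0]));<br />
}<br />
// sketch: HTTP-dates are always expressed in GMT<br />
function unixTimestamp2HttpDate($timestamp) {<br />
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';<br />
} </code></p>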
<p>Next get the page&#8217;s actual last-modified date/time. Here is an (incomplete) code sample for a WordPress single post page.</p>
<p><code>// select the requested post's comment_count, post_modified and<br />
&nbsp;// post_date values, then:<br />
if ($wp_post_modified) {<br />
    $lastModified = date2UnixTimestamp($wp_post_modified);<br />
}<br />
else {<br />
    $lastModified = date2UnixTimestamp($wp_post_date);<br />
}<br />
if (intval($wp_comment_count) > 0) {<br />
    // select last comment from the WordPress database, then:<br />
    $lastCommentTimestamp = date2UnixTimestamp($wp_comment_date);<br />
    if ($lastCommentTimestamp > $lastModified) {<br />
        $lastModified = $lastCommentTimestamp;<br />
    }<br />
} </code></p>
<p>The date2UnixTimestamp() function accepts MySQL datetime values as valid input. If you need to (re)write last-modified dates to a MySQL database, convert the Unix timestamps to MySQL datetime values with unixTimestamp2MySqlDatetime().</p>
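<p>If you don&#8217;t have these conversion functions at hand, they might look like the sketch below. It assumes that the MySQL datetime values are stored in the same time zone PHP runs in; if you read the *_gmt columns from WordPress instead, you&#8217;d have to convert with GMT in mind.</p>
<p><code>// sketch: MySQL datetime (&quot;2008-06-30 20:12:40&quot;) to Unix timestamp,<br />
// returns FALSE if the value can't be parsed<br />
function date2UnixTimestamp($mysqlDatetime) {<br />
    return strtotime($mysqlDatetime);<br />
}<br />
// sketch: Unix timestamp back to a MySQL datetime value<br />
function unixTimestamp2MySqlDatetime($timestamp) {<br />
    return date('Y-m-d H:i:s', $timestamp);<br />
} </code></p>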
<p>Your server&#8217;s clock isn&#8217;t necessarily synchronized with all search engines out there. To cover possible gaps you can use a last-modified timestamp that&#8217;s a little bit fresher than the actual last-modified date. In this example the timestamp reported to the crawler is last-modified + 13 hours; you can change the offset in makeLastModifiedTimestamp().<code><br />
<b>$lastModifiedTimestamp = makeLastModifiedTimestamp($lastModified); </b></code></p>
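<p>A minimal sketch of such a function, assuming Unix timestamps in and out. The second parameter is a hypothetical addition, so that you can play with the offset without editing the function:</p>
<p><code>// sketch: reports a timestamp slightly fresher than the real one<br />
function makeLastModifiedTimestamp($lastModified, $offsetHours = 13) {<br />
    return $lastModified + ($offsetHours * 60 * 60);<br />
} </code></p>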
<p style="display:none;">If you compare the timestamps later on, and the request isn&#8217;t conditional, don&#8217;t run into the 304 routine.<code><br />
if ($ifModifiedSinceTimestamp === FALSE) {<br />
    // make things equal if the request isn't conditional<br />
    $ifModifiedSinceTimestamp = $lastModifiedTimestamp;<br />
} </code></p>
<p>You may want to allow a full fetch if the requestor&#8217;s timestamp is ancient, in this example older than one month. <code><br />
$tooOld = @strtotime(&quot;now&quot;) - (31 * 24 * 60 * 60);<br />
if ($ifModifiedSinceTimestamp < $tooOld) {<br />
    $lastModifiedTimestamp    = @strtotime(&quot;now&quot;);<br />
    $ifModifiedSinceTimestamp = @strtotime(&quot;now&quot;) - (1 * 24 * 60 * 60);<br />
} </code><br />
Setting the last-modified attribute to yesterday schedules the next full crawl after this fetch in 30 days (or later, depending on the actual crawl frequency).</p>
<p>Finally, respond with 304-Not-Modified if the page hasn&#8217;t changed significantly since the date/time given in the crawler&#8217;s If-Modified-Since header. Otherwise send a Last-Modified header with a 200 HTTP response code, allowing the crawler to fetch the page contents.  <code><b><br />
$lastModifiedHeader = &quot;Last-Modified: &quot; .unixTimestamp2HttpDate($lastModifiedTimestamp);<br />
if ($lastModifiedTimestamp < $ifModifiedSinceTimestamp &#038;&#038;<br />
    $canSend304) {<br />
    @header($lastModifiedHeader, TRUE, 304);<br />
    exit;<br />
}<br />
else {<br />
    @header($lastModifiedHeader, TRUE);<br />
} </b></code></p>
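<p>To check the decision logic without a Web server, the steps above can be condensed into one function without side effects. This is only a sketch: the function name and the idea of passing the current time as a parameter are made up for this example, while the rules (no 304 for unconditional requests, a full fetch for timestamps older than 31 days, 304 only if Last-Modified is older than If-Modified-Since) are the ones from the snippets above.</p>
<p><code>// hypothetical: returns the HTTP response code for the given<br />
// Unix timestamps; $ifModifiedSinceTimestamp is FALSE if the<br />
// request isn't conditional<br />
function conditionalGetStatus($lastModifiedTimestamp,<br />
                              $ifModifiedSinceTimestamp, $now) {<br />
    if ($ifModifiedSinceTimestamp === FALSE) {<br />
        return 200;   // not a conditional GET<br />
    }<br />
    $tooOld = $now - (31 * 24 * 60 * 60);<br />
    if ($ifModifiedSinceTimestamp &lt; $tooOld) {<br />
        return 200;   // ancient timestamp, allow a full fetch<br />
    }<br />
    if ($lastModifiedTimestamp &lt; $ifModifiedSinceTimestamp) {<br />
        return 304;   // not modified since the crawler's last fetch<br />
    }<br />
    return 200;<br />
} </code></p>
<p>For example, with <code>$day = 24 * 60 * 60</code> a page last modified five days ago and requested with an If-Modified-Since date of yesterday gets a 304, whereas the same page requested with a 60 days old If-Modified-Since date gets a 200.</p>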
<p>When you&#8217;re testing your version of this script with a browser, it will send a standard HTTP request, and your server will return a 200-OK. From your server&#8217;s response your browser should recognize the &#8220;Last-Modified&#8221; header, so when you reload the page the browser should send an &#8220;If-Modified-Since&#8221; header and you should get the 304 response code <b>if</b> <em>Last-Modified <b>&lt;</b> If-Modified-Since</em>. However, judging from my experience, such browser-based tests of crawler behavior, or rather of responses to crawler requests, aren&#8217;t reliable.</p>
<p>Test it with <a href="http://www.microsoft.com/search/Tools/">this MS tool</a> instead. I&#8217;ve played with it for a while and it works great. With the PHP code above I&#8217;ve created a <a href="http://sebastians-pamphlets.com/tools/last-modified-yesterday.php">200/304 test page</a><br />
<code> http://sebastians-pamphlets.com/tools/last-modified-yesterday.php </code><br />
that sends a &#8220;Last-Modified: Yesterday&#8221; response header, and should return a <b>304-Not Modified</b> HTTP response code when you request it with an &#8220;If-Modified-Since: Today+&#8221; header, otherwise it should respond with <b>200-OK</b> (<a href="http://sebastians-pamphlets.com/tools/last-modified-yesterday.php?no304=TRUE">this version</a> returns 200-OK only but tells when it <em>would</em>&nbsp; respond with a 304). You can use this URI with the MS-tool linked above to test HTTP requests with different If-Modified-Since headers.</p>
<p><b>Have fun and paypal me 50% of your savings.</b> <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The hacker tool MSN-LiveSearch is responsible for brute force attacks</title>
		<link>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/</link>
		<comments>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 15:36:08 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Testing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/</guid>
		<description><![CDATA[A while ago I staged a public SEO contest, asking whether the 401 HTTP response code prevents search engine indexing or not. 
Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 [...]]]></description>
			<content:encoded><![CDATA[<p><img  src="http://sebastians-pamphlets.com/img/posts/401-private-property-keep-out.png" width="200" height="133" align="right" style="margin-left:4px;" alt="401 = Private Property, keep out!" title="401 = Private Property, keep out!" />A while ago I staged a public <a href="http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/">SEO contest</a>, asking whether the 401 HTTP response code prevents search engine indexing or not. </p>
<p>Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 HTTP response code that translates to a polite &#8220;Authorization Required&#8221;, meaning &#8220;Forbidden unless you provide valid authorization&#8221;. </p>
<p>Experience of life and common sense tell search engines that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters/site owners do not point their users to protected content. </p>
<p>Also, that makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly &#8220;Enter user and password in order to access [title of the protected area]&#8221;, that resolves to a harsh error message because the searcher can&#8217;t provide such information, and usually can&#8217;t even sign up from the 401 error page<sup><a href="#401-error-document-footnote">1</a></sup>. </p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/evil-use-of-search-results.png" width="200" height="255" align="right" style="margin-left:4px;" alt="Evil use of search results" title="The evil variant of search results " />Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example <code>/members/</code>. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web. </p>
<p>Developing a script that searches for a string like <code>/members/</code> in URLs and then &#8220;tests&#8221; the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker&#8217;s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DoS-like server attacks.</p>
<p>Ok, let&#8217;s look at the test results. All search engines have cast their votes now. <b>Here are the winners:</b> </p>
<h3>Google <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Once my test was out, <a href="http://mattcutts.com/blog/">Matt Cutts</a> from <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40207">Google</a> researched the question and told me:</p>
<blockquote><p>My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won&#8217;t be indexed even as a reference, so .htacess password-protected directories shouldn&#8217;t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn&#8217;t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.</p>
</blockquote>
<p>Well, that&#8217;s exactly the expected behavior, and I wasn&#8217;t surprised that my test results confirm Matt&#8217;s statement. Thanks to Google&#8217;s BlitzIndexing&trade; Ms. Googlebot spotted the 401 so fast that the URL never showed up on Google&#8217;s SERPs. Google reports the <a href="http://sebastians-pamphlets.com/porn/">protected URL</a> in my <a href="http://google.com/webmasters/tols/">Webmaster Console</a> account for this blog as not indexable.</p>
<h3>Yahoo <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Yahoo&#8217;s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that&#8217;s going to change if <a href="http://searchengineland.com/080201-064343.php">M$ buys Yahoo</a>. </p>
<h3>Ask <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Ask&#8217;s crawler isn&#8217;t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.</p>
<p><b style="font-size:110%;">And here is the ultimate loser:</b></p>
<h3>MSN LiveSearch <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </h3>
<p>Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker&#8217;s toolbox:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/msn-indexes-401-protected-urls.png" width="467" height="223" align="center" style="" alt="MSN LiveSearch indexes password protected URLs" title="MSN LiveSearch indexes password protected URLs" /></p>
<p>As if indexing references to password protected URLs weren&#8217;t crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolutely no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The <a href="http://sitemaps.org/">sitemap protocol</a>, which MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/msn-lists-unlinked-porn-sitemap-file-2008-01.png" width="460" height="54" align="center" style="" alt="MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)" title="MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)" /><br />
<img  src="http://sebastians-pamphlets.com/img/posts/msn-indexes-unlinked-porn-sitemap-file-2008-01.png" width="460" height="58" align="center" style="" alt="MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)" title="MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)" /></p>
<p>All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask have indexed the sitemap file (they never index submitted sitemaps that have no inbound links by the way) or its protected URL.</p>
<h3>Summary</h3>
<p><b style="font-size:110%;">All major search engines except MSN respect the 401 barrier.</b></p>
<p>Since <a href="http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/">MSN LiveSearch is well known for spamming</a>, it&#8217;s not a big surprise that they support hackers, scrapers and other content thieves. </p>
<p>Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can&#8217;t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn&#8217;t exist in most niches), and so on. Once pointed to such crap, they don&#8217;t even fix the simplest bugs in a timely manner. I mean, not complying with the HTTP/1.1 protocol from the last century is evidence of incapacity, and that&#8217;s just one example.</p>
<p>&nbsp;</p>
<p><b>Update Feb/06/2008:</b> Last night I received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, <a href="http://nathanbuggia.com/">Nathan Buggia</a>! I&#8217;m still curious how MSN Live Search will handle sitemap files in the future.</p>
<p>&nbsp;</p>
<hr width="128" color="silver" align="center" />
<p id="401-error-document-footnote"><sup>1</sup>&nbsp;<small>Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there&#8217;s no 401 error document. Please note that you shouldn&#8217;t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a <a href="http://sebastians-pamphlets.com/error401.html">static 401 error page</a> that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you <b>must not</b> use absolute URLs (http://&#8230; or https://&#8230;) in the ErrorDocument 401 directive, and that on the error page you <b>must</b> use absolute URLs for CSS, images, links and whatnot because relative URIs don&#8217;t work there!</small></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google removes the #6 penalty/filter/glitch</title>
		<link>http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/</link>
		<comments>http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/#comments</comments>
		<pubDate>Tue, 29 Jan 2008 06:50:51 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/</guid>
		<description><![CDATA[After the great #6 Penalty SEO Panel Google&#8217;s head of the webspam dept. Matt Cutts dug out a misbehaving algo and sent it back to the developers. Two hours ago he stated: 
When Barry asked me about &#8220;position 6&#8221; in late December, I said that I didn&#8217;t know of anything that would cause that. But [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/google-pos-six-penalty-removed.png" width="200" height="299" align="right" style="margin-left:4px;" alt="Google removed the position six penalty" title="Google takes back the #6 filter/glitch/penalty"  />After the great <a href="http://sebastians-pamphlets.com/the-fictive-numeric-google-penalty-seo-hero-panel/">#6 Penalty SEO Panel</a> Google&#8217;s head of the webspam dept. <a href="http://mattcutts.com/blog/">Matt Cutts</a> dug out a misbehaving algo and sent it back to the developers. Two hours ago he <a href="http://sphinn.com/story/24687#c29022">stated</a>: </p>
<blockquote><p>When Barry <a href="http://www.seroundtable.com/archives/015799.html">asked</a> me about &#8220;position 6&#8221; in late December, I <a href="http://www.seroundtable.com/archives/015799.html#comment-673474">said</a> that I didn&#8217;t know of anything that would cause that. But about a week or so after that, my attention was brought to something that could exhibit that behavior.</p>
<p>We&#8217;re in the process of changing the behavior; I think the change is live at some datacenters already and will be live at most data centers in the next few weeks.</p>
</blockquote>
<p>&nbsp;</p>
<p>So everything is fine now. Matt penalizes the position-six software glitch, and lost top positions will revert to their former rankings in a while. Well, not really. Nobody will compensate for income losses, nor for the time Webmasters spent on forums discussing a suspected penalty that actually was a bug or a weird side effect. However, kudos to Google for listening to concerns, tracking down and fixing the algo. And thanks for the update, Matt.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/google-removes-position-six-glitch-filter-penalty/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Avoiding the well known #4 SERP-hero-penalty &#8230;</title>
		<link>http://sebastians-pamphlets.com/the-fictive-numeric-google-penalty-seo-hero-panel/</link>
		<comments>http://sebastians-pamphlets.com/the-fictive-numeric-google-penalty-seo-hero-panel/#comments</comments>
		<pubDate>Fri, 25 Jan 2008 14:05:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Ego Food]]></category>

		<category><![CDATA[Folks]]></category>

		<category><![CDATA[Fun]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/the-fictive-numeric-google-penalty-seo-hero-panel/</guid>
		<description><![CDATA[&#8230; I just have to link to North South Media&#8217;s neat collection of Search Action Figures. 
Paul pretty much dislikes folks who don&#8217;t link to him, so Danny Sullivan and Rand Fishkin are well advised to drop a link every now and then, and David Naylor had better give him an interview slot asap.  
Google&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.northsouthmedia.co.uk/wordpress/index.php/2008/01/25/what-if-seo-search-had-action-figures/"><img  src="http://sebastians-pamphlets.com/img/blog/seb-the-red-claw.png" border="0" width="250" height="161" align="right" style="margin-left:4px;" alt="Seb the red claw" title="Seb the red claw links to Paul" /></a>&#8230; I just have to link to <a href="http://www.northsouthmedia.co.uk/wordpress/">North South Media</a>&#8217;s neat collection of <a href="http://www.northsouthmedia.co.uk/wordpress/index.php/2008/01/25/what-if-seo-search-had-action-figures/">Search Action Figures</a>. </p>
<p>Paul pretty much dislikes folks who don&#8217;t link to him, so <a href="http://daggle.com/">Danny Sullivan</a> and <a href="http://seomoz.org/">Rand Fishkin</a> are well advised to drop a link every now and then, and <a href="http://www.davidnaylor.co.uk/">David Naylor</a> had better give him an interview slot asap. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<h3>Google&#8217;s numbered &#8220;penalties&#8221;, esp. #6</h3>
<p>As for <a href="http://sphinn.com/story/24626#comments">numeric penalties</a> in general &#8230; <code><b>repeat(&quot;Sigh&quot;, <big>&infin;</big>)</b></code> &#8230; enjoy this <a href="http://sphinn.com/">brains trust</a> moderated by <a href="http://aimclear.com/">Marty Weintraub</a> (unauthorized):</p>
<p><b><a href="http://sphinn.com/story/24626" style="font-weight:bold;">Marty</a>:</b> Folks, please welcome <a href="http://www.seobook.com/how-i-got-my-google-ranking-6-filter-removed">Aaron Wall</a>, who recently got his #6 penalty removed!</p>
<p><b>Audience:</b> <code><strike>clap(26)</strike> sphinn(26)</code></p>
<p><b><a href="http://sphinn.com/story/24626#c27999" style="font-weight:bold;">The Gypsy</a>:</b> Sorry Marty but come on&#8230; this is complete BS and there is NO freakin #6 filter just like the magical minus 90&#8230;900 bla bla bla. These anomalies NEVER have any real consensus on a large enough data set to even be considered a viable theory.</p>
<p><b><a href="http://sphinn.com/story/24626#c28002" style="font-weight:bold;">A Red Crab</a>:</b> As long as <a href="http://seobythesea.com/">Bill</a> can&#8217;t find a plus|minus-n-raise|penalty patent, or at least a white paper or so leaked out from Google, or for all I care a study that provides proof instead of weird assumptions based on claims of webmasters jumping on todays popular WMW band wagon that aren&#8217;t plausible nor verifiable, such beasts don&#8217;t exist. There are unexplained effects that might look like a pattern, but in most cases it makes no sense to gather a few examples coming with similarities because we&#8217;ll never reach the critical mass of anomalies to discuss a theory worth more than a thumbs-down click.</p>
<p><b><a href="http://sphinn.com/story/24626#c28020" style="font-weight:bold;">Marty</a>:</b> Maybe Aaron is joking. Maybe he thinks he has invented the next light bulb.</p>
<p><b><a href="http://sphinn.com/story/24626#c28022" style="font-weight:bold;">Gamermk</a>:</b> Aaron is grasping at straws on this one. </p>
<p><b><a href="http://sphinn.com/story/24626#c28028" style="font-weight:bold;">Barry Welford</a>:</b> I would like this topic to be seen by many.</p>
<p><b>Audience:</b> <code><strike>clap(29)</strike> sphinn(29)</code></p>
<p><b><a href="http://sphinn.com/story/24626#c28050" style="font-weight:bold;">The Gypsy</a>:</b> It is just some people that have DECIDED on an end result and trying to make various hypothesis fit the situation (you know, like tobacco lobby scientists)&#8230; this is simply bad form IMO.</p>
<p><b><a href="http://sphinn.com/story/24626#c28096" style="font-weight:bold;">Danny Sullivan</a>:</b> Well, I&#8217;ve personally seen this weirdness. Pages that I absolutely thought &#8220;what on earth is that doing at six&#8221; rather than at the top of the page. Not four, not seven &#8212; six. It was freaking weird for several different searches. Nothing competitive, either.</p>
<p>I don&#8217;t know that sixth was actually some magic number. Personally, I&#8217;ve felt like there&#8217;s some glitch or problem with Google&#8217;s ranking that has prevented the most authorative page in some instances from being at the top. But something was going on.</p>
<p>Remember, there&#8217;s no sandbox, either. We got that for months and months, until eventually it was acknowledge that there were a range of filters that might produce a &#8220;sandbox like&#8221; effect.</p>
<p>The biggest problem I find with these types of theories is they often start with a specific example, sometimes that can be replicated, then they become a catch-all. Not ranking. Oh, it&#8217;s the sandbox. Well no &#8212; not if you were an established site, it wasn&#8217;t. The sandbox was typicaly something that hit brand new sites. But it became a common excuse for anything, producing confusion.</p>
<p><b><a href="http://sphinn.com/story/24626#c28099" style="font-weight:bold;">Jim Boykin</a>:</b> I&#8217;ll jump in and say I truely believe in the 6 filter. I&#8217;ve seen it. I wouldn&#8217;t have believed it if I hadn&#8217;t seen it happen to a few sites.</p>
<p><b>Audience:</b> <code><strike>clap(31)</strike> sphinn(31)</code></p>
<p><b><a href="http://sphinn.com/story/24626#c28101" style="font-weight:bold;">A Red Crab</a>:</b> Such  terms tend to become a life of their own, IOW an excuse for nearly every way a Webmaster can fuck up rankings. Of course Google&#8217;s query engine has thresholds (yellow cards or whatever they call them) that don&#8217;t allow some sites to rank above a particular position, but that&#8217;s a symtom that doesn&#8217;t allow back-references to a particular cause, or causes. It&#8217;s speculation as long as we don&#8217;t know more.</p>
<p><b><a href="http://sphinn.com/story/24626#c28105" style="font-weight:bold;">IncrediBill</a>:</b> I definitely believe it&#8217;s some sort of filter or algo tweak but it&#8217;s certainly not a penalty which is why I scoff at calling it such. One morning you wake up and Matt has turned all the dials to the left and suddenly some criteria bumps you UP or DOWN. Sites have been going up and down in Google SERPs for years, nothing new or shocking about that and this too will have some obvious cause and effect that could probably be identified if people weren&#8217;t using the shotgun approach at changing their site</p>
<p><b><a href="http://sphinn.com/story/24626#c28107" style="font-weight:bold;">G1smd</a>:</b> By the time anyone works anything out with Google,  they will already be in the process of moving the goalposts to another country.</p>
<p><b><a href="http://sphinn.com/story/24626#c28286" style="font-weight:bold;">Slightly Shady SEO</a>:</b> The #6 filter is a fallacy.</p>
<p><b><a href="http://sphinn.com/story/24626#c28290" style="font-weight:bold;">Old School</a>:</b> It certainly occured but only affected certain sites.</p>
<p><b><a href="http://sphinn.com/story/24626#c28292" style="font-weight:bold;">Danny Sullivan</a>:</b> Perhaps it would have been better called a -5 penalty. Consider. Say Google for some reason sees a domain but decides good, but not sure if I trust it. Assign a -5 to it, and that might knock some things off the first page of results, right?</p>
<p>Look &#8212; it could all be coincidence, and it certainly might not necessarily be a penalty. But it was weird to see pages that for the life of me, I couldn&#8217;t understand why they wouldn&#8217;t be at 1, showing up at 6.</p>
<p><b><a href="http://sphinn.com/story/24626#c28329" style="font-weight:bold;">Slightly Shady SEO</a>:</b> That seems like a completely bizarre penalty. Not Google&#8217;s style. When they&#8217;ve penalized anything in the past, it hasn&#8217;t been a &#8220;well, I guess you can stay on the frontpage&#8221; penalty. It&#8217;s been a smackdown to prove a point.</p>
<p><b><a href="http://www.seroundtable.com/archives/015799.html#comment-673474" style="font-weight:bold;">Matt Cutts</a>:</b> Hmm. I&#8217;m not aware of anything that would exhibit that sort of behavior.</p>
<p><b>Audience:</b> Ugh &#8230; oohhhh &#8230; you weren&#8217;t aware of the sandbox, either!</p>
<p><b><a href="http://sphinn.com/story/24626#c28096" style="font-weight:bold;">Danny Sullivan</a>:</b> Remember, there&#8217;s no sandbox, either. We got that for months and months, until eventually it was acknowledge that there were a range of filters that might produce a &#8220;sandbox like&#8221; effect.</p>
<p><b>Audience:</b> Bah, humbug! We so want to believe in our lame excuses &#8230;</p>
<p><b><a href="http://www.webmasterworld.com/google/3535274.htm" style="font-weight:bold;">Tedster</a>:</b> I&#8217;m not happy with the current level of analysis, however, and definitely looking for more ideas.</p>
<p><b>Audience:</b> <code><strike>clap(40)</strike> sphinn(40)</code></p>
<hr color="silver" size="5" width="128" style="text-align:center;" />
<p>Of course the panel above is fictional, or rather assembled from snippets, some of which change their message when you read them in context. So please follow the links.</p>
<p>I wouldn&#8217;t go so far as to say there&#8217;s no such thing as a fair number of Web pages that deserve a #1 spot on Google&#8217;s SERPs but rank #6 for unknown reasons (perhaps link monkey business, staleness, PageRank flow in disarray, anchor text repetitions, &#8230;). There&#8217;s <a href="http://blog.searchenginewatch.com/blog/080125-071044">something</a> worth investigating.</p>
<p>However, I think that labelling a discussion of glitches, or maybe misbehaving filters, a &#8220;#6 penalty&#8221; on the basis of a way too tiny dataset leads to the <b>lame excuse for literally anything</b> phenomenon. </p>
<p>Folks who don&#8217;t follow the various threads closely enough to spot the highly speculative character of the beast will take it as fact and switch to hibernation mode instead of <a href="http://www.seobook.com/how-i-got-my-google-ranking-6-filter-removed">enhancing their stuff like Aaron did</a>. I can&#8217;t wait for the first &#8220;How to escape the Google -5 penalty&#8221; SEO tutorial telling the great unwashed that a &#8220;+5&#8221; revisit-after meta tag will heal it.</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/the-fictive-numeric-google-penalty-seo-hero-panel/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Getting URLs outta Google - the good, the popular, and the definitive way</title>
		<link>http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/</link>
		<comments>http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/#comments</comments>
		<pubDate>Mon, 14 Jan 2008 21:01:20 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[URL removal]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Duplicate Content]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/</guid>
		<description><![CDATA[There&#8217;s more and more robots.txt talk in the SEOsphere lately. That&#8217;s a good thing in my opinion, because the good old robots.txt&#8217;s power is underestimated. Unfortunately it&#8217;s quite often misused or even abused too, usually because folks don&#8217;t fully understand the REP (by following &#8220;advice&#8221; from forums instead of reading the real thing, or at [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/google/keep-out-google-1-2-3.png" width="247" height="164" align="right" style="margin-left:4px;" alt="Keep out Google" title="How to keep out Google" />There&#8217;s <a href="http://googlewebmastercentral.blogspot.com/2008/01/remove-your-content-from-google.html">more</a> and <a href="http://www.hobo-web.co.uk/seo-blog/index.php/i-robot-with-sebastianx-of-sebastians-pamphlets-robotstxt-help/">more</a> robots.txt talk in the SEOsphere lately. That&#8217;s a good thing in my opinion, because the good old <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">robots.txt&#8217;s power</a> is underestimated. Unfortunately it&#8217;s quite often misused or even abused too, usually because folks don&#8217;t fully understand the <acronym title="Robots Exclusion Protocol">REP</acronym> (by following &#8220;advice&#8221; from forums instead of reading the <a href="http://www.robotstxt.org/">real</a> <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40360&#038;query=robots.txt&#038;topic=&#038;type=">thing</a>, or at least <a href="http://sebastians-pamphlets.com/links/categories/?cat=crawler-directives">my stuff</a> <img src="http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif" width="15" height="15" style="margin-top:-6px;" />).  </p>
<p>I&#8217;d like to discuss, from three angles, what the REP can supposedly do to make sure that Google doesn&#8217;t index particular contents:</p>
<dl>
<dt><a href="#good-content-termination">The good way</a></dt>
<dd>If the major search engines supported <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">new robots.txt directives that Webmasters really need</a>, removing even huge chunks of content from Google&#8217;s SERPs &#8211;without collateral damage&#8211; via robots.txt would be a breeze.</dd>
<dt><a href="#popular-content-termination">The popular way</a></dt>
<dd>Shamelessly stealing Matt&#8217;s official advice [Source: <a href="http://googlewebmastercentral.blogspot.com/2008/01/remove-your-content-from-google.html">Remove your content from Google</a> by <a href="http://mattcutts.com/blog/">Matt Cutts</a>]. To obscure the blatant plagiarism, I&#8217;ll add a few thoughts.</dd>
<dt><a href="#definitive-content-termination">The definitive way</a></dt>
<dd>Of course that&#8217;s not the ultimate way, but that&#8217;s the way Google&#8217;s cookies crumble, currently. In other words: Google is working on a leaner approach, but that&#8217;s not yet announced, thus you can&#8217;t use it; you still have to jump through many hoops.</dd>
</dl>
<h3 id="good-content-termination">The good way</h3>
<p><b>Caution:</b> Don&#8217;t implement code from this section; the robots.txt directives discussed here are not (yet/fully) supported by search engines!</p>
<p>Currently all robots.txt statements are crawler directives. That means they can tell well-behaved search engines how to crawl a site (fetching contents), but they have no impact on indexing (listing contents on SERPs). I&#8217;ve recently published a <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">draft discussing possible REP tags for robots.txt</a>. REP tags are indexer directives known from <a href="http://sebastians-pamphlets.com/links/categories/?cat=robots-meta-tags">robots meta tags</a> and <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">X-Robots-Tags</a>, which &#8211;as on-page or per-URL directives&#8211; require crawling. </p>
<p>The crux is that REP tags must be assigned to URLs. Say you&#8217;ve got a gazillion printer-friendly pages in various directories that you want to deindex at Google: putting the &#8220;noindex,follow,noarchive&#8221; tags on each of them is a shitload of work. </p>
<p>How cool would this robots.txt code be instead: <code><br />
<b><a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-noindex">Noindex:</a> /*printable<br />
<a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-noarchive">Noarchive:</a> /*printable </b></code><br />
Search engines would continue to crawl, but would deindex previously indexed URLs, and not index new ones, from <code><br />
/articles/printable/*.htm<br />
/manuals/printable/*.pdf<br />
/products/descriptions/*.php?format=printable&amp;product=*<br />
... </code><br />
provided those URLs aren&#8217;t disallow&#8217;ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn&#8217;t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can&#8217;t accumulate PageRank from internal or external links, you&#8217;d add <code><br />
<b><a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-norank">Norank:</a> /*printable </b></code><br />
to the robots.txt code block above. </p>
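<p>To get a feeling for what a pattern like <code>/*printable</code> would catch, here&#8217;s a minimal PHP sketch of the matching. It assumes the wildcard semantics Google documents for Disallow statements (prefix match, <code>*</code> for any character sequence, a trailing <code>$</code> as end anchor). The function name <code>robotsPatternMatches()</code> is made up for illustration, and since Noindex: is only a proposal, nothing here reflects what a search engine actually does with it.</p>

```php
<?php
// Hypothetical helper: tests a URL path against a robots.txt-style pattern.
// Assumes prefix matching, "*" as wildcard and a trailing "$" as end anchor.
function robotsPatternMatches($pattern, $path)
{
    $anchored = (substr($pattern, -1) === '$');
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Quote regex metacharacters, then turn the quoted "*" into ".*".
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    $regex = '#^' . $regex . ($anchored ? '$' : '') . '#';
    return preg_match($regex, $path) === 1;
}

$paths = array(
    '/articles/printable/seo.htm',
    '/products/descriptions/item.php?format=printable&product=42',
    '/articles/seo.htm',
);
foreach ($paths as $path) {
    $verdict = robotsPatternMatches('/*printable', $path) ? 'noindex' : 'index';
    echo $path, ' => ', $verdict, "\n";
}
```

<p>The first two paths contain &#8220;printable&#8221; somewhere behind the leading slash and would be covered by the pattern; the third wouldn&#8217;t.</p>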
<p>If you don&#8217;t like search engines indexing stuff you&#8217;ve disallow&#8217;ed in your robots.txt based on 3rd party signals like inbound links, or Google accumulating PageRank even for disallow&#8217;ed URLs, you&#8217;d put: <code><br />
<b><a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#existing-robots-txt-statements">Disallow:</a> /unsearchable/<br />
<a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-noindex">Noindex:</a> /unsearchable/<br />
<a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-norank">Norank:</a> /unsearchable/ </b></code></p>
<p>To fix URL canonicalization issues with PHP session IDs and other tracking variables you&#8217;d write for example <code><br />
<b><a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-truncate-variable">Truncate-variable</a> sessionID: / </b></code><br />
and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs. </p>
<p>Unfortunately, robots.txt is not yet that powerful, so please link to the <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#probs-with-rep-tags-in-robots-txt">REP tags for robots.txt &#8220;RFC&#8221;</a> to make it popular, and proceed with what you have at the moment.</p>
<h3 id="popular-content-termination">The popular way</h3>
<p>Matt Cutts was kind enough to discuss Google&#8217;s take on contents excluded from search engine indexing in 10 minutes or less here:<br />
<object width="425" height="373">
<param name="movie" value="http://www.youtube.com/v/nM2VDkXPt0I&#038;rel=1&#038;border=1"></param>
<param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/nM2VDkXPt0I&#038;rel=1&#038;border=1" type="application/x-shockwave-flash" wmode="transparent" width="425" height="373"></embed></object><br />
You really should listen, the video isn&#8217;t that long. </p>
<p>In the following I&#8217;ve highlighted a few methods Matt has talked about:</p>
<dl>
<dt>Don&#8217;t link (very weak)</dt>
<dd>Although Google usually doesn&#8217;t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link to it out of the blue.</dd>
<dt id="does-google-index-protected-smut">.htaccess / .htpasswd (Matt&#8217;s first recommendation)</dt>
<dd>Since Google cannot crawl password protected contents, Matt declares this method of keeping content out of the index safe. I&#8217;m not sure what will happen when I spread a few strong links to somebody&#8217;s favorite smut collection; perhaps I&#8217;ll test some day whether Google and other search engines list such a reference on their SERPs.</dd>
<dt>robots.txt (weak)</dt>
<dd>Matt rightly points out that Google&#8217;s cool robots.txt validator in the <a href="http://www.vanessafoxnude.com/2008/01/09/the-google-webmaster-central-blog-a-retrospective/" title="Vanessa Fox Memorial">Webmaster Console</a> is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google&#8217;s smart capabilities to figure out suitable references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn&#8217;t prevent content from showing up in Google and other engines. Matt didn&#8217;t mention <a href="http://sebastians-pamphlets.com/about-noindex-crawler-directives-in-robots-txt/">Google&#8217;s experiments</a> with <a href="http://sebastians-pamphlets.com/stealthy-rep-experiments-google-jumping-the-shark/#noindex-disallow-peculiarities">Noindex: support in robots.txt</a>, which <a href="http://googlewebmastercentral.blogspot.com/2008/01/remove-your-content-from-google.html#c4352227362682466059">need improvement</a> but could <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-noindex">resolve this dilemma</a>.</dd>
<dt>Robots meta tags (Google only, weak with MSN/Yahoo)</dt>
<dd>The REP tag &#8220;noindex&#8221; in a robots meta element prevents indexing, and, once spotted, deindexes previously listed stuff - at least at Google. According to Matt, Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys &#8220;noindex&#8221; totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too. Also, search engines must crawl a page to discover this indexer directive. Matt adds that robots meta tags are problematic, because they&#8217;re buried on the pages and sometimes tend to get forgotten when no longer needed (Webmasters <strike>might</strike> do forget to take the tag down, or to add it later on, when search engine policies change, when work in progress gets released, or when outdated contents are taken down). Matt forgot to mention the neat X-Robots-Tags that can be used to apply REP tags in the HTTP header of non-HTML resources like images or PDF documents. <a href="http://sebastians-pamphlets.com/links/categories/?cat=x-robots-tag">Google&#8217;s X-Robots-Tag</a> is supported by Yahoo too. </dd>
<dt>Rel-nofollow (kind of weak)</dt>
<dd>Although <a href="http://link-condom.com/" title="rel-nofollow">condoms</a> totally remove links from Google&#8217;s link graphs, Matt says that rel-nofollow should not be used as a crawler or indexer directive. Rel-nofollow is for condomizing links only. Also, other search engines do follow nofollow&#8217;ed links, and even Google can discover the link destination from other links they gather on the Web, or grab from internal links inadvertently lacking a link condom. Finally, rel-nofollow requires crawling too.</dd>
<dt>URL removal tool in GWC (Matt&#8217;s second recommendation)</dt>
<dd>Judging by Matt&#8217;s enthusiasm while talking about Google&#8217;s neat URL terminator, this one should be considered his first recommendation. Google has provided tools to remove URLs from their search index for at least five years (way longer IIRC). Recently the <a href="http://kirklandwc.com/">Webmaster Central team</a> has integrated those, as well as new functionality, into the <a href="http://google.com/webmasters/tools">Webmaster Console</a>, giving them a very nice UI. The URL removal tools come with great granularity, and because the user&#8217;s site ownership is verified, they&#8217;re pretty powerful and safe, and even show the progress for each request (the removal process takes a few days). The UI is very flexible and even allows revoking previous removal requests. The wonderful little tool&#8217;s sole weak point is that it can&#8217;t remove URLs from the search index forever. After <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=59819&#038;ctx=sibling">90 days</a> or possibly six months the erased stuff can pop up again.</dd>
</dl>
<p><b>Summary:</b> If your site isn&#8217;t password protected, and you can&#8217;t live with indexing of disallow&#8217;ed contents, you must remove unwanted URLs from Google&#8217;s search index periodically. However, there are additional procedures that can support &#8211;but not guarantee!&#8211; deindexing. With other search engines it&#8217;s even worse, because those don&#8217;t respect the REP as Google does, and don&#8217;t provide such handy URL removal tools.</p>
<h3 id="definitive-content-termination">The definitive way</h3>
<p>Actually, I think Matt&#8217;s advice is very good, as long as you don&#8217;t need a permanent solution, or if you lack the programming skills to develop such a beast that works with all (major) search engines. I mean, everybody can insert a robots meta tag or robots.txt statement, and everybody can repeat URL removal requests with the neat URL terminator every six months, but most folks are scared when it comes to conditional manipulation of HTTP headers to keep stuff out of the index. However, I&#8217;ll try to explain quite safe methods that actually work (with Apache, not IIS) in the following examples. </p>
<p>First of all, if you really don&#8217;t want search engines to index your stuff, you must allow them to crawl it. And no, that&#8217;s not an oxymoron. At the moment there&#8217;s no such thing as an indexer directive on site-level. You can&#8217;t forbid indexing in robots.txt. All indexer directives require crawling of the URLs that you want to keep out of the SERPs. Of course that doesn&#8217;t mean you should serve search engine crawlers a book from each forbidden URL.</p>
<p>Let&#8217;s start with robots.txt. You put <code><br />
<b>User-agent: *<br />
Disallow: /images/<br />
Disallow: /movies/<br />
Disallow: /unsearchable/<br />
&nbsp;<br />
User-agent: Googlebot<br />
Disallow:<br />
Allow: /<br />
&nbsp;<br />
User-agent: Slurp<br />
Disallow:<br />
Allow: / </b></code><br />
The first section is just a fallback for crawlers other than Googlebot and Slurp, which obey only their own sections.</p>
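<p>Why is the first section just a fallback? A crawler looks for a record that names it and uses the <code>User-agent: *</code> record only when there is none. Here&#8217;s a rough PHP sketch of that selection. It assumes simple case-insensitive substring matching of the user agent token; <code>selectRobotsGroup()</code> and the array layout are hypothetical, real crawlers parse the file themselves.</p>

```php
<?php
// Hypothetical sketch of how a crawler picks its robots.txt record.
// $groups maps a User-agent token to the rule lines of that record.
function selectRobotsGroup(array $groups, $crawlerName)
{
    foreach ($groups as $token => $rules) {
        if ($token !== '*' && stripos($crawlerName, $token) !== false) {
            return $rules; // a named record wins over the wildcard record
        }
    }
    return isset($groups['*']) ? $groups['*'] : array();
}

// The robots.txt from above, one array entry per record.
$groups = array(
    '*'         => array('Disallow: /images/', 'Disallow: /movies/', 'Disallow: /unsearchable/'),
    'Googlebot' => array('Disallow:', 'Allow: /'),
    'Slurp'     => array('Disallow:', 'Allow: /'),
);

$rulesForGoogle = selectRobotsGroup($groups, 'Googlebot');
$rulesForOthers = selectRobotsGroup($groups, 'SomeOtherBot');
```

<p>With this robots.txt, Googlebot and Slurp end up with the records that allow everything, while any other well-behaved crawler gets the three Disallow lines.</p>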
<p>(Here comes a rather brutal method that you can use to keep search engines out of particular directories. It&#8217;s not suitable to deal with duplicate content, session IDs, or other URL canonicalization. More on that later.)</p>
<p>Next edit your .htaccess file. <code><br />
<b>&lt;IfModule mod_rewrite.c&gt;<br />
RewriteEngine On<br />
RewriteCond %{REQUEST_URI} ^/unsearchable/<br />
RewriteCond %{REQUEST_URI} !\.php<br />
RewriteRule . /unsearchable/output-content.php [L]<br />
&lt;/IfModule&gt;<br />
</b></code><br />
If you have .php pages in /unsearchable/, remove the second rewrite condition, put output-content.php into another directory, and edit my PHP code below so that it includes the PHP scripts (don&#8217;t forget to pass the query string). </p>
<p>Now grab the PHP code to check for search engine crawlers <a href="http://sebastians-pamphlets.com/linking-guide-for-paranoid-affiliate-marketers/#grab-php-code-check-crawler">here</a> and include it below. Your script /unsearchable/output-content.php looks like: <code><br />
<b>&lt;?php<br />
@include(&quot;crawler-stuff.php&quot;); // defines variables and functions<br />
$isSpider = checkCrawlerIP ($requestUri);<br />
if ($isSpider) {<br />
     @header(&quot;HTTP/1.1 403 Thou shalt not index this&quot;, TRUE, 403);<br />
     @header(&quot;X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir&quot;);<br />
     exit;<br />
}<br />
&nbsp;<br />
$arr            = explode(&quot;#&quot;, $requestUri);<br />
$outputFileName = $arr[0];<br />
$arr            = explode(&quot;?&quot;, $outputFileName);<br />
$outputFileName = $_SERVER[&quot;DOCUMENT_ROOT&quot;] .$arr[0];<br />
if (substr($outputFileName, -1, 1) == &quot;/&quot;) {<br />
    $outputFileName .= &quot;index.html&quot;;<br />
}<br />
if (file_exists($outputFileName)) {<br />
    // send the <a href="http://www.iana.org/assignments/media-types/">content type</a> header<br />
    $contentType = &quot;text/plain&quot;;<br />
    if (stristr($outputFileName, &quot;.html&quot;)) $contentType =&quot;text/html&quot;;<br />
    if (stristr($outputFileName, &quot;.css&quot;))  $contentType =&quot;text/css&quot;;<br />
    if (stristr($outputFileName, &quot;.js&quot;))   $contentType =&quot;text/javascript&quot;;<br />
    if (stristr($outputFileName, &quot;.png&quot;))  $contentType =&quot;image/png&quot;;<br />
    if (stristr($outputFileName, &quot;.jpg&quot;))  $contentType =&quot;image/jpeg&quot;;<br />
    if (stristr($outputFileName, &quot;.gif&quot;))  $contentType =&quot;image/gif&quot;;<br />
    if (stristr($outputFileName, &quot;.xml&quot;))  $contentType =&quot;application/xml&quot;;<br />
    if (stristr($outputFileName, &quot;.pdf&quot;))  $contentType =&quot;application/pdf&quot;;<br />
    @header(&quot;Content-type: $contentType&quot;);<br />
    @header(&quot;X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir&quot;);<br />
    readfile($outputFileName);<br />
    exit;<br />
}<br />
&nbsp;<br />
// That&#8217;s not the canonical way to call the 404 error page. Don&#8217;t copy, adapt:<br />
@header(&quot;HTTP/1.1 307 Oups, I displaced $outputFileName&quot;, TRUE, 307);<br />
@header(&quot;Location: http://sebastians-pamphlets.com/404/&quot;);<br />
exit;<br />
?&gt;</b></code></p>
<p>What does the gibberish above do? In .htaccess we rewrite all requests for resources stored in /unsearchable/ to a PHP script, which checks whether the request is from a search engine crawler or not. </p>
<p>If the requestor is a verified crawler (known IP, or IP and host name belonging to a major search engine&#8217;s crawling engine), we return an unfriendly X-Robots-Tag and an HTTP response code 403 telling the search engine that access to our content is forbidden. The search engines should assume that a human visitor receives the same response, hence they aren&#8217;t keen on indexing these URLs. Even if a search engine lists a URL on the SERPs by accident, it can&#8217;t tell the searcher anything about the uncrawled contents. That&#8217;s unlikely to happen actually, because the X-Robots-Tag forbids indexing (Ask and MSN might ignore these directives).</p>
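<p>The &#8220;IP and host name&#8221; check can be sketched with a reverse DNS lookup followed by a forward confirmation. This is a hypothetical illustration, not the <code>checkCrawlerIP()</code> routine from crawler-stuff.php: the function name, its parameters and the domain list in the usage comment are assumptions, and PHP&#8217;s <code>gethostbynamel()</code> handles IPv4 addresses only.</p>

```php
<?php
// Hypothetical sketch, not the checkCrawlerIP() from crawler-stuff.php.
// A crawler counts as verified when its IP resolves to a host name below
// one of the given domains and that host name resolves back to the IP.
function isVerifiedCrawler($ip, array $crawlerDomains,
                           $reverseLookup = 'gethostbyaddr',
                           $forwardLookup = 'gethostbynamel')
{
    $host = call_user_func($reverseLookup, $ip);   // IP -> host name
    if (!is_string($host) || $host === '' || $host === $ip) {
        return false;                              // no usable reverse DNS entry
    }
    $host = strtolower(rtrim($host, '.'));

    $domainMatches = false;
    foreach ($crawlerDomains as $domain) {
        $suffix = '.' . strtolower($domain);
        if (substr($host, -strlen($suffix)) === $suffix) {
            $domainMatches = true;
            break;
        }
    }
    if (!$domainMatches) {
        return false;
    }

    $ips = call_user_func($forwardLookup, $host);  // host name -> IPv4 list
    return is_array($ips) && in_array($ip, $ips, true);
}

// Example call; the domain list is illustrative, check each engine's docs.
// $isSpider = isVerifiedCrawler($_SERVER['REMOTE_ADDR'],
//                               array('googlebot.com', 'crawl.yahoo.net'));
```

<p>The two lookup callbacks are parameters only so that the logic can be tried without real DNS queries. Because DNS lookups are slow, you&#8217;d probably want to cache the verdict per IP.</p>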
<p>If the requestor is a human visitor, or an unknown Web robot, we serve the requested contents. If the file doesn&#8217;t exist, we call the 404 handler.</p>
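<p>Since the script builds the file name from the request URI, it may be worth making sure that the resolved file really lives below /unsearchable/ before handing it to <code>readfile()</code>. A minimal sketch, assuming PHP&#8217;s <code>realpath()</code> is available on your host; <code>resolveInside()</code> is a made-up helper, not part of the script above.</p>

```php
<?php
// Hypothetical helper: maps a URI path to a real file below
// $docRoot . $allowedDir, or returns FALSE if the file is missing
// or the resolved path escapes that directory.
function resolveInside($docRoot, $uriPath, $allowedDir = '/unsearchable')
{
    $base = realpath($docRoot . $allowedDir);
    $file = realpath($docRoot . $uriPath);      // resolves ".." and symlinks
    if ($base === false || $file === false || !is_file($file)) {
        return false;
    }
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    return (strpos($file, $prefix) === 0) ? $file : false;
}
```

<p>In the script above you could call it with <code>$_SERVER["DOCUMENT_ROOT"]</code> and the URI path (after the default index.html has been appended), and fall through to the 404 handling when it returns FALSE.</p>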
<p>With dynamic content you must handle the query string and (expected) cookies yourself. PHP&#8217;s readfile() is binary safe, so the script above works with <a href="http://sebastians-pamphlets.com/unsearchable/dear-google-my-diary-is-unsearchable.png">images</a> or <a href="http://sebastians-pamphlets.com/unsearchable/dear-google-my-diary-is-unsearchable.pdf">PDF documents</a> too. </p>
<p>If you have a genuine search engine crawler coming from a verifiable server, feel free to <a href="http://sebastians-pamphlets.com/unsearchable/">test it with this page</a> (user agent spoofing doesn&#8217;t qualify as a crawler; come back in a week or so to check whether the engines have indexed the unsearchable stuff linked above).</p>
<p>The method above is not only brutal, it wastes all the juice from links pointing to the unsearchable site areas. To rescue the PageRank, change the script as follows: <code><br />
<b>&#8230;<br />
$urlThatDesperatelyNeedsPageRank = &quot;http://sebastians-pamphlets.com/about/&quot;;<br />
if ($isSpider) {<br />
     @header(&quot;HTTP/1.1 301 Moved permanently&quot;, TRUE, 301);<br />
     @header(&quot;Location: $urlThatDesperatelyNeedsPageRank&quot;);<br />
     exit;<br />
}<br />
&#8230;</b></code><br />
This redirects crawlers to the URL that has won your internal PageRank lottery. Search engines will/shall transfer the reputation gained from inbound links to this page. Of course page by page redirects would be your first choice, but when you block entire directories you can&#8217;t accomplish this kind of granularity.</p>
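<p>If at least some of the blocked URLs have an obvious counterpart, a small lookup table gives you page by page redirects for those and the lottery winner for the rest. A sketch under assumptions: <code>redirectTargetFor()</code> and all example.com URLs are invented for illustration.</p>

```php
<?php
// Hypothetical redirect map: paths with a known counterpart get their own
// target, everything else shares one fallback URL.
function redirectTargetFor($path, array $redirectMap, $fallbackUrl)
{
    return isset($redirectMap[$path]) ? $redirectMap[$path] : $fallbackUrl;
}

// Example data; paths and URLs are invented for illustration.
$redirectMap = array(
    '/unsearchable/old-manual.html' => 'http://example.com/manuals/',
);
$fallbackUrl = 'http://example.com/about/';

$target = redirectTargetFor('/unsearchable/old-manual.html', $redirectMap, $fallbackUrl);
// Inside the if ($isSpider) block you would then send:
// header("HTTP/1.1 301 Moved permanently", TRUE, 301);
// header("Location: $target");
```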
<p>By the way, when you remove the offensive 403-forbidden stuff in the script above and change it a little more, you can use it to apply various X-Robots-Tags to your HTML pages, images, videos and whatnot. When a search engine finds an X-Robots-Tag in the HTTP header, it should ignore conflicting indexer directives in robots meta tags. That&#8217;s a smart way to steer indexing of bazillions of resources without editing them.</p>
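<p>Here&#8217;s a rough idea of what &#8220;change it a little more&#8221; could look like: a table that maps file extensions to X-Robots-Tag values. The helper <code>xRobotsTagFor()</code> and the chosen values are assumptions for illustration; pick the REP tags your resources actually need.</p>

```php
<?php
// Hypothetical mapping of file extensions to X-Robots-Tag values.
function xRobotsTagFor($fileName, array $tagsByExtension, $default = '')
{
    $ext = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));
    return isset($tagsByExtension[$ext]) ? $tagsByExtension[$ext] : $default;
}

// Example table; extensions without an entry get no X-Robots-Tag at all.
$tagsByExtension = array(
    'pdf' => 'noindex,noarchive',
    'png' => 'noindex',
    'jpg' => 'noindex',
);

$tag = xRobotsTagFor('/manuals/printable/setup.PDF', $tagsByExtension);
if ($tag !== '' && !headers_sent()) {
    header('X-Robots-Tag: ' . $tag);   // must go out before any content
}
```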
<p>Ok, this was the cruel method; now let&#8217;s discuss cases where telling crawlers how to behave is a royal PITA, thanks to the lack of indexer directives in robots.txt that provide the required granularity (<a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-truncate-variable">Truncate-variable</a>, <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-truncate-value">Truncate-value</a>, <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#robots-txt-order-arguments">Order-arguments</a>, &#8230;).</p>
<p>Say you&#8217;ve session IDs in your URLs. That&#8217;s one (not exactly elegant) way to track users or <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">affiliate IDs</a>, but strictly forbidden when the requestor is a <a href="http://www.smart-it-consulting.com/article.htm?node=148&#038;page=103">search engine&#8217;s Web robot</a>. </p>
<p>In fact, a site with unprotected tracking variables is a spider trap that would produce infinite loops in crawling, because spiders following internal links with those variables discover new redundant URLs with each and every fetch of a page. Of course the engines found suitable procedures to dramatically reduce their crawling of such sites, which results in fewer indexed pages. Besides joyless index penetration there&#8217;s another disadvantage - the indexed URLs are powerless duplicates that usually rank beyond the sonic barrier at 1,000 results per search query. </p>
<p>Smart search engines perform highly sophisticated URL canonicalization to get a grip on such crap, but Webmasters can&#8217;t rely on Google &amp; Co to fix their site&#8217;s maladies. </p>
<p>Ok, we agree that you don&#8217;t want search engines to index your ugly URLs, duplicates, and whatnot. To properly steer indexing, you can&#8217;t just block the crawlers&#8217; access to URLs/contents that shouldn&#8217;t appear on SERPs. Search engines discover most of those URLs when following links, and that means that they&#8217;re ready to assign PageRank or other scoring of link popularity to your URLs. PageRank / linkpop is a ranking factor you shouldn&#8217;t waste. Every URL known to search engines is an asset, hence handle it with care. Always bother to figure out the canonical URL, then do a page by page <a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#301-moved-permanently">permanent redirect (301)</a>.</p>
<p>For your URL canonicalization you should have an include file that&#8217;s available at the very top of all your scripts, executed before PHP sends anything to the user agent (don&#8217;t hack each script; maintaining so many places handling the same stuff is a nightmare, and error-prone). In this include file put the crawler detection code and your individual routines that handle canonicalization and other search engine friendly cloaking routines.</p>
<p><a id="view-code-strip-qs-vars" onclick="showContent('php-code-strip-qs-vars'); hideContent('view-code-strip-qs-vars'); return false;">View a </a>Code example (stripping useless query string variables).<code id="php-code-strip-qs-vars" style="display:none;"><br />
<b>&lt;?php<br />
@include(&quot;crawler-stuff.php&quot;); // defines variables and functions<br />
$isSpider = checkCrawlerIP ($requestUri);<br />
if ($isSpider) {<br />
    $canonicalServerName = &quot;sebastians-pamphlets.com&quot;;<br />
    $doRedirect = FALSE;<br />
    $killVars = array(&quot;sessionID&quot;, &quot;affiliateID&quot;, &quot;format&quot;, &quot;nav&quot;);<br />
    $arr = explode(&quot;#&quot;, $requestUri);<br />
    $uri = $arr[0];<br />
    $arr = explode(&quot;?&quot;, $uri);<br />
    $uri = $arr[0];<br />
    if ($queryString) {<br />
        // decode HTML-encoded ampersands, then split into arguments<br />
        $qs = str_replace(&quot;&amp;amp;&quot;, &quot;&&quot;, $queryString);<br />
        $qsArr = explode(&quot;&&quot;, $qs);<br />
        $qs = &quot;&quot;;<br />
        foreach ($qsArr as $qsArg) {<br />
            $arr = explode(&quot;=&quot;, $qsArg, 2);<br />
            $var = $arr[0];<br />
            $val = isset($arr[1]) ? $arr[1] : &quot;&quot;; // an argument without &quot;=&quot; has no value<br />
            if ($val === &quot;&quot;) {<br />
                $doRedirect = TRUE;<br />
            }<br />
            foreach ($killVars as $killVar) {<br />
                if (&quot;$killVar&quot; == &quot;$var&quot;) {<br />
                    $var = &quot;&quot;;<br />
                    $doRedirect = TRUE;<br />
                }<br />
            }<br />
            if ($var !== &quot;&quot; &#038;&#038; $val !== &quot;&quot;) {<br />
                if ($qs !== &quot;&quot;) $qs .= &quot;&&quot;; // plain ampersand, this string ends up in the Location header<br />
                $qs .= $var .&quot;=&quot; .$val;<br />
            }<br />
        }<br />
        if (!empty($qs)) {<br />
            $uri .= &quot;?&quot; .$qs;<br />
        }<br />
    }<br />
    if ($doRedirect) {<br />
        $canonicalUrl = &quot;http://&quot; .$canonicalServerName .$uri;<br />
        @header(&quot;HTTP/1.1 301 Moved permanently&quot;, TRUE, 301);<br />
        @header(&quot;Location: $canonicalUrl&quot;);<br />
        exit;<br />
    }<br />
} // isSpider<br />
?&gt; </b></code></p>
<p>How you implement the actual canonicalization routines depends on your individual site. I mean, if you didn&#8217;t have the coding skills necessary to accomplish that, you wouldn&#8217;t be reading this entire section, would you? </p>
<p><b>Here are a few examples of pretty common canonicalization issues:</b></p>
<ul>
<li>Session IDs and other stuff used for user tracking</li>
<li>Affiliate IDs and IDs used to track the referring traffic source</li>
<li>Empty values of query string variables</li>
<li>Query string arguments put in different order / not checking the canonical sequence of query string arguments (ordering them alphabetically is always a good idea)</li>
<li>Redundant query string arguments</li>
<li>URLs longer than 255 bytes</li>
<li>Server name confusion, e.g. subdomains like &#8220;www&#8221;, &#8220;ww&#8221;, &#8220;random-string&#8221; all serving identical contents from example.com</li>
<li>Case issues (IIS/clueless code monkeys handling GET-variables/values case-insensitively)</li>
<li>Spaces, punctuation, or other special characters in URLs</li>
<li>Different scripts outputting identical contents</li>
<li>Flawed navigation, e.g. passing the menu item to the linked URL</li>
<li>Inconsistent default values for variables expected from cookies</li>
<li>Accepting undefined query string variables from GET requests</li>
<li>Contentless pages, e.g. outputted templates when the content pulled from the database equals whitespace or is not available</li>
<li><b>&#8230;</b></li>
</ul>
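<p>Several items from this list (tracking variables, empty values, argument order) can be handled in one pass over the query string. A minimal sketch, assuming that argument order and duplicated variables carry no meaning on your site; <code>canonicalQueryString()</code> is a hypothetical helper, not part of the code example above.</p>

```php
<?php
// Hypothetical helper: drops tracking variables and empty values, then
// sorts the remaining arguments by name to get one canonical sequence.
function canonicalQueryString($queryString, array $killVars)
{
    $args = array();
    foreach (explode('&', $queryString) as $arg) {
        $parts = explode('=', $arg, 2);
        $var = $parts[0];
        $val = isset($parts[1]) ? $parts[1] : '';
        if ($var === '' || $val === '' || in_array($var, $killVars, true)) {
            continue;
        }
        $args[$var] = $val; // a repeated variable keeps its last value
    }
    ksort($args, SORT_STRING);
    $pairs = array();
    foreach ($args as $var => $val) {
        $pairs[] = $var . '=' . $val;
    }
    return implode('&', $pairs);
}

$canonical = canonicalQueryString(
    'sessionID=abc&product=42&format=&category=7',
    array('sessionID', 'affiliateID')
);
```

<p>If the result differs from the requested query string, that&#8217;s your cue for a 301 redirect to the canonical URL.</p>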
<h3>Summary</h3>
<p>Hiding contents from all search engines requires programming skills that many sites can&#8217;t afford. Even leading search engines like Google don&#8217;t provide simple and suitable ways to deindex content &#8211;or to prevent content from being indexed&#8211; without collateral damage (lost/wasted PageRank). We desperately need better tools. Maybe my <a href="http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/">robots.txt extensions</a> are worth an inspection.</p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/getting-urls-out-of-google-the-good-popular-definitive-way/feed/</wfw:commentRss>
		</item>
		<item>
		<title>My plea to Google - Please sanitize your REP revamps</title>
		<link>http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/</link>
		<comments>http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/#comments</comments>
		<pubDate>Fri, 04 Jan 2008 00:02:46 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Robots Meta Tags]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[X-Robots-Tag]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Crawler Directives]]></category>

		<category><![CDATA[XML-Sitemaps]]></category>

		<category><![CDATA[Microformats]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[robots.txt]]></category>

		<category><![CDATA[Nofollow]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/</guid>
		<description><![CDATA[Standardization of REP tags as robots.txt directives
This draft is kind of a request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains its own REP standard). It&#8217;s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported [...]]]></description>
			<content:encoded><![CDATA[<h3 id="rep-revamp-serious-post-title">Standardization of REP tags as robots.txt directives</h3>
<p><img src="http://sebastians-pamphlets.com/img/google/google-confused-on-rep-robots.txt.png" width="234" height="250" align="right" style="margin-left:4px;" alt="Google is confused on REP standards and robots.txt" title="Please help Google to get it!" />This draft is kind of a request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains its own REP standard). It&#8217;s based on/extends the robots.txt specifications from <a href="http://www.ro