<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2.3" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: Get yourself a smart robots.txt</title>
	<link>http://sebastians-pamphlets.com/smart-robots-txt/</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Fri, 10 Feb 2012 01:02:36 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>

	<item>
		<title>By: How To Help Search Engines Find Your Content &#124; Van SEO Design</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2770</link>
		<dc:creator>How To Help Search Engines Find Your Content &#124; Van SEO Design</dc:creator>
		<pubDate>Thu, 16 Dec 2010 19:39:52 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2770</guid>
		<description>[...] Get yourself a smart robots.txt  [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Get yourself a smart robots.txt  [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sebastian</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2634</link>
		<dc:creator>Sebastian</dc:creator>
		<pubDate>Wed, 03 Nov 2010 11:13:37 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2634</guid>
		<description>David, cloak the META element for the news crawler, or, even better, serve it via X-Robots-Tag in the header to the news bot only. That'll be enough for a reasonably clean test. If the Google News service doesn't support it, drop me a line and we can create some noise. ;-)</description>
		<content:encoded><![CDATA[<p>David, cloak the META element for the news crawler, or, even better, serve it via X-Robots-Tag in the header to the news bot only. That&#8217;ll be enough for a reasonably clean test. If the Google News service doesn&#8217;t support it, drop me a line and we can create some noise. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2633</link>
		<dc:creator>David</dc:creator>
		<pubDate>Wed, 03 Nov 2010 07:26:25 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2633</guid>
		<description>Hi Sebastian,
thanks for the response.

However, with that googlebot-nosnippet command, you would also remove the snippets from web search results.

I only want to remove the snippets from Google News (only for googlebot-news), and even my contact at Google News USA believes the news bot does not accept the command...

If I have a chance, I'll try...</description>
		<content:encoded><![CDATA[<p>Hi Sebastian,<br />
thanks for the response.</p>
<p>However, with that googlebot-nosnippet command, you would also remove the snippets from web search results.</p>
<p>I only want to remove the snippets from Google News (only for googlebot-news), and even my contact at Google News USA believes the news bot does not accept the command&#8230;</p>
<p>If I have a chance, I&#8217;ll try&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sebastian</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2632</link>
		<dc:creator>Sebastian</dc:creator>
		<pubDate>Wed, 03 Nov 2010 07:20:22 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2632</guid>
		<description>&lt;p&gt;David, Google provides &lt;a href="http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html"&gt;nosnippet/noarchive syntax&lt;/a&gt; examples for news sites, so try it:&lt;/p&gt;

&lt;blockquote&gt;Usually you want Google to display both the snippet and the cached link. However, there are some cases where you might want to disable one or both of these. For example, say you were a newspaper publisher, and you have a page whose content changes several times a day. It may take longer than a day for us to reindex a page, so users may have access to a cached copy of the page that is not the same as the one currently on your site. In this case, you probably don't want the cached link appearing in our results.

Again, the Robots Exclusion Protocol comes to your aid. Add the NOARCHIVE tag to a web page and Google won't cache copy of a web page in search results:

&amp;lt;META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"&amp;gt;

Similarly, you can tell Google not to display a snippet for a page. The NOSNIPPET tag achieves this:

&amp;lt;META NAME="GOOGLEBOT" CONTENT="NOSNIPPET"&amp;gt;

Adding NOSNIPPET also has the effect of preventing a cache link from being shown, so if you specify NOSNIPPET you automatically get NOARCHIVE too.&lt;/blockquote&gt;

&lt;p&gt;So I'd guess that works fine with the news crawler. I haven't tested it myself, though, since I don't run CNN or the like.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>David, Google provides <a href="http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html">nosnippet/noarchive syntax</a> examples for news sites, so try it:</p>
<blockquote><p>Usually you want Google to display both the snippet and the cached link. However, there are some cases where you might want to disable one or both of these. For example, say you were a newspaper publisher, and you have a page whose content changes several times a day. It may take longer than a day for us to reindex a page, so users may have access to a cached copy of the page that is not the same as the one currently on your site. In this case, you probably don&#8217;t want the cached link appearing in our results.</p>
<p>Again, the Robots Exclusion Protocol comes to your aid. Add the NOARCHIVE tag to a web page and Google won&#8217;t cache copy of a web page in search results:</p>
<p><code>&lt;META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"&gt;</code></p>
<p>Similarly, you can tell Google not to display a snippet for a page. The NOSNIPPET tag achieves this:</p>
<p><code>&lt;META NAME="GOOGLEBOT" CONTENT="NOSNIPPET"&gt;</code></p>
<p>Adding NOSNIPPET also has the effect of preventing a cache link from being shown, so if you specify NOSNIPPET you automatically get NOARCHIVE too.</p></blockquote>
<p>So I&#8217;d guess that works fine with the news crawler. I haven&#8217;t tested it myself, though, since I don&#8217;t run CNN or the like.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2630</link>
		<dc:creator>David</dc:creator>
		<pubDate>Tue, 02 Nov 2010 06:31:58 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2630</guid>
		<description>Hi Sebastian,
do you know if it is possible to combine the meta robots element for googlebot-news with nosnippet?

So will Google accept this?
</description>
		<content:encoded><![CDATA[<p>Hi Sebastian,<br />
do you know if it is possible to combine the meta robots element for googlebot-news with nosnippet?</p>
<p>So will Google accept this?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jonny</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2467</link>
		<dc:creator>jonny</dc:creator>
		<pubDate>Fri, 16 Jul 2010 01:54:18 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2467</guid>
		<description>I only wish I had the knowledge needed to get this to work. Time to do some digging.</description>
		<content:encoded><![CDATA[<p>I only wish I had the knowledge needed to get this to work. Time to do some digging.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christopher Roberts</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2419</link>
		<dc:creator>Christopher Roberts</dc:creator>
		<pubDate>Mon, 31 May 2010 04:39:34 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2419</guid>
		<description>This is extremely interesting and useful stuff. I never thought that a robots.txt could be so complicated, and could do so much!

I'm still trying to get my head round some of it, but the "noindex" stuff has been implemented in my robots.txt.

Your blog is extremely valuable, and I appreciate that!


Keep up the good work :D

Christopher</description>
		<content:encoded><![CDATA[<p>This is extremely interesting and useful stuff. I never thought that a robots.txt could be so complicated, and could do so much!</p>
<p>I&#8217;m still trying to get my head round some of it, but the &#8220;noindex&#8221; stuff has been implemented in my robots.txt.</p>
<p>Your blog is extremely valuable, and I appreciate that!</p>
<p>Keep up the good work <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>Christopher</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sebastian</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2377</link>
		<dc:creator>Sebastian</dc:creator>
		<pubDate>Mon, 26 Apr 2010 15:11:04 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2377</guid>
		<description>Nope, Mik. A misbehaving bot can't be blocked by a &lt;code&gt;Disallow: /&lt;/code&gt; statement in robots.txt.

What you need is an apparatus that detects nasty bot behavior by analyzing each and every HTTP request, sometimes triggering a ban of an IP, host, user agent, or a combination of those values on the &lt;i&gt;n&lt;/i&gt;th request.

Unusual (IOW fraudulent) requests of /robots.txt are just one source that feeds such a blacklist. For example, when you request the robots.txt file of a protected site pretending to be Googlebot from your Cox Communications IP addy, all your future requests run into a 403. The same will happen when you request the root index page, or any other resource (page, image, ...), regardless of whether you've fetched robots.txt before or not.

Major search engines provide a procedure to validate requests from their legit bots. Any request with a search engine crawler user agent string and an IP addy that can't be verified as belonging to that search engine's crawler is fraudulent, and should be blocked. Instead of responding with a 403-GFY you can serve ads or anything else except your content.

If you make use of somewhat shady optimization techniques, such as some variants of IP delivery, the procedure outlined above is not safe enough, but for most sites it will serve you well.</description>
		<content:encoded><![CDATA[<p>Nope, Mik. A misbehaving bot can&#8217;t be blocked by a <code>Disallow: /</code> statement in robots.txt.</p>
<p>What you need is an apparatus that detects nasty bot behavior by analyzing each and every HTTP request, sometimes triggering a ban of an IP, host, user agent, or a combination of those values on the <i>n</i>th request.</p>
<p>Unusual (IOW fraudulent) requests of /robots.txt are just one source that feeds such a blacklist. For example, when you request the robots.txt file of a protected site pretending to be Googlebot from your Cox Communications IP addy, all your future requests run into a 403. The same will happen when you request the root index page, or any other resource (page, image, &#8230;), regardless of whether you&#8217;ve fetched robots.txt before or not.</p>
<p>Major search engines provide a procedure to validate requests from their legit bots. Any request with a search engine crawler user agent string and an IP addy that can&#8217;t be verified as belonging to that search engine&#8217;s crawler is fraudulent, and should be blocked. Instead of responding with a 403-GFY you can serve ads or anything else except your content.</p>
<p>If you make use of somewhat shady optimization techniques, such as some variants of IP delivery, the procedure outlined above is not safe enough, but for most sites it will serve you well.</p>
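<p>The validation step above could look roughly like the following Python sketch. It is only an illustration, not the code running on this site: the function name <code>classify_request</code>, the return labels, and the <code>CRAWLER_HOST_SUFFIXES</code> table are assumptions made up for this example, and the host suffixes should be checked against each search engine's own documentation. It does a reverse DNS lookup on the requesting IP, checks the host name against the suffixes expected for the claimed crawler, then confirms with a forward lookup that the host name resolves back to the same IP.</p>

```python
import socket

# Assumption for illustration: host suffixes per crawler token.
# Verify the real values in each search engine's documentation.
CRAWLER_HOST_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
}


def _reverse_dns(ip):
    # gethostbyaddr returns (hostname, aliases, addresses)
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host):
    # gethostbyname_ex returns (hostname, aliases, addresses)
    return socket.gethostbyname_ex(host)[2]


def classify_request(ip, user_agent,
                     reverse_dns=_reverse_dns, forward_dns=_forward_dns):
    """Return 'verified', 'fraudulent', or 'no-claim' for one request."""
    ua = user_agent.lower()
    for token, suffixes in CRAWLER_HOST_SUFFIXES.items():
        if token not in ua:
            continue
        try:
            host = reverse_dns(ip).lower().rstrip(".")
            # The host must belong to the crawler's domain AND
            # resolve back to the requesting IP.
            if host.endswith(suffixes) and ip in forward_dns(host):
                return "verified"
        except OSError:
            pass  # a failed lookup counts as unverified
        return "fraudulent"
    # The user agent does not claim to be a known crawler.
    return "no-claim"
```

<p>A request classified as <code>fraudulent</code> would then feed the blacklist (or get the 403). The resolver functions are passed in as parameters so the check can be exercised without network access, and in production you would want to cache the lookup results.</p>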
]]></content:encoded>
	</item>
	<item>
		<title>By: Mik</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2376</link>
		<dc:creator>Mik</dc:creator>
		<pubDate>Mon, 26 Apr 2010 10:01:33 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2376</guid>
		<description>"Once they request robots.txt, they’ll never see any content again."

Does this refer to the Disallow in your smart robots.txt file? Or does it refer to some other function? It's a little confusing, since the root of the problem is still that bad bots generally ignore robots.txt.</description>
		<content:encoded><![CDATA[<p>&#8220;Once they request robots.txt, they’ll never see any content again.&#8221;</p>
<p>Does this refer to the Disallow in your smart robots.txt file? Or does it refer to some other function? It&#8217;s a little confusing, since the root of the problem is still that bad bots generally ignore robots.txt.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dena Tasarim</title>
		<link>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2347</link>
		<dc:creator>Dena Tasarim</dc:creator>
		<pubDate>Fri, 19 Mar 2010 08:49:05 +0000</pubDate>
		<guid>http://sebastians-pamphlets.com/smart-robots-txt/#comment-2347</guid>
		<description>Nice article, Sebastian. I never knew there was anything this deep to the robots.txt file. My clients usually nag me about stupid robots stealing their content/images. I'll try to implement this solution from now on :)</description>
		<content:encoded><![CDATA[<p>Nice article, Sebastian. I never knew there was anything this deep to the robots.txt file. My clients usually nag me about stupid robots stealing their content/images. I&#8217;ll try to implement this solution from now on <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
</channel>
</rss>

