The perfect robots.txt for News Corp
I appreciate Google’s brand new News user agent. It is, however, not a perfect solution, because it doesn’t distinguish between crawling and indexing.
Disallow is a crawler directive that simply tells web robots “do not fetch my content”. It doesn’t prevent content from being indexed. That means search engines can index content they’re not allowed to fetch from the source, and send free traffic to disallow’ed URIs. In the case of news, there are enough 3rd party signals (links, anchor text, quotes, …) out there to create a neat title and snippet on the SERPs.
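For comparison, a page-level indexer directive looks like this (plain HTML, nothing Google-specific):

<meta name="robots" content="noindex">

A crawler that obeys Disallow never fetches the page, so it never sees this element. That’s why a disallow’ed URL can still show up on the SERPs, and why an indexer directive that lives in robots.txt itself is so useful.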
Fortunately, Google’s REP implementation allows news sites to refine the suggested robots.txt syntax below: Google supports noindex in robots.txt.
Below I’ve edited the robots.txt syntax suggested by Google (source).
Include pages in Google web search, but not in News:
User-agent: Googlebot
Disallow:
User-agent: Googlebot-News
Disallow: /
Noindex: /
This robots.txt file says that no files are disallowed for Google’s general web crawler, called Googlebot, but the user agent “Googlebot-News” is blocked from all files on the website. The “Noindex” directive makes sure that Google News cannot list the blocked URLs anyway, with titles and snippets compiled from 3rd party signals.
Include pages in Google News, but not Google web search:
User-agent: Googlebot
Disallow: /
Noindex: /
User-agent: Googlebot-News
Disallow:
When parsing a robots.txt file, Google obeys the most specific directive. The first record tells us that Googlebot (the user agent for Google’s web index) is blocked from crawling any pages from the site; its “Noindex” directive makes sure that Google Web Search cannot list the blocked URLs anyway, with titles and snippets compiled from 3rd party signals. The second record, which applies to the more specific user agent for Google News, overrides the blocking of Googlebot and gives permission for Google News to crawl pages from the website.
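To make the record selection more obvious, here’s a sketch with a catch-all record added (my addition, it’s not part of Google’s suggested syntax): each robot picks the record with the most specific matching user agent, so Googlebot-News uses the last record, Googlebot the second one, and every other robot falls back to the asterisk.

User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /
Noindex: /

User-agent: Googlebot-News
Disallow: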
Of course other search engines might handle this differently. So it is obviously a good idea to add indexer directives on page level, too. The most elegant way to do that is an X-Robots-Tag: noindex, noarchive, nosnippet in the HTTP header, because images, videos, PDFs etc. can’t be stuffed with HTML’s META elements.
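Here is a sketch of how such a header could be set for static files on Apache (assuming mod_headers is enabled; the file extensions are just placeholders, adjust them to your setup):

# send indexer directives with non-HTML resources
<FilesMatch "\.(pdf|jpe?g|png|gif|mp4)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

The responses then carry X-Robots-Tag: noindex, noarchive, nosnippet, which supporting indexers treat like a robots META element, for any file type.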
See how this works neatly with Web standards? There’s no need for ACrAP!
Sebastian | Crawler Directives, robots.txt, Google
I guess you can be sure you’ve caught the flu when you save a blog post without checking “published”.
Yep, this is an upgrade to the robots protocol which has been needed for a long time!
[Q, translated from German: “What kind of a shitty blog is this? But what can you expect from a pissy (“pissige”) spammer?” Source]
[A: “You can expect that this pissige blog’s editor will delete your spammy link drop, assclown.”]
Hi, I am a new reader of this blog, and after spending some time here I am totally amazed that someone is still so generous sharing worthy information on the web. I do not have the words to express my gratitude for your efforts in writing such information-packed posts.
My Google Reader is full of various blogs, but yours is really at the top of the list now.
Maybe someday I will be able to return the favor! Can you just answer one stupid question of mine? Why are you writing such quality stuff and sharing it for free?
[Thanks. Because the world deserves my pamphlets.]
DoubleClick would also benefit from your suggestion.
There are 200k 301s indexed here:
http://www.google.com/search?q=site:g.doubleclick.net&num=100&filter=0
Their robots.txt file (http://www.doubleclick.net/robot.txt) needs to be updated to contain:
User-agent: *
Disallow: /pagead
Disallow: /~at/
Disallow: /~a/
Disallow: /~ah/
Disallow: /~atfef/
Disallow: /gampad/ads
Noindex: /pagead
Noindex: /~at/
Noindex: /~a/
Noindex: /~ah/
Noindex: /~atfef/
Noindex: /gampad/ads
Well… I’m sure we can all agree that it’s a step in the right direction, right!?