The perfect robots.txt for News Corp

[Image: News Corp kicking out Google News]

I appreciate Google’s brand-new News user agent. It is, however, not a perfect solution, because it doesn’t distinguish between indexing and crawling.

Disallow is a crawler directive that simply tells web robots “do not fetch my content”. It doesn’t prevent content from being indexed. That means search engines can index content they’re not allowed to fetch from the source, and send free traffic to disallow’ed URIs. In the case of news, there are enough 3rd-party signals (links, anchor text, quotes, …) out there to create a neat title and snippet on the SERPs.
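
For illustration, here’s a hypothetical section blocked for all crawlers (made-up path, not part of Google’s suggested syntax):

User-agent: *
Disallow: /breaking-news/

A URI like /breaking-news/story.html won’t get fetched, but it can still show up on the SERPs with a title and snippet assembled entirely from those off-site signals.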

Fortunately, Google’s REP implementation allows news sites to refine the robots.txt syntax Google suggests: Google supports Noindex in robots.txt.

Below I’ve edited the robots.txt syntax suggested by Google (source).

Include pages in Google web search, but not in News:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-News
Disallow: /
Noindex: /

This robots.txt file says that no files are disallowed for Google’s general web crawler, called Googlebot, but the user agent “Googlebot-News” is blocked from all files on the website. The “Noindex” directive makes sure that Google News can’t index the forbidden stuff from 3rd-party signals either.

User-agent: Googlebot
Disallow: /
Noindex: /

User-agent: Googlebot-News
Disallow:

When parsing a robots.txt file, Google obeys the most specific user agent match. The first group of directives tells us that Googlebot (the user agent for Google’s web index) is blocked from crawling any pages from the site. The next group, which applies to the more specific user agent for Google News, overrides the Googlebot block and gives Google News permission to crawl pages from the website. The “Noindex” directive makes sure that Google Web Search can’t index the forbidden stuff from 3rd-party signals either.

Of course, other search engines might handle this differently, so it is obviously a good idea to add indexer directives at the page level, too. The most elegant way to do that is a noindex,noarchive,nosnippet X-Robots-Tag in the HTTP header, because images, videos, PDFs etc. can’t be stuffed with HTML’s META elements.
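
For example (a minimal sketch assuming an Apache server with mod_headers enabled; adjust the file extensions and directives to your own setup), a few lines of server configuration can attach those indexer directives to files that can’t carry META elements:

# Hypothetical Apache snippet, not a drop-in recommendation
<FilesMatch "\.(pdf|png|jpe?g|gif|mp4)$">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

A matching file’s HTTP response then carries X-Robots-Tag: noindex, noarchive, nosnippet, which compliant indexers treat like a robots META element.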

See how this works neatly with Web standards? There’s no need for ACrAP!




7 Comments to "The perfect robots.txt for News Corp"

  1. Sebastian on 7 December, 2009  #link

    I guess you can be sure you’ve caught the flu when you save a blog post without checking “published”.

  2. Lea on 7 December, 2009  #link

    Yep, this is an upgrade to the robots protocol which has been needed for a long time!

  3. […] Die perfekte robots.txt für Rupert Murdoch - na geht doch [The perfect robots.txt for Rupert Murdoch - well, there you go] December 8th, 2009 at 9:11 pm […]

  4. Franz Ferdinand on 9 December, 2009  #link

    was’n das für ein Scheiß-Blog? Aber was soll man von einem pissigen Spammer erwarten?

    [Q: “What kind of a shitty blog is this? But what can you expect from a PISSIGE spammer?” Source]

    [A: “You can expect that this pissige blog’s editor will delete your spammy link drop, assclown.”]

  5. moshin on 9 December, 2009  #link

    Hi, I am a new reader of this blog, and after spending some time here I am totally amazed that someone is still so generous in sharing worthy information on the web. I do not have the words to express my gratitude for your effort in writing such an information-packed post.

    Although my Google Reader is full of various blogs, yours is really at the top of the list now.
    Maybe someday I will be able to return the favor! Can you just answer one stupid question of mine? Why are you writing such quality stuff and sharing it for free? :)

    [Thanks. Because the world deserves my pamphlets.]

  6. Phil on 16 December, 2009  #link

    DoubleClick would also benefit from your suggestion.

    There are 200k 301 redirects indexed here:
    http://www.google.com/search?q=site:g.doubleclick.net&num=100&filter=0

    Their robots.txt file (http://www.doubleclick.net/robot.txt) needs to be updated to contain:

    User-agent: *
    Disallow: /pagead
    Disallow: /~at/
    Disallow: /~a/
    Disallow: /~ah/
    Disallow: /~atfef/
    Disallow: /gampad/ads
    Noindex: /pagead
    Noindex: /~at/
    Noindex: /~a/
    Noindex: /~ah/
    Noindex: /~atfef/
    Noindex: /gampad/ads

  7. Anders Holm - Poweriser Denmark on 18 January, 2010  #link

    Well… I’m sure we can all agree that it’s a step in the right direction, right?!

Leave a reply


[If you don't do the math, or the answer is wrong, you'd better have saved your comment before hitting submit. Here is why.]

Be nice and feel free to link out when a link adds value to your comment. More in my comment policy.