Monthly archive: August, 2009

Debugging robots.txt with Google Webmaster Tools

Although Google’s Webmaster Console is a really neat toolkit, it can mislead the not-that-savvy crowd every once in a while.

When you go to “Diagnostics::Crawl Errors::Restricted by robots.txt” and you find URIs that aren’t disallow’ed or even noindex’ed in your very own robots.txt, calm down.

Google’s cool robots.txt validator withdraws its knowledge of redirects and approves your redirecting URIs, driving you nuts until you check each URI’s HTTP response code for redirects (HTTP response codes 301, 302 and 307, as well as undelayed meta refreshs).

Google obeys robots.txt even in a chain of redirects. If for Google’s user agent(s) an URI given in an HTTP header’s location is disallow’ed or noindex’ed, Googlebot doesn’t fetch it, regardless the position in the current chain of redirects. Even a robots.txt block in the 5th hop stops the greedy Web robot. Those URIs are correctly reported back as “restricted by robots.txt”, Google just refuses to tell you that the blocking crawler directive origins from a foreign server.



Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Getting new stuff crawled in real-time

Thanks to PubSubHubbub we’ve got another method to invite Googlebot.

Just add a link to FriendFeed, for example by twittering or bookmarking it. Once the URI hits your FriendFeed account, FriendFeed shares it with Googlebot, and both of them request it before the tweet appears on your timeline:

Date/Time Request URI IP User agent
2009–08–17 15:47:19 Your URI 38.99.68.206 / 38.99.68.206 Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)
2009-08-17 15:47:19 Your robots.txt 66.249.71.141 / crawl-66-249-71-141.googlebot.com Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2009-08-17 15:47:19 Your URI 66.249.71.141 / crawl-66-249-71-141.googlebot.com Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

 

Neat. Of course there are other methods to invite search engine crawlers in lightning speed, but this one comes handy because it can run on autopilot […]. Even when the actual fetch is meant to feed Google Reader or something, you’ve got your stuff into Google’s crawling cache from where other services like the Web indexer can pick it without requesting a crawl.



Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments