2009 August

Monthly archive: August, 2009

Debugging robots.txt with Google Webmaster Tools

Posted on 18 August, 2009

Although Google’s Webmaster Console is a really neat toolkit, it can mislead the not-that-savvy crowd every once in a while.

When you go to “Diagnostics::Crawl Errors::Restricted by robots.txt” and you find URIs that aren’t disallow’ed or even noindex’ed in your very own robots.txt, calm down.

Google’s cool robots.txt validator withholds its knowledge of redirects and approves your redirecting URIs, driving you nuts until you check each URI’s HTTP response for redirects (HTTP response codes 301, 302 and 307, as well as undelayed meta refreshes).

Google obeys robots.txt even in a chain of redirects. If a URI given in an HTTP response’s Location header is disallow’ed or noindex’ed for Google’s user agent(s), Googlebot doesn’t fetch it, regardless of its position in the current chain of redirects. Even a robots.txt block at the 5th hop stops the greedy Web robot. Those URIs are correctly reported back as “restricted by robots.txt”; Google just refuses to tell you that the blocking crawler directive originates from a foreign server.
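To track down which hop is the culprit, you can walk the redirect chain yourself and ask each host’s robots.txt whether Googlebot may fetch the URI. What follows is only a rough sketch of that idea, not Google’s actual logic: `find_blocking_hop`, `fetch_head` and `fetch_robots` are hypothetical names, the two fetchers are callables you would have to wire up to a real HTTP client, and the `*.example` hosts are made up. It handles Disallow rules and 301/302/307 redirects only; noindex lines in robots.txt and meta refreshes are not covered.

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

REDIRECT_CODES = {301, 302, 307}


def find_blocking_hop(start_url, fetch_head, fetch_robots,
                      user_agent="Googlebot", max_hops=10):
    """Return (hop_number, url) of the first disallowed hop, else None.

    Hop 0 is the start URL itself. Both fetchers are hypothetical helpers:
      fetch_head(url)      -> (status_code, location_header_or_None)
      fetch_robots(origin) -> robots.txt body as text ("" if there is none)
    """
    parsers = {}  # one parsed robots.txt per scheme://host
    url = start_url
    for hop in range(max_hops + 1):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in parsers:
            parser = RobotFileParser()
            parser.parse(fetch_robots(origin).splitlines())
            parsers[origin] = parser
        if not parsers[origin].can_fetch(user_agent, url):
            return hop, url  # this hop's robots.txt blocks the crawler
        status, location = fetch_head(url)
        if status not in REDIRECT_CODES or not location:
            return None  # chain ended without hitting a block
        url = urljoin(url, location)  # Location may be relative
    return None  # gave up after max_hops redirects


if __name__ == "__main__":
    # Made-up three-host chain; the block sits on the foreign third server.
    pages = {
        "http://a.example/old": (301, "http://b.example/mid"),
        "http://b.example/mid": (302, "http://c.example/private/x"),
        "http://c.example/private/x": (200, None),
    }
    robots = {
        "http://a.example": "",
        "http://b.example": "",
        "http://c.example": "User-agent: Googlebot\nDisallow: /private/",
    }
    print(find_blocking_hop("http://a.example/old", pages.get, robots.get))
```

Passing the fetchers in as arguments keeps the sketch testable without network access; in real use you would replace the two dictionaries with HEAD requests that do not follow redirects automatically.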

2 comments Sebastian | Crawler Directives, Redirects, Webmaster Central, robots.txt, SEO, Google

Getting new stuff crawled in real-time

Posted on 18 August, 2009

Thanks to PubSubHubbub we’ve got another method to invite Googlebot.

Just add a link to FriendFeed, for example by twittering or bookmarking it. Once the URI hits your FriendFeed account, FriendFeed shares it with Googlebot, and both of them request it before the tweet appears on your timeline:

Date/Time	Request URI	IP	User agent
2009-08-17 15:47:19	Your URI	38.99.68.206 / 38.99.68.206	Mozilla/5.0 (compatible; FriendFeedBot/0.1; +http://friendfeed.com/about/bot)
2009-08-17 15:47:19	Your robots.txt	66.249.71.141 / crawl-66-249-71-141.googlebot.com	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2009-08-17 15:47:19	Your URI	66.249.71.141 / crawl-66-249-71-141.googlebot.com	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Neat. Of course there are other methods to invite search engine crawlers at lightning speed, but this one comes in handy because it can run on autopilot […]. Even when the actual fetch is meant to feed Google Reader or something, you’ve got your stuff into Google’s crawling cache, from where other services like the Web indexer can pick it up without requesting a crawl.
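If you want to check this against your own logs, a few lines of scripting will do. The sketch below assumes a tab-separated log with the same four columns as the excerpt above (date/time, request URI, IP / host name, user agent); `parse_hits`, `crawler_name` and `first_fetch_by_crawler` are made-up names, and your server’s log format will most likely differ. It trusts the user agent string, which can be spoofed, so treat the result as a hint rather than proof.

```python
from datetime import datetime
from typing import NamedTuple


class Hit(NamedTuple):
    when: datetime
    uri: str
    host: str
    agent: str


def parse_hits(lines):
    """Parse tab-separated rows: date/time, URI, 'IP / host', user agent."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # not a log row in the assumed format
        stamp, uri, ip_host, agent = fields
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # header row or malformed timestamp
        host = ip_host.split(" / ")[-1]  # keep the reverse-DNS part
        hits.append(Hit(when, uri, host, agent))
    return hits


def crawler_name(agent):
    """Name the crawler by user agent substring, or return None."""
    for name in ("Googlebot", "FriendFeedBot"):
        if name in agent:
            return name
    return None


def first_fetch_by_crawler(hits, uri):
    """Map crawler name -> time of its earliest request for the given URI."""
    first = {}
    for hit in sorted(hits, key=lambda h: h.when):
        name = crawler_name(hit.agent)
        if hit.uri == uri and name and name not in first:
            first[name] = hit.when
    return first
```

Feed it the lines of your log file and the URI you just posted; if both crawlers show up within the same second or so, the hand-over worked.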

14 comments Sebastian | SEO, Google

Sebastian’s Pamphlets
