Debugging robots.txt with Google Webmaster Tools
Although Google’s Webmaster Console is a really neat toolkit, it can mislead the not-that-savvy crowd every once in a while.
When you go to “Diagnostics::Crawl Errors::Restricted by robots.txt” and find URIs that aren’t disallow’ed or even noindex’ed in your very own robots.txt, calm down.
Google’s cool robots.txt validator ignores redirects and approves your redirecting URIs, driving you nuts until you check each URI’s HTTP response code for redirects (HTTP response codes 301, 302 and 307, as well as undelayed meta refreshes).
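A quick way to spot those redirects is to request each flagged URI without following redirects and look at the status code and Location header. Here’s a minimal sketch in Python (the URI is a placeholder; substitute one that Webmaster Tools flagged for you):

```python
import requests

# Placeholder URI: substitute one reported as "restricted by robots.txt".
uri = "http://example.com/flagged-page"

# Request the URI itself, without following redirects.
resp = requests.head(uri, allow_redirects=False)
print(resp.status_code, resp.headers.get("Location"))

# A 301, 302 or 307 with a Location header means the validator judged
# this URI, but Googlebot ended up somewhere else entirely. A meta
# refresh only shows up in the HTML body, so for suspicious 200s fetch
# the page with GET and look for '<meta http-equiv="refresh" ...>' too.
```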
Google obeys robots.txt even in a chain of redirects. If a URI given in an HTTP Location header is disallow’ed or noindex’ed for Google’s user agent(s), Googlebot doesn’t fetch it, regardless of its position in the current chain of redirects. Even a robots.txt block in the 5th hop stops the greedy Web robot. Those URIs are correctly reported back as “restricted by robots.txt”; Google just refuses to tell you that the blocking crawler directive originates on a foreign server.
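To see which hop actually carries the block, you can walk the redirect chain yourself and test every URI against its own host’s robots.txt with Googlebot’s user agent. A rough sketch under those assumptions (example.com is a placeholder, and Python’s robotparser only understands standard Disallow lines, not Google’s noindex extension):

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
import requests

def trace_chain(uri, user_agent="Googlebot", max_hops=10):
    """Follow a redirect chain hop by hop, testing each URI
    against its own host's robots.txt for the given user agent."""
    for hop in range(max_hops):
        parts = urlparse(uri)
        # Fetch and parse robots.txt on the current hop's host.
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        if not rp.can_fetch(user_agent, uri):
            print(f"hop {hop}: {uri} is disallowed by {parts.netloc}/robots.txt")
            return
        # Request the URI without following the redirect ourselves.
        resp = requests.head(uri, allow_redirects=False)
        location = resp.headers.get("Location")
        if resp.status_code in (301, 302, 307) and location:
            print(f"hop {hop}: {uri} -> {resp.status_code} -> {location}")
            uri = urljoin(uri, location)  # resolve relative Location values
        else:
            print(f"hop {hop}: {uri} answers {resp.status_code}, chain ends")
            return

trace_chain("http://example.com/redirecting-page")
```

Run against a flagged URI, this prints each hop and tells you exactly whose robots.txt stopped the crawl, which is the piece of information Webmaster Tools keeps to itself.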