Why proper error handling is important

Misconfigured servers can prevent search engines from crawling and indexing. I admit that’s news of yesterday. However, standard setups and code copied from low quality resources are underestimated –but very popular– points of failure. According to Google a missing robots.txt file in combination with amateurish error handling can result in invisibility on Google’s SERPs. That’s a very common setup by the way.

Googler Jonathon Simon said:

This way [correct setup] when the Google crawler or other search engine checks for a robots.txt file, they get a 200 response if the file is found and a 404 response if it is not found. If they get a 200 response for both cases then it is ambiguous if your site has blocked search engines or not, reducing the likelihood your site will be fully crawled and indexed.

That’s a very carefully written warning, so I try to rephrase the message between the lines:

If you have no robots.txt and your server responds “Ok” (or 302 on a request of robots.txt followed by a 200 response on request of the error page) when Googlebot tries to fetch it, Googlebot might not be willing to crawl your stuff further, hence your pages will not make it in Google’s search index.

If you don’t suffer from IIS (Windows hosting is a horrible nightmare coming with more pitfalls than countable objects in the universe: go find a reliable host) here is a bullet-proof setup.

If you don’t have a robots.txt file yet, create one and upload it today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file-types or areas of your site, refer to the robots.txt manual.

Next look at the .htaccess file in your server’s Web root directory. If your FTP client doesn’t show it, add “-a” to “external mask” in the settings and reconnect. If you find complete URLs in lines starting with “ErrorDocument”, your error handling is screwed up. What happens is that your server does a soft redirect to the given URL, which probably responds with “200-Ok”, and the actual error code gets lost in cyberspace. Sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM-XT hosting Google.com, all other error directives pointing to absolute URLs result in crap. Here is a well formed .htaccess sample:

ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
Options -Indexes
<Files “.ht*”>
deny from all
</Files>
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]

With “ErrorDocument” directives you can capture other clumsiness as well, for example 500 errors with /server-too-buzzy.html or so. Or make the error handling comfortable using /error.php?errno=[insert err#]. In any case avoid relative URLs (src attribute in IMG elements, CSS/feed links, href attributes of A elements …) on all landing pages. You can test actual HTTP response codes with online header checkers.

The other statements above do different neat things. Options -Indexes disallows directory browsing, the next block makes sure that nobody can read your server directives, and the last three lines redirect invalid server names to your canonical server address.

.htaccess is a plain ASCII file, it can get screwed when you upload it in binary mode or when you change it with a word processor. Best edit it with an ASCII/ANSI editor (vi, notepad) as htaccess.txt on your local machine (most FTP clients choose ASCII mode for text files) and rename it to “.htaccess” on the server. Keep in mind that file names are case sensitive.

Tags: ()



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

9 Comments to "Why proper error handling is important"

  1. Edd on 2 March, 2007  #link

    Any article related to the .htaccess is always a mess, except from this one, really straight.

    nice!

  2. Sebastian on 3 March, 2007  #link

    Thanks :)

  3. Mr Apache on 4 March, 2007  #link

    this article that shows every single Apache Status Code and the actual headers and src returned from that error! Force Apache to output any HTTP Status Code with ErrorDocument

  4. Sebastian on 4 March, 2007  #link
  5. […] Reprinted with permission […]

  6. […] to your theme’s error page. If you don’t blog in the root, learn here how you should handle HTTP errors outside the /blog/ directory. Load 404.php in an ASCII editor to check whether it will actually send a 404 response. If the very […]

  7. Riccardo Giuntoli on 21 November, 2007  #link

    Thank you for this good guide.

    There’s a nice tool to build .htaccess file directly on web with a simple form:

    www . htaccesseditor . com /en.shtml [Delinked, the first .htaccess code generation I’ve tried produces utterly bullshit. If you use the code for www vs. non-www canonicalization from this page you create an exploit enabling negative SEO.]

    Best regards, Riccardo Giuntoli.

  8. […] I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root’s .htaccess file: ErrorDocument 401 /get-the-fuck-outta-here.asp ErrorDocument 403 /get-the-fudge-outta-here.asp ErrorDocument 404 /404handler.php ErrorDocument 410 /410-gone-forever.asp ErrorDocument 503 /410-down-for-maintenance.asp # … Options -Indexes Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script: ErrorDocument 401 /error.php?errno=401 ErrorDocument 403 /error.php?errno=403 ErrorDocument 404 /404handler.php ErrorDocument 410 /error.php?errno=410 ErrorDocument 503 /error.php?errno=503 # … Options -Indexes Note that relative URLs in pages or scripts called by ErrorDocument directives don’t work. Don’t use absolute URLs in ErrorDocument directives itself, because this way you get 302 response codes for 404 errors and crap like that. If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that’s bad enough.) For more information please read my pamphlet Why error handling is important. […]

  9. […] Why proper error handling is important […]

Leave a reply


[If you don't do the math, or the answer is wrong, you'd better have saved your comment before hitting submit. Here is why.]

Be nice and feel free to link out when a link adds value to your comment. More in my comment policy.