Why proper error handling is important

Misconfigured servers can prevent search engines from crawling and indexing. I admit that’s old news. However, standard setups and code copied from low-quality resources are underestimated – but very popular – points of failure. According to Google, a missing robots.txt file in combination with amateurish error handling can result in invisibility on Google’s SERPs. And that’s a very common setup, by the way.

Googler Jonathon Simon said:

This way [correct setup] when the Google crawler or other search engine checks for a robots.txt file, they get a 200 response if the file is found and a 404 response if it is not found. If they get a 200 response for both cases then it is ambiguous if your site has blocked search engines or not, reducing the likelihood your site will be fully crawled and indexed.

That’s a very carefully written warning, so let me rephrase the message between the lines:

If you have no robots.txt and your server responds with “200 OK” (or with a 302 on the robots.txt request followed by a 200 response on the request of the error page) when Googlebot tries to fetch it, Googlebot might not be willing to crawl your stuff further, hence your pages will not make it into Google’s search index.

If you don’t suffer from IIS (Windows hosting is a horrible nightmare that comes with more pitfalls than there are countable objects in the universe: go find a reliable host), here is a bullet-proof setup.

If you don’t have a robots.txt file yet, create one and upload it today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file-types or areas of your site, refer to the robots.txt manual.
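
For example, a robots.txt that lets crawlers index everything except a couple of directories could look like this – a hedged sketch, where /admin/ and /cgi-bin/ are just placeholder paths you’d adapt to your own site:

User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/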

Next, look at the .htaccess file in your server’s Web root directory. If your FTP client doesn’t show it, add “-a” to the “external mask” in the settings and reconnect. If you find complete URLs in lines starting with “ErrorDocument”, your error handling is screwed up. What happens is that your server does a soft redirect to the given URL, which probably responds with “200 OK”, and the actual error code gets lost in cyberspace. Sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM XT hosting Google.com, and all other error directives pointing to absolute URLs result in crap. Here is a well-formed .htaccess sample:

ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
Options -Indexes
<Files ".ht*">
deny from all
</Files>
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]
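
For contrast, here is the kind of broken setup described above, with a placeholder domain (don’t copy it): the fully qualified URL makes Apache send a redirect to the error page, which then answers with “200 OK”, so the crawler never sees the real 404.

# Broken: the absolute URL triggers a soft redirect and the 404 status gets lost
ErrorDocument 404 http://www.example.com/404-not-found.html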

With “ErrorDocument” directives you can capture other clumsiness as well, for example 500 errors with /server-too-busy.html or the like. Or make the error handling more comfortable by routing everything through a single script like /error.php?errno=[insert err#]. In any case, avoid relative URLs (src attributes of IMG elements, CSS/feed links, href attributes of A elements …) on all error pages, because the error page is served under the originally requested URL, so relative paths resolve against the wrong location. You can test the actual HTTP response codes with an online header checker.
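
A minimal sketch of that single-script approach, assuming a hypothetical /error.php in the Web root that reads the errno parameter and sends the matching status code and message:

# route all error handling through one script (hypothetical /error.php)
ErrorDocument 401 /error.php?errno=401
ErrorDocument 403 /error.php?errno=403
ErrorDocument 404 /error.php?errno=404
ErrorDocument 410 /error.php?errno=410
ErrorDocument 500 /error.php?errno=500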

The other statements in the .htaccess sample above do different neat things. Options -Indexes disallows directory browsing, the <Files> block makes sure that nobody can read your server directives, and the last three lines redirect invalid server names to your canonical server address.

.htaccess is a plain ASCII file; it can get screwed up when you upload it in binary mode or edit it with a word processor. It’s best to edit it as htaccess.txt on your local machine with an ASCII/ANSI editor (vi, Notepad) – most FTP clients choose ASCII mode for text files automatically – and rename it to “.htaccess” on the server. Keep in mind that file names are case sensitive.


14 Comments to "Why proper error handling is important"

  1. Edd on 2 March, 2007  #link

    Any article related to .htaccess is usually a mess, except this one – really straightforward.

    nice!

  2. Sebastian on 3 March, 2007  #link

    Thanks :)

  3. Mr Apache on 4 March, 2007  #link

    Check out this article that shows every single Apache status code and the actual headers and source returned for that error: Force Apache to output any HTTP Status Code with ErrorDocument

  4. Sebastian on 4 March, 2007  #link
  5. […] Reprinted with permission […]

  6. […] to your theme’s error page. If you don’t blog in the root, learn here how you should handle HTTP errors outside the /blog/ directory. Load 404.php in an ASCII editor to check whether it will actually send a 404 response. If the very […]

  7. Riccardo Giuntoli on 21 November, 2007  #link

    Thank you for this good guide.

    There’s a nice tool to build .htaccess file directly on web with a simple form:

    www . htaccesseditor . com /en.shtml [Delinked: the first .htaccess code generation I tried produced utter bullshit. If you use the code for www vs. non-www canonicalization from this page, you create an exploit enabling negative SEO.]

    Best regards, Riccardo Giuntoli.

  8. […] I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root’s .htaccess file: ErrorDocument 401 /get-the-fuck-outta-here.asp ErrorDocument 403 /get-the-fudge-outta-here.asp ErrorDocument 404 /404handler.php ErrorDocument 410 /410-gone-forever.asp ErrorDocument 503 /410-down-for-maintenance.asp # … Options -Indexes Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script: ErrorDocument 401 /error.php?errno=401 ErrorDocument 403 /error.php?errno=403 ErrorDocument 404 /404handler.php ErrorDocument 410 /error.php?errno=410 ErrorDocument 503 /error.php?errno=503 # … Options -Indexes Note that relative URLs in pages or scripts called by ErrorDocument directives don’t work. Don’t use absolute URLs in ErrorDocument directives itself, because this way you get 302 response codes for 404 errors and crap like that. If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that’s bad enough.) For more information please read my pamphlet Why error handling is important. […]

  9. […] Why proper error handling is important […]

  10. g1smd on 18 March, 2010  #link

    I was surprised to see your “Proper Error Handling” code has several points of failure.

    This line in particular:

    RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]

    It fails to redirect www with port number or trailing period; also fails for wrong casing.

    It also creates an infinite loop for HTTP/1.0 requests, because the Host header is blank for those requests.

    Simple solution, replace [NC] with $, and add ( ) and ?:

    RewriteCond %{HTTP_HOST} !^(www\.canonical-server-name\.com)?$

  11. Sebastian on 19 March, 2010  #link

    g1smd, I disagree, and here’s why:

    Server names aren’t case sensitive. wWw.ExamplE.COM == www.example.com. Since URIs with just uppercase/lowercase differences in the server name part aren’t different at all, there’s no point in redirecting for cosmetic reasons. All uppercase server names may look pretty ugly, but technically that’s not an issue. Also, avoiding redirects when possible is a good idea.

    I consider requests with a port number faulty, or even abusive, hence I redirect them. That works both for :80 (http) and :443 (https). We don’t deal with crappy AOL proxies from the last century any more.

    As for HTTP/1.0 requests, those are rare nowadays, and mostly faked, that is they tell HTTP/1.0 but provide an HTTP/1.1 host header. True HTTP/1.0 requests are dead, those ancient user agents can’t even handle virtual hosts (many domains sharing one IP). So when there’s no point for anybody in performing HTTP/1.0 requests at all, why should I still support them?

    Agree?

  12. g1smd on 19 March, 2010  #link

    No. Your original rule WILL correctly redirect both example.com:80 (port number) and example.com. (trailing period) requests, but it WILL NOT redirect non-canonical www.example.com:80 and www.example.com. requests as originally coded.

    In the interests of fixing all non-canonical incoming links, the proposed replacement does fix ALL of those problems. I prefer to redirect all non-canonical requests (incorrect protocol, sub-domain, domain, TLD, trailing period, port number, or combination of those) to the canonical form.

    As for HTTP/1.0, it is so easy to add support for that, that it seems pointless to not do so. I know that most so-called HTTP/1.0 requests DO include a valid host header, but it only takes one request to arrive without and you potentially have a self-DOS problem on your hands.

  13. Sebastian on 19 March, 2010  #link

    At the risk of driving you – and JdMorgan from WMW as well – crazy: can we agree on this?

    RewriteCond %{HTTP_HOST} !^(www\.canonical-server-name\.com)?$ [NC]

  14. g1smd on 20 March, 2010  #link

    I asked for a second opinion since you’re so set on this. :) Here’s a summary of the reply I received.

    A badly-written (or malicious) script intentionally or otherwise using HTTP/1.0 to send requests received by an IP-based server (i.e. one having a unique IP address which therefore *can* be reached without an HTTP “Host” header) will lead to an infinite redirection loop. Since this violates his own stated preference “to minimize redirects,” it seems a no-brainer to support/properly-handle HTTP/1.0 requests by excluding blank hostnames from being redirected.

    The same can be said for the uppercase domains: Sure, the *server* is case-insensitive, but do you want to rely on the “kindness of others” not to go and create a bunch of mis-cased links to your site? Google and the other majors may sort this out in their back-end processing, but what if they drop the ball? What about other search engines – are they that smart, too? I say, delete the [NC] and get on with it.

    On the Web, one cannot discount the fact that many access requests come from incompetent and/or malicious scripters. I want my server’s function and my page’s SERP rankings under *my* control, thank you very much. :)

    Please also note that the quotes around “.ht*” in the “<Files “.ht*”>” directive are “smart quotes” or “high-bit-set character-code quotes” and will either crater the server with a HTTP 500 Error, or just not work — and I haven’t got time to test which one right now.

    For myself, I’d follow the Apache recommendation here, and use

    <FilesMatch "^\.ht">
    Deny from all
    </FilesMatch>

    instead. :)

    That’s the problem with .htaccess. It’s very compact, very powerful server configuration code. There’s often many ways to apparently achieve the same thing, but some might have flaws that have catastrophic consequences for rankings and traffic even though the code ‘appears’ to work. In addition there are many other ways to try to do some jobs that are totally inappropriate. I look back at code I wrote several years ago, and now change it in several ways using the various things I have learned since then.
