Shit happens, your redirects hit the fan!

Although robust search engine crawlers are rather fault-tolerant creatures, there is an often overlooked but quite safe procedure to piss off the spiders. Playing redirect ping pong mostly results in unindexed contents. Google reports chained redirects under the initially requested URL as URLs not followed due to redirect errors, and recommends:

Minimize the number of redirects needed to follow a link from one page to another.

The same goes for other search engines: they can’t handle longish chains of redirecting URLs. In other words: all search engines consider URLs involved in longish redirect chains unreliable, not trustworthy, low quality …

What’s that to you? Well, you might play redirect ping pong with search engine crawlers unknowingly. If you’ve ever redesigned a site, chances are you’ve built chained redirects. In most cases those chains aren’t too complex, but it’s worth checking. Bear in mind that Apache, .htaccess, scripts or CMS software and whatnot can perform redirects, often without notice and undetectable with a browser.

I made up this example, but I’ve seen worse redirect chains. Here is the transcript of Ms. Googlebot’s chat with your Web server:

Googlebot: Now that’s a nice link I’ve discovered on this old and trusted page. I can’t wait to fetch it. Hey port 80 at yourstuff.com, would you please be so kind to serve me /some-page?

.htaccess: Oh silly Googlebot, don’t you read Matt’s blog? He told me that a 301 redirect is the canonical answer when someone requests my stuff without the www-prefix. I didn’t bother to look up the resource you’ve asked for, and why should I since your request is wrong, wrong, wrong! Here is the canonical URL: 301-Moved permanently, Location: http://www.yourstuff.com/some-page.

Googlebot: Dear Web server, of course that’s the best thing to do. I apologize for my silly request, but please note that I’m following a link where the Webmaster of the authority site forgot the dup-dup-dup stuff in the HREF attribute’s value. Can you please serve me /some-page now?

Apache: Doh! Why does this silly bot bother me with an incomplete URI? Since /some-page is not an existing file, probably it’s a directory and a brain dead Webmaster forgot the trailing slash. Let’s see … yep, that’s it, and it even has a default document index.php. Dear Ms. Googlebot, here is the correct URL: 302-Found elsewhere, Location: http://www.yourstuff.com/some-page/index.php.

[Most servers will respond with 301-Moved permanently, Location: http://www.yourstuff.com/some-page/, but however the server is configured, it will be a redirect.]

Googlebot: Oh I see, I’ve got a really malformed URI. Sorry for bothering you again, could you please deliver the content of http://www.yourstuff.com/some-page/index.php? Thanks in advance!

/some-page/index.php: Uuaahhh … who’s knocking? I slept sooo long … not fully awake … please be patient while Apache hands me over to the PHP parser. Actually, I’m totally outdated because my Webmaster discovered some blogging software and moved my contents to a WordPress database. Please wait. Where is the darn post-it with the new address? Ahh, just found it under my pillow. Please don’t disturb me again, bye bye, and here you go: 301-Moved permanently, Location: http://www.yourstuff.com/blog/some-post/.

Googlebot: You’re kidding! Why didn’t you give me this URI in the first place? Never mind, I want http://www.yourstuff.com/blog/some-post/ now.

Hack in /blog/index.php: Oh my god, another bot following outdated links. My Webmaster is clever, he read somewhere that category names in URLs boost search engine rankings and changed the permalink structure to category/postname. Wait a moment, I’m looking up the category for this post, be right back. Hmmm, this category name is not exactly a money term … the heck with it! Here is the keyword stuffed URL I just made for you: 301-Moved permanently, Location: http://www.yourstuff.com/blog/cat/some-post/.

Googlebot: Four crappy redirects in a row, what a bunch of bull! This is my ultimately last try: gimme http://www.yourstuff.com/blog/cat/some-post/!

WordPress: See, that didn’t hurt, did it? Here you are: 200-OK, the content

Googlebot: Here’s your shabby page. Don’t send me to these assclowns again! I’ve better things to do than running five times to this misconfigured server only to fetch a keyword laden sales pitch with 34 grammar errors, 67 typos, and a ton of affiliate links. Grrrr!

Crawl scheduler: Sorry dear. I can’t blacklist them for stuff like that, but I’ve set the crawl frequency for yourstuff.com to once a year, and I’ve red-flagged the document ID so that the indexer can downrank it accordingly.

Do you really want to treat Ms. Googlebot so badly? Not to mention the minus points you earn for playing redirect ping pong with a search engine. Maybe most search engines index a page served after four redirects, but I wouldn’t rely on such a redirect chain. It’s quite easy to shorten it. Just delete outdated stuff so that all requests for it run into a 404-Not found, then write up a list in a format like

Old URI 1 Delimiter New URI 1 \n
Old URI 2 Delimiter New URI 2 \n
  … Delimiter   … \n

and write a simple redirect script which reads this file and performs a 301 redirect to New URI when REQUEST_URI == Old URI. If REQUEST_URI doesn’t match any entry, then send a 404 header and include your actual error page. If you need to change the final URLs later on, you can easily do that in the text file’s right column with search and replace.

Next, point the ErrorDocument 404 directive in your root’s .htaccess file to this script. Done. Not counting possible www/non-www canonicalization redirects, you’ve cut the number of redirects to one, regardless of how often you’ve moved your pages. Don’t forget to add all outdated URLs to the list when you redesign your stuff again, and cover common 3rd party sins like truncating trailing slashes too. The flat file from the example above would look like:

/some-page Delimiter /blog/cat/some-post/ \n
/some-page/ Delimiter /blog/cat/some-post/ \n
/some-page/index.php Delimiter /blog/cat/some-post/ \n
/blog/some-post Delimiter /blog/cat/some-post/ \n
/blog/some-post/ Delimiter /blog/cat/some-post/ \n
  … Delimiter   … \n
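
A minimal sketch of such a lookup-based redirect script, written in Python for illustration (on a typical Apache box it would more likely be a PHP script behind the ErrorDocument directive). The tab delimiter, the BASE host and the function names load_redirect_map and resolve are assumptions made for this example, not part of any library:

```python
# Sketch only: DELIMITER, BASE and the function names are assumptions.
DELIMITER = "\t"
BASE = "http://www.yourstuff.com"  # canonical host taken from the example above

# The flat file from the example, with a tab as the delimiter.
EXAMPLE_LINES = [
    "/some-page\t/blog/cat/some-post/\n",
    "/some-page/\t/blog/cat/some-post/\n",
    "/some-page/index.php\t/blog/cat/some-post/\n",
    "/blog/some-post\t/blog/cat/some-post/\n",
    "/blog/some-post/\t/blog/cat/some-post/\n",
]


def load_redirect_map(lines):
    """Parse 'old URI <delimiter> new URI' lines into a dict."""
    mapping = {}
    for line in lines:
        if DELIMITER not in line:
            continue  # skip blank or malformed lines
        old, new = line.split(DELIMITER, 1)
        mapping[old.strip()] = new.strip()
    return mapping


def resolve(request_uri, mapping):
    """Return (status, headers) for a request that ended up in the 404 handler."""
    target = mapping.get(request_uri)
    if target is None:
        # No entry: the caller sends the 404 header and includes the error page.
        return 404, {}
    # One single hop to the final URL, no matter how often the page moved.
    return 301, {"Location": BASE + target}
```

Calling resolve("/some-page/index.php", load_redirect_map(EXAMPLE_LINES)) yields a 301 straight to the final WordPress URL, while an unknown path falls through to a plain 404.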

With a large site, consider a database table instead, because processing a huge flat file on every 404 error can get slow. Also, if you’ve patterns like /blog/post-name/ ==> /blog/cat/post-name/ then don’t generate and process longish mapping tables but cover these redirects algorithmically.
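
Covering a pattern like /blog/post-name/ ==> /blog/cat/post-name/ algorithmically could look roughly like the Python sketch below. The POST_CATEGORY lookup is a hypothetical stand-in for whatever your blog software knows about its posts, and the regular expression assumes the old permalinks had exactly one path segment after /blog/:

```python
import re

# Hypothetical lookup from post slug to category; a real site would ask its database.
POST_CATEGORY = {"some-post": "cat"}

# Old permalink structure: /blog/<post-name>, with or without the trailing slash.
OLD_PERMALINK = re.compile(r"^/blog/([^/]+)/?$")


def rewrite_permalink(path):
    """Return the new path for an old-style permalink, or None if it doesn't apply."""
    match = OLD_PERMALINK.match(path)
    if match is None:
        return None  # not an old-style permalink (e.g. already /blog/cat/post/)
    slug = match.group(1)
    category = POST_CATEGORY.get(slug)
    if category is None:
        return None  # unknown post: let the 404 handling deal with it
    return "/blog/" + category + "/" + slug + "/"
```

One rule like this replaces a mapping table line per post, and it handles the truncated trailing slash for free.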

To gather URLs worth a 301 redirect use these sources:

  • Your server logs.
  • 404/301/302/… reports from your server stats.
  • Google’s Web crawl error reports.
  • Tools like XENU’s Link Sleuth which crawl your site and output broken links as well as all sorts of redirects, and can even check your complete Web space for orphans.
  • Sitemaps of outdated structures/site areas.
  • Server header checkers which follow all redirects to the final destination.
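
If you’d rather roll your own header checker, here is a minimal Python sketch that follows a chain hop by hop and records every status code. The function names head and trace_redirects, the ten-hop limit and the FAKE_SERVER table (which simply replays the made-up chain from above so the example runs without a network) are assumptions for illustration:

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head(url):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=head, max_hops=10):
    """Follow redirects one hop at a time; return a list of (url, status)."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break  # final destination (or an error) reached
        url = urljoin(url, location)  # also copes with relative Location values
    return chain


# The made-up chain from this post, replayed without touching the network.
FAKE_SERVER = {
    "http://yourstuff.com/some-page": (301, "http://www.yourstuff.com/some-page"),
    "http://www.yourstuff.com/some-page": (302, "http://www.yourstuff.com/some-page/index.php"),
    "http://www.yourstuff.com/some-page/index.php": (301, "http://www.yourstuff.com/blog/some-post/"),
    "http://www.yourstuff.com/blog/some-post/": (301, "http://www.yourstuff.com/blog/cat/some-post/"),
    "http://www.yourstuff.com/blog/cat/some-post/": (200, None),
}

if __name__ == "__main__":
    for hop_url, hop_status in trace_redirects(
        "http://yourstuff.com/some-page", fetch=lambda u: FAKE_SERVER[u]
    ):
        print(hop_status, hop_url)
```

Drop the fetch argument to check a real URL. Any chain with more than two entries is a candidate for the mapping list.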

Disclaimer: If you suffer from IIS/ASP, free hosts, restrictive hosts like Yahoo or other serious maladies, this post is not for you.

I’m curious, did your site ever play redirect ping pong with search engine crawlers?




19 Comments to "Shit happens, your redirects hit the fan!"

  1. Andy Beard on 26 September, 2007  #link

    WP2.3 has a whole load more in-built redirects that are supposed to handle a lot of scenarios, but I need to do some testing.

  2. g1smd on 26 September, 2007  #link

    Good post to explain the problem. How about some extra solutions in a later post? That part is a bit lightweight.

    I’m not too impressed with using a 404 error page to then send a 301 redirect back to the browser. I never like mixing things up like that.

    I much prefer to fix the problems directly in the .htaccess or httpd.conf file of the Apache server and always use a single-step solution that takes a source URL and redirects it in one move to the correct URL.

    Several lines of code can fix a very large amount of problems.

  3. Sebastian on 26 September, 2007  #link

    g1smd, will do. In some cases you can’t fix it with .htaccess or httpd.conf, especially when the redirects happen on the application side.

  4. Carsten Cumbrowski on 27 September, 2007  #link

    If you suffer from IIS/ASP maladies, this post is not for you. :)

    I know exactly what you are talking about, and dealing with a site with over 100,000 pages indexed, 2 site redesigns and a seven year history of all kinds of custom stuff etc. is a royal pain in the a**. A DB driven solution was out of the question (too much overhead). I looked at logs of the past 6 months (URLs that generated traffic, spider or not), pages indexed in the SE, any custom tracking logs and then the source code of the site itself. The result was a program that would analyze the URL and determine if a redirect is necessary.

    All rules would be checked first before the redirect. The order in which the rules are checked was important, starting with the HOST (multiple domains for the same site), canonical issues, trailing slashes, directories, scripts and then URL parameters. Each rule looks at the URL as left by the previous rule, not the one from the original request. At the end came the check whether a redirect is necessary (yes/no), and if it was yes, it would redirect to the final version of the URL.

  5. Sebastian on 27 September, 2007  #link

    Servus Carsten :)
    That’s a great post on IIS/ASP redirects, and the project you’ve described sounds like lots of fun. URL consolidation is a task where many SEM folks fail because they don’t dig deep enough.

  6. eaglehawk on 27 September, 2007  #link

    Do you have a suggestion for the redirect script?

  7. Sebastian on 27 September, 2007  #link

    Well, I’ve a generic script I use on multiple sites, but it needs a couple tweaks before I can share it. Sorry, I can’t promise that I’ll post it soon.

  8. Melanie Phung on 27 September, 2007  #link

    Heh heh. I’m sending this one over to all my developers.

  9. TheMadHat on 27 September, 2007  #link

    We moved a site about 6 months ago with 90k pages. We were doing chained redirects (around 3) and everything blew up. Sitemaps wouldn’t validate and pages were not being reindexed very quickly. We had to modify the backend to do away with any obvious redirects. A lot more difficult in IIS as Carsten mentioned. It quickly bounced back (quickly being a couple weeks) after we fixed it.

  10. Paul Pedersen on 27 September, 2007  #link

    Great post. I’ve seen a great deal of this over the years.

  11. Lorna on 6 October, 2007  #link

    This is a great post for Googlebot dum-dums such as myself. I’m checking my server logs immediately.

  12. […] whatever) when it runs into a redirect condition. Some redirects are done by the server itself (see handling incomplete URIs), and there are several places where you can set (conditional) redirect directives: Apache’s […]

  13. kristin on 20 December, 2007  #link

    Right now, I am redirecting 3 times for every page visit.
    1. Redirect to the Single Sign-On server to get a ticket.
    2. Redirect back to the page (service url) from the Single Sign-On server
    3. Self-redirect without the ticket parameter

    This of course is pissing off Google’s web crawler. Any solution to avoid this?

  14. Sebastian on 20 December, 2007  #link

    Not sure why a ticket benefits a crawler, actually I guess that’s worthless, so why not check for legit search engine crawlers and deliver those the contents without any redirects?

  15. Anon Coward on 30 January, 2008  #link

    Kristin - Here is (half) a .Net solution that I use (obviously very easy to write in php, ruby - anything).

    Do not automatically redirect anyone who hasn’t already got a session on your Single Sign-On Server.

    But how do I do this you ask!

    Ok, here we go…

    In your content page if an authenticated session for the user does not exist then you need to check the Single Sign-on server to see if there is one there, right?

    To do this output this to the bottom of your page using Response.Write or whatever:

    “”

    As you can see this links into a code page on your Single Sign-on Server, not a javascript file (you could set up isapi filters or whatever to make the .js extension map to the .aspx handler also).

    What does AmIAuthenticated.aspx return?

    This:

    —begin code

    ' No session on the Single Sign-On server: emit an empty script, nobody gets redirected.
    If (Session("username") Is Nothing) Then
        returnScript = ""
    Else
        If (Request.UrlReferrer Is Nothing) Then
            returnScript = "window.location = '" & ConfigurationSettings.AppSettings("SessionServerUrl").ToString() & "';"
        Else
            ' Pass the referring page along as redirectUrl.
            returnScript = "window.location = '" & ConfigurationSettings.AppSettings("SessionServerUrl").ToString() & "?redirectUrl=" & Server.UrlEncode(Request.UrlReferrer.AbsoluteUri) & "';"
        End If
    End If

    —end code

    So if there was a session it takes the user (using JavaScript window.location) to the Single Sign-On server to pick up their ticket!

    So you only redirect people if js / cookies are working, and if they have already logged into your Single Sign-On Server, not search engine crawlers!

    Ok, I think that should get you started…

    P.s. If you can’t get this going it is because of my bad explanation, the method works, I am using it…

  16. Anon Coward on 30 January, 2008  #link

    Is what you output using Response.Write

    Missed out that bit on my original post…

    (after ‘To do this output this to the bottom of your page using Response.Write or whatever:’

  17. Anon Coward on 31 January, 2008  #link

    ok looks like the blog engine is removing the link to the JS file. It is just a regular js link but with the src set to AmIAuthenticated.aspx on the authentication server

  18. […] Shit happens, your redirects hit the fan! […]

  19. Carter Cole on 6 August, 2010  #link

    I built a tool and API to enumerate redirect chains and show each status code along the way.

    Here is the redirect chain examiner along with the code to use API on your website
