Although robust search engine crawlers are rather fault-tolerant creatures, there is an often overlooked but quite safe procedure to piss off the spiders. Playing redirect ping pong mostly results in unindexed contents. Google reports chained redirects under the initially requested URL as URLs not followed due to redirect errors, and recommends:
Minimize the number of redirects needed to follow a link from one page to another.
The same goes for other search engines, they can’t handle longish chains of redirecting URLs. In other words: all search engines consider URLs involved in longish redirect chains unreliable, not trustworthy, low quality …
What’s that to you? Well, you might play redirect ping pong with search engine crawlers unknowingly. If you’ve ever redesigned a site, chances are you’ve build chained redirects. In most cases those chains aren’t too complex, but it’s worth checking. Bear in mind that Apache, .htaccess, scripts or CMS software and whatnot can perform redirects, often without notice and undetectable with a browser.
I made up this example, but I’ve seen worse redirect chains. Here is the transcript of Ms. Googlebot’s chat with your Web server:
Googlebot: Now that’s a nice link I’ve discovered on this old and trusted page. I can’t wait to fetch it. Hey port 80 at yourstuff.com, would you please be so kind to serve me
.htaccess: Oh silly Googlebot, don’t you read Matt’s blog? He told me that a 301 redirect is the canonical answer when someone requests my stuff without the www-prefix. I didn’t bother to lookup the resource you’ve asked for, and why should I since your request is wrong, wrong, wrong! Here is the canonical URL:
301-Moved permanently, Location: http://www.yourstuff.com/some-page.
Googlebot: Dear Web server, of course that’s the best thing to do. I apologize for my silly request, but please note that I’m following a link where the Webmaster of the authority site forgot the dup-dup-dup stuff in the HREF attribute’s value. Can you please serve me
Apache: Doh! Why does this silly bot bother me with an incomplete URI? Since
/some-page is not an existing file, probably it’s a directory and a brain dead Webmaster forgot the trailing slash. Lets see … yep, that’s it, and it has even a default document
index.php. Dear Ms. Googlebot, here is the correct URL:
302-Found elsewhere, Location: http://www.yourstuff.com/some-page/index.php.
[Most servers will respond with
301-Moved permanently, Location: http://www.yourstuff.com/some-page/, but however the server is configured, it will be a redirect.]
Googlebot: Oh I see, I’ve got a really malformed URI. Sorry for bothering you again, could you please deliver the content of
http://www.yourstuff.com/some-page/index.php? Thanks in advance!
/some-page/index.php: Uuaahhh … who’s knocking? I slept sooo long … not fully awake … please be patient while Apache hands me over to the PHP parser. Actually, I’m totally outdated because my Webmaster discovered some blogging software and moved my contents to a WordPress database. Please wait. Where is the darn post-it with the new address? Ahh, just found it under my pillow. Please don’t disturb me again, bye bye, and here you go:
301-Moved permanantly, Location: http://www.yourstuff.com/blog/some-post/.
Googlebot: You’re kidding! Why didn’t you give me this URI in the first place? Never mind, I want
Hack in /blog/index.php: Oh my god, another bot following outdated links. My Webmaster is clever, he read somewhere that category names in URLs boost search engine rankings and changed the permalink structure to category/postname. Wait a moment, I’m looking up the category for this post, be right back. Hmmm, this category name is not exactly a money term … the heck with it! Here is the keyword stuffed URL I just made for you:
301-Moved permanantly, Location: http://www.yourstuff.com/blog/cat/some-post/.
Googlebot: Four crappy redirects in a row, what a bunch of bull! This is my ultimately last try: gimme
WordPress: See, that didn’t hurt, did it? Here you are:
200-OK, the content
Googlebot: Here’s your shabby page. Don’t send me to this assclowns again! I’ve better things to do than running five times to this misconfigured server only to fetch a keyword laden sales pitch with 34 grammar errors, 67 typos, and a ton of affiliate links. Grrrr!
Crawl scheduler: Sorry dear. I can’t blacklist them for stuff like that, but I’ve set the crawl frequency for yourstuff.com to once a year, and I’ve red-flagged the document ID so that the indexer can downrank it accordingly.
Do you really want to treat Ms. Googlebot so badly? Not to speak of the minus points you gain for playing redirect ping pong with a search engine. Maybe most search engines index a page served after four redirects, but I won’t rely on such a redirect chain. It’s quite easy to shorten it. Just delete outdated stuff so that all requests run into a 404-Not found, then write up a list in a format like
and write a simple redirect script which reads this file and performs a 301 redirect to
New URI when
REQUEST_URI == Old URI. If
REQUEST_URI doesn’t match any entry, then send a 404 header and include your actual error page. If you need to change the final URLs later on, you can easily do that in the text file’s right column with search and replace.
Next point the
ErrorDocument 404 directive in your root’s .htaccess file to this script. Done. Not looking at possible www/non-www canonicalization redirects, you’ve shortened the number of redirects to one, regardless how often you’ve moved your pages. Don’t forget to add all outdated URLs to the list when you redesign your stuff again, and cover common 3rd party sins like truncating trailing slashes too. The flat file from the example above would look like:
With a large site consider a database table, processing huge flat files with every 404 error can come with disadvantages. Also, if you’ve patterns like
/blog/post-name/ ==> /blog/cat/post-name/ then don’t generate and process longish mapping tables but cover these redirects algorithmically.
To gather URLs worth a 301 redirect use these sources:
- Your server logs.
- 404/301/302/… reports from your server stats.
- Google’s Web crawl error reports.
- Tools like XENU’s Link Sleuth which crawl your site and output broken links as well as all sorts of redirects, and can even check your complete Web space for orphans.
- Sitemaps of outdated structures/site areas.
- Server header checkers which follow all redirects to the final destination.
Disclaimer: If you suffer from IIS/ASP, free hosts, restrictive hosts like Yahoo or other serious maladies, this post is not for you.
does did your site play redirect ping pong with search engine crawlers?
Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to Entries Comments All Comments