We find redirects on every Web site out there. They’re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, but seldom perfect from a SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictorily, misleading and plain false advice on redirects. If you for example read “for SEO purposes you must make use of 301 redirects only” then better close the browser window/tab to prevent you from crappy advice. A 302 or 307 redirect can be search engine friendly too.
With this post I do plan to bore you to death. So lean back, grab some popcorn, and stay tuned for a longish piece explaining the Interweb’s forwarding requests as dull as dust. Or, if you know everything about redirects, then please digg, sphinn and stumble this post before you surf away. Thanks.
Redirects are defined in the HTTP protocol, not in search engine guidelines
For the moment please forget everything you’ve heard about redirects and their SEO implications, clear your mind, and follow me to the very basics defined in the HTTP protocol. Of course search engines interpret some redirects in a non-standard way, but understanding the norm as well as its use and abuse is necessary to deal with server sided redirects. I don’t bother with outdated HTTP 1.0 stuff, although some search engines still apply it every once in a while, hence I’ll discuss the 307 redirect introduced in HTTP 1.1 too. For information on client sided redirects please refer to Meta Refresh - the poor man’s 301 redirect or read my other pamphlets on redirects, and stay away from JavaScript URL manipulations.
What is a server sided redirect?
Think about an HTTP redirect as a forwarding request. Although redirects work slightly different from snail mail forwarding requests, this analogy perfectly fits the procedure. Whilst with US Mail forwarding requests a clerk or postman writes the new address on the envelope before it bounces in front of a no longer valid respectively temporarily abandoned letter-box or pigeon hole, on the Web the request’s location (that is the Web server responding to the server name part of the URL) provides the requestor with the new location (absolute URL).
A server sided redirect tells the user agent (browser, Web robot, …) that it has to perform another request for the URL given in the HTTP header’s “location” line in order to fetch the requested contents. The type of the redirect (301, 302 or 307) also instructs the user agent how to perform future requests of the Web resource. Because search engine crawlers/indexers try to emulate human traffic with their content requests, it’s important to choose the right redirect type both for humans and robots. That does not mean that a 301-redirect is always the best choice, and it certainly does not mean that you always must return the same HTTP response code to crawlers and browsers. More on that later.
Execution of server sided redirects
Server sided redirects are executed before your server delivers any content. In other words, your server ignores everything it could deliver (be it a static HTML file, a script output, an image or whatever) when it runs into a redirect condition. Some redirects are done by the server itself (see handling incomplete URIs), and there are several places where you can set (conditional) redirect directives: Apache’s httpd.conf, .htaccess, or in application layers for example in PHP scripts. (If you suffer from IIS/ASP maladies, this post is for you.) Examples:
| Browser Request: |
ww.site.com /page.php?id=1 |
site.com /page.php?id=1 |
www.site.com /page.php?id=1 |
www.site.com /page.php?id=2 |
| Apache: |
301 header:
www.site.com /page.php?id=1 |
|
|
|
| .htaccess: |
|
301 header:
www.site.com /page.php?id=1 |
|
|
| /page.php: |
|
|
301 header:
www.site.com /page.php?id=2 |
200 header:
(Info like content length...)
Content: Article #2 |
The 301 header may or may not be followed by a hyperlink pointing to the new location, solely added for user agents which can’t handle redirects. Besides that link, there’s no content sent to the client after the redirect header.
More important, you must not send a single byte to the client before the HTTP header. If you for example code [space(s)|tab|new-line|HTML code]<?php ... in a script that shall perform a redirect or is supposed to return a 404 header (or any HTTP header different from the server’s default instructions), you’ll produce a runtime error. The redirection fails, leaving the visitor with an ugly page full of cryptic error messages but no link to the new location.
That means in each and every page or script which possibly has to deal with the HTTP header, put the logic testing those conditions at the very top. Always send the header status code and optional further information like a new location to the client before you process the contents.
After the last redirect header line terminate execution with the “L” parameter in .htaccess, PHP’s exit; statement, or whatever.
An HTTP redirect, regardless its type, consists of two lines in the HTTP header. In this example I’ve requested http://www.sebastians-pamphlets.com/about/, which is an invalid URI because my server name lacks the www-thingy, hence my canonicalization routine outputs this HTTP header:
HTTP/1.1 301 Moved Permanently
Date: Mon, 01 Oct 2007 17:45:55 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
Location: http://sebastians-pamphlets.com/about/
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
The redirect response code in a HTTP status line
The first line of the header defines the protocol version, the reponse code, and provides a human readable reason phrase. Here is a shortened and slightly modified excerpt quoted from the HTTP/1.1 protocol definition:
Status-Line
The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP (space) characters. No CR or LF is allowed except in the final CRLF sequence.
Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
[e.g. “HTTP/1.1 301 Moved Permanently” + CRLF]
Status Code and Reason Phrase
The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. […] The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.
The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. […]:
[…]
- 3xx: Redirection - Further action must be taken in order to complete the request
[…]
The individual values of the numeric status codes defined for HTTP/1.1, and an example set of corresponding Reason-Phrases, are presented below. The reason phrases listed here are only recommendations — they MAY be replaced by local equivalents without affecting the protocol [that means you could translate and/or rephrase them].
[…]
300: Multiple Choices
301: Moved Permanently
302: Found [Elsewhere]
303: See Other
304: Not Modified
305: Use Proxy
307: Temporary Redirect
[…]
In terms of SEO the understanding of 301/302-redirects is important. 307-redirects, introduced with HTTP/1.1, are still capable to confuse some search engines, even major players like Google when Ms. Googlebot for some reasons thinks she must do HTTP/1.0 requests, usually caused by weird respectively ancient server configurations (or possibly testing newly discovered sites under certain circumstances). You should not perform 307 redirects as response to most HTTP/1.0 requests, use 302/301 –whatever fits best– instead. More info on this issue below in the 302/307 sections.
Please note that the default reponse code of all redirects is 302. That means when you send a HTTP header with a location directive but without an explicit response code, your server will return a 302-Found status line. That’s kinda crappy, because in most cases you want to avoid the 302 code like the plague. Do no nay never rely on default response codes! Always prepare a server sided redirect with a status line telling an actual response code (301, 302 or 307)! In server sided scripts (PHP, Perl, ColdFusion, JSP/Java, ASP/VB-Script…) always send a complete status line, and in .htaccess or httpd.conf add a [R=301|302|307,L] parameter to statements like RewriteRule:
RewriteRule (.*) http://www.site.com/$1 [R=301,L]
The next element you need in every redirect header is the location directive. Here is the official syntax:
Location
The Location response-header field is used to redirect the recipient to a location other than the Request-URI for completion of the request or identification of a new resource. […] For 3xx responses, the location SHOULD indicate the server’s preferred URI for automatic redirection to the resource. The field value consists of a single absolute URI.
Location = “Location” “:” absoluteURI [+ CRLF]
An example is:
Location: http://sebastians-pamphlets.com/about/
Please note that the value of the location field must be an absolute URL, that is a fully qualified URL with scheme (http|https), server name (domain|subdomain), and path (directory/file name) plus the optional query string (”?” followed by variable/value pairs like ?id=1&page=2...), no longer than 2047 bytes (better 255 bytes because most scripts out there don’t process longer URLs for historical reasons). A relative URL like ../page.php might work in (X)HTML (although you better plan a spectacular suicide than any use of relative URIs!), but you must not use relative URLs in HTTP response headers!
How to implement a server sided redirect?
You can perform HTTP redirects with statements in your Web server’s configuration, and in server sided scripts, e.g. PHP or Perl. JavaScript is a client sided language and therefore lacks a mechanism to do HTTP redirects. That means all JS redirects count as a 302-Found response.
Bear in mind that when you redirect, you possibly leave tracks of outdated structures in your HTML code, not to speak of incoming links. You must change each and every internal link to the new location, as well as all external links you control or where you can ask for an URL update. If you leave any outdated links, visitors probably don’t spot it (although every redirect slows things down), but search engine spiders continue to follow them, what ends in redirect chains eventually. Chained redirects often are the cause of deindexing pages, site areas or even complete sites by search engines, hence do no more than one redirect in a row and consider two redirects in a row risky. You don’t control offsite redirects, in some cases a search engine has already counted one or two redirects before it requests your redirecting URL (caused by redirecting traffic counters etcetera). Always redirect to the final destination to avoid useless hops which kill your search engine traffic. (Google recommends “that you use fewer than five redirects for each request”, but don’t try to max out such limits because other services might be less BS-tolerant.)
Like conventional forwarding requests, redirects do expire. Even a permanent 301-redirect’s source URL will be requested by search engines every now and then because they can’t trust you. As long as there is one single link pointing to an outdated and redirecting URL out there, it’s not forgotten. It will stay alive in search engine indexes and address books of crawling engines even when the last link pointing to it was changed or removed. You can’t control that, and you can’t find all inbound links a search engine knows, despite their better reporting nowadays (neither Yahoo’s site explorer nor Google’s link stats show you all links!). That means you must maintain your redirects forever, and you must not remove (permanent) redirects. Maintenance of redirects includes hosting abandoned domains, and updates of location directives whenever you change the final structure. With each and every revamp that comes with URL changes check for incoming redirects and make sure that you eliminate unnecessary hops.
Often you’ve many choices where and how to implement a particular redirect. You can do it in scripts and even static HTML files, CMS software, or in the server configuration. There’s no such thing as a general best practice, just a few hints to bear in mind.
Doubt: Don’t believe Web designers and developers when they say that a particular task can’t be done without redirects. Do your own research, or ask an SEO expert. When you for example plan to make a static site dynamic by pulling the contents from a database with PHP scripts, you don’t need to change your file extensions from *.html to *.php. Apache can parse .html files for PHP, just enable that in your root’s .htaccess:
AddType application/x-httpd-php .html .htm .shtml .txt .rss .xml .css
Then generate tiny PHP scripts calling the CMS to replace the outdated .html files. That’s not perfect but way better than URL changes, provided your developers can manage the outdated links in the CMS’ navigation. Another pretty popular abuse of redirects is click tracking. You don’t need a redirect script to count clicks in your database, make use of the onclick event instead.
- Transparency: When the shit hits the fan and you need to track down a redirect with not more than the HTTP header’s information in your hands, you’ll begin to believe that performance and elegant coding is not everything. Reading and understanding a large httpd.conf file, several complex .htaccess files, and searching redirect routines in a conglomerate of a couple generations of scripts and include files is not exactly fun. You could add a custom field identifying the piece of redirecting code to the HTTP header. In .htaccess that would be achieved with
Header add X-Redirect-Src "/content/img/.htaccess"
and in PHP with
header("X-Redirect-Src: /scripts/inc/header.php", TRUE);
(Whether or not you should encode or at least obfuscate code locations in headers depends on your security requirements.)
- Encapsulation: When you must implement redirects in more than one script or include file, then encapsulate all redirects including all the logic (redirect conditions, determining new locations, …). You can do that in an include file with a meaningful file name for example. Also, instead of plastering the root’s .htaccess file with tons of directory/file specific redirect statements, you can gather all requests for redirect candidates and call a script which tests the REQUEST_URI to execute the suitable redirect. In .htaccess put something like:
RewriteEngine On
RewriteBase /old-stuff
RewriteRule ^(.*)\.html$ do-redirects.php
This code calls /old-stuff/do-redirects.php for each request of an .html file in /old-stuff/. The PHP script:
$requestUri = $_SERVER["REQUEST_URI"];
if (stristr($requestUri, "/contact.html")) {
$location = "http://example.com/new-stuff/contact.htm";
}
...
if ($location) {
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("X-Redirect-Src: /old-stuff/do-redirects.php", TRUE);
@header("Location: $location");
exit;
}
else {
[output the requested file or whatever]
}
(This is also an example of a redirect include file which you could insert at the top of a header.php include or so. In fact, you can include this script in some files and call it from .htaccess without modifications.) This method will not work with ASP on IIS because amateurish wannabe Web servers don’t provide the REQUEST_URI variable.
- Documentation: When you design or update an information architecture, your documentation should contain a redirect chapter. Also comment all redirects in the source code (your genial regular expressions might lack readability when someone else looks at your code). It’s a good idea to have a documentation file explaining all redirects on the Web server (you might work with other developers when you change your site’s underlying technology in a few years).
- Maintenance: Debugging legacy code is a nightmare. And yes, what you write today becomes legacy code in a few years. Thus keep it simple and stupid, implement redirects transparent rather than elegant, and don’t forget that you must change your ancient redirects when you revamp a site area which is the target of redirects.
- Performance: Even when performance is an issue, you can’t do everything in httpd.conf. When you for example move a large site changing the URL structure, the redirect logic becomes too complex in most cases. You can’t do database lookups and stuff like that in server configuration files. However, some redirects like for example server name canonicalization should be performed there, because they’re simple and not likely to change. If you can’t change httpd.conf, .htaccess files are for you. They’re are slower than cached config files but still faster than application scripts.
Redirects in server configuration files
Here is an example of a canonicalization redirect in the root’s .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^sebastians-pamphlets\.com [NC]
RewriteRule (.*) http://sebastians-pamphlets.com/$1 [R=301,L]
- The first line enables Apache’s mod_rewrite module. Make sure it’s available on your box before you copy, paste and modify the code above.
-
The second line checks the server name in the HTTP request header (received from a browser, robot, …). The “NC” parameter ensures that the test of the server name (which is, like the scheme part of the URI, not case sensitive by definition) is done as intended. Without this parameter a request of http://SEBASTIANS-PAMPHLETS.COM/ would run in an unnecessary redirect. The rewrite condition returns TRUE when the server name is not sebastians-pamphlets.com. There’s an important detail: not “!”
Most Webmasters do it the other way round. They check if the server name equals an unwanted server name, for example with RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]. That’s not exactly efficient, and fault-prone. It’s not efficient because one needs to add a rewrite condition for each and every server name a user could type in and the Web server would respond to. On most machines that’s a huge list like “w.example.com, ww.example.com, w-w-w.example.com, …” because the default server configuration catches all not explicitely defined subdomains.
Of course next to nobody puts that many rewrite conditions into the .htaccess file, hence this method is fault-prone and not suitable to fix canonicalization issues. In combination with thoughtlessly usage of relative links (bullcrap that most designers and developers love out of lazyness and lack of creativity or at least fantasy), one single link to an existing page on a non-exisiting subdomain not redirected in such an .htaccess file could result in search engines crawling and possibly even indexing a complete site under the unwanted server name. When a savvy competitor spots this exploit you can say good bye to a fair amount of your search engine traffic.
Another advantage of my single line of code is that you can point all domains you’ve registered to catch type-in traffic or whatever to the same Web space. Every new domain runs into the canonicalization redirect, 100% error-free.
- The third line performs the 301 redirect to the requested URI using the canonical server name. That means when the request URI was http://www.sebastians-pamphlets.com/about/, the user agent gets redirected to http://sebastians-pamphlets.com/about/. The “R” parameter sets the reponse code, and the “L” parameter means leave if the|one condition matches (=exit), that is the statements following the redirect execution, like other rewrite rules and such stuff, will not be parsed.
If you’ve access to your server’s httpd.conf file (what most hosting services don’t allow), then better do such redirects there. The reason for this recommendation is that Apache must look for .htaccess directives in the current directory and all its upper levels for each and every requested file. If the request is for a page with lots of embedded images or other objects, that sums up to hundreds of hard disk accesses slowing down the page loading time. The server configuration on the other hand is cached and therefore way faster. Learn more about .htaccess disadvantages. However, since most Webmasters can’t modify their server configuration, I provide .htaccess examples only. If you can do, then you know how to put it in httpd.conf.
Redirecting directories and files with .htaccess
When you need to redirect chunks of static pages to another location, the easiest way to do that is Apache’s redirect directive. The basic syntax is Redirect [301|302|307] Path URL, e.g. Redirect 307 /blog/feed http://feedburner.com/myfeed or Redirect 301 /contact.htm /blog/contact/. Path is always a file system path relative to the Web space’s root. URL is either a fully qualified URL (on another machine) like http://feedburner.com/myfeed, or a relative URL on the same server like /blog/contact/ (Apache adds scheme and server in this case, so that the HTTP header is build with an absolute URL in the location field; however, omitting the scheme+server part of the target URL is not recommended, see the warning below).
When you for example want to consolidate a blog on its own subdomain and a corporate Web site at example.com, then put
Redirect 301 / http://example.com/blog
in the .htacces file of blog.example.com. When you then request http://blog.example.com/category/post.html you’re redirected to http://example.com/blog/category/post.html.
Say you’ve moved your product pages from /products/*.htm to /shop/products/*.htm then put
Redirect 301 /products http://example.com/shop/products
Omit the trailing slashes when you redirect directories. To redirect particular files on the other hand you must fully qualify the locations:
Redirect 302 /misc/contact.html http://example.com/cms/contact.php
or, when the new location resides on the same server:
Redirect 301 /misc/contact.html /cms/contact.php
Warning: Although Apache allows local redirects like Redirect 301 /misc/contact.html /cms/contact.php, with some server configurations this will result in 500 server errors on all requests. Therefore I recommend the use of fully qualified URLs as redirect target, e.g. Redirect 301 /misc/contact.html http://example.com/cms/contact.php!
Maybe you found a reliable and unbeatable cheap hosting service to host your images. Copy all image files from example.com to image-example.com and keep the directory structures as well as all file names. Then add to example.com’s .htaccess
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://www.image-example.com$1.$2
The regex should match e.g. /img/nav/arrow-left.png so that the user agent is forced to request http://www.image-example.com/img/nav/arrow-left.png. Say you’ve converted your GIFs and JPGs to the PNG format during this move, simply change the redirect statement to
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://www.image-example.com$1.png
With regular expressions and RedirectMatch you can perform very creative redirects.
Please note that the response codes used in the code examples above most probably do not fit the type of redirect you’d do in real life with similar scenarios. I’ll discuss use cases for all redirect response codes (301|302|307) later on.
Redirects in server sided scripts
You can do HTTP redirects only with server sided programming languages like PHP, ASP, Perl etcetera. Scripts in those languages generate the output before anything is send to the user agent. It should be a no-brainer, but these PHP examples don’t count as server sided redirects:
print "<META HTTP-EQUIV=Refresh CONTENT="0; URL=http://example.com/">\n";
print "<script type="text/javascript">window.location = "http://example.com/";</script>\n";
Just because you can output a redirect with a server sided language that does not make the redirect an HTTP redirect.
In PHP you perform HTTP redirects with the header() function:
$newLocation = "http://example.com/";
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: $newLocation");
exit;
The first input parameter of header() is the complete header line, in the first line of code above that’s the status-line. The second parameter tells whether a previously sent header line shall be replaced (default behavior) or not. The third parameter sets the HTTP status code, don’t use it more than once. If you use an ancient PHP version (prior 4.3.0) you can’t put the 2nd and 3rd input parameter. The “@” suppresses PHP warnings and error messages.
With ColdFusion you code
<CFHEADER statuscode="307" statustext="Temporary Redirect">
<CFHEADER name="Location" value="http://example.com/">
A redirecting Perl script begins with
#!/usr/bin/perl -w
use strict;
print "Status: 302 Found Elsewhere\r\n", "Location: http://example.com/\r\n\r\n";
exit;
Even with ASP you can do server sided redirects. VBScript:
Dim newLocation
newLocation = "http://example.com/"
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", newLocation
Response.End
JScript:
Function RedirectPermanent(newLocation) {
Response.Clear();
Response.Status = 301;
Response.AddHeader("Location", newLocation);
Response.Flush();
Response.End();
}
...
Response.Buffer = TRUE;
...
RedirectPermanent ("http://example.com/");
Again, if you suffer from IIS/ASP maladies: here you go.
Remember: Don’t output anything before the redirect header, and nothing after the redirect header!
Redirects done by the Web server itself
When you read your raw server logs, you’ll find a few 302 and/or 301 redirects Apache has performed without an explicit redirect statement in the server configuration, .htaccess, or a script. Most of these automatic redirects are the result of a very popular bullshit practice: removing trailing slashes. Although the standard defines that an URI like /directory is not a file name by default, therefore equals /directory/ if there’s no file named /directory, choosing the version without the trailing slash is lazy at least, and creates lots of troubles (404s in some cases, otherwise external redirects, but always duplicate content issues you should fix with URL canonicalization routines).
For example Yahoo is a big fan of truncated URLs. They might save a few terabytes in their indexes by storing URLs without the trailing slash, but they send every user’s browser twice to those locations. Web servers must do a 302 or 301 redirect on each Yahoo-referrer requesting a directory or pseudo-directory, because they can’t serve the default document of an omitted path segment (the path component of an URI begins with a slash, the slash is its segment delimiter, and a trailing slash stands for the last (or only) segment representing a default document like index.html). From the Web server’s perspective /directory does not equal /directory/, only /directory/ addresses /directory/index.(htm|html|shtml|php|...), whereby the file name of the default document must be omitted (among other things to preserve the URL structure when the underlying technology changes). Also, the requested URI without its trailing slash may address a file or an on the fly output (if you make use of mod_rewrite to mask ugly URLs you better test what happens with screwed URIs of yours).
Yahoo wastes even their own resources. Their crawler persistently requests the shortened URL, what bounces with a redirect to the canonical URL. Here is an example from my raw logs:
74.6.20.165 - - [05/Oct/2007:01:13:04 -0400] "GET /directory HTTP/1.0″ 301 26 “-” “Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)”
74.6.20.165 - - [05/Oct/2007:01:13:06 -0400] “GET /directory/ HTTP/1.0″ 200 8642 “-” “Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)”
[I’ve replaced a rather long path with “directory”]
If you persistently redirect Yahoo to the canonical URLs (with trailing slash), they’ll use your canonical URLs on the SERPs eventually (but their crawler still requests Yahoo-generated crap). Having many good inbound links as well as clean internal links –all with the trailing slash– helps too, but is not a guarantee for canonical URL normalization at Yahoo.
Here is an example. This URL responds with 200-OK, regardless whether it’s requested with or without the canonical trailing slash:
http://www.jlh-design.com/2007/06/im-confused/
(That’s the default (mis)behavior of everybody’s darling WordPress with permalinks by the way. Here is some PHP canonicalization code to fix this flaw.) All internal links use the canonical URL. I didn’t find a serious inbound link pointing to a truncated version of this URL. Yahoo’s Site Explorer lists the URL without the trailing slash: […]/im-confused, and the same happens on Yahoo’s SERPs: […]/im-confused. Even when a server responds 200-OK to two different URLs, a serious search engine should normalize according to the internal links as well as an entry in the XML sitemap, therefore choose the URL with the trailing slash as canonical URL.
Fucking up links on search result pages is evil enough, although fortunately this crap doesn’t influence discovery crawling directly because those aren’t crawled by other search engines (but scraped or syndicated search results are crawlable). Actually, that’s not the whole horror story. Other Yahoo properties remove the trailing slashes from directory and home page links too (look at the “What Readers Viewed” column in your MBL stats for example), and some of those services provide crawlable pages carrying invalid links (pulled from the search index or screwed otherwise). That means other search engines pick those incomplete URLs from Yahoo’s pages (or other pages with links copied from Yahoo pages), crawl them, and end up with search indexes blown up with duplicate content. Maybe Yahoo does all that only to burn Google’s resources by keeping their canonicalization routines and duplicate content filters busy, but it’s not exactly gentlemanlike that such cat fights affect all Webmasters across the globe. Yahoo directly as well as indirectly burns our resources with unnecessary requests of screwed URLs, and we must implement sanitizing redirects for software like WordPress –which doesn’t care enough about URL canonicalization–, just because Yahoo manipulates our URLs to peeve Google. Doh!
If somebody from Yahoo (or MSN, or any other site manipulating URLs this way) reads my rant, I highly recommend this quote from Tim Berners-Lee (January 2005):
Scheme-Based Normalization
[…] the following […] URIs are equivalent:
http://example.com
http://example.com/
In general, an URI that uses the generic syntax for authority with an empty path should be normalized to a path of “/”.
[…]
Normalization should not remove delimiters [”/” or “?”] when their associated component is empty unless licensed to do so by the scheme specification. [emphasis mine]
In my book sentences like “Note that the absolute path cannot be empty; if none is present in the original URI, it MUST be given as ‘/’ […]” in the HTTP specification as well as Section 3.3 of the URI’s Path Segment specs do not sound like a licence to screw URLs. Omitting the path segment delimiter “/” representing an empty last path segment might sound legal if the specs are interpreted without applying common sense, but knowing that Web servers can’t respond to requests of those incomplete URIs and nevertheless truncating trailing slashes is a brain dead approach (actually, such crap deserves a couple unprintable adjectives).
Frequently scanning the raw logs for 302/301 redirects is a good idea. Also, implement documented canonicalization redirects when a piece of software responds to different versions of URLs. It’s the Webmaster’s responsibility to ensure that each piece of content is available under one and only one URL. You cannot rely on any search engine’s URL canonicalization, because shit happens, even with high sophisticated algos:
When search engines crawl identical content through varied URLs, there may be several negative effects:
1. Having multiple URLs can dilute link popularity. For example, in the diagram above [example in Google’s blog post], rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs.
2. Search results may display user-unfriendly URLs […]
Redirect or not? A few use cases.
Before I blather about the three redirect response codes you can choose from, I’d like to talk about a few situations where you shall not redirect, and cases where you probably don’t redirect but should do so.
Unfortunately, it’s a common practice to replace various sorts of clean links with redirects. Whilst legions of Webmasters don’t obfuscate their affiliate links, they hide their valuable outgoing links in fear of PageRank leaks and other myths, or react to search engine FUD with castrated links.
With very few exceptions, the A Element a.k.a. Hyperlink is the best method to transport link juice (PageRank, topical relevancy, trust, reputation …) as well as human traffic. Don’t abuse my beloved A Element:
<a onclick="window.location = 'http://example.com/'; return false;" title="http://example.com">bad example</a>
Such a “link” will transport some visitors, but does not work when JavaScript is disabled or the user agent is a Web robot. This “link” is not an iota better:
<a href="http://example.com/blocked-directory/redirect.php?url=http://another-example.com/" title="Another bad example">example</a>
Simplicity pays. You don’t need the complexity of HREF values changed to ugly URLs of redirect scripts with parameters, located in an uncrawlable path, just because you don’t want that search engines count the links. Not to speak of cases where redirecting links is unfair or even risky, for example click tracking scripts which do a redirect.
- If you need to track outgoing traffic, then by all means do it in a search engine friendly way with clean URLs which benefit the link destination and don’t do you any harm, here is a proven method.
- If you really can’t vouch for a link, for example because you link out to a so called bad neighborhood (whatever that means), or to a link broker, or to someone who paid for the link and Google can detect it or a competitor can turn you in, then add rel=”nofollow” to the link. Yeah, rel-nofollow is crap … but it’s there, it works, we won’t get something better, and it’s less complex than redirects, so just apply it to your fishy links as well as to unmoderated user input.
- If you decide that an outgoing link adds value for your visitors, and you personally think that the linked page is a great resource, then almost certainly search engines will endorse the link (regardless whether it shows a toolbar PR or not). There’s way too much FUD and crappy advice out there.
- You really don’t lose PageRank when you link out. Honestly gained PageRanks sticks at your pages. You only lower the amount of PageRank you can pass to your internal links a little. That’s not a bad thing, because linking out to great stuff can bring in more PageRank in the form of natural inbound links (there are other advantages too). Also, Google dislikes PageRank hoarding and the unnatural link patterns you create with practices like that.
- Every redirect slows things down, and chances are that a user agent messes with the redirect what can result in rendering nil, scrambled stuff, or something completely unrelated. I admit that’s not a very common problem, but it happens with some outdated though still used browsers. Avoid redirects where you can.
In some cases you should perform redirects for sheer search engine compliance, in other words selfish SEO purposes. For example don’t let search engines handle your affiliate links.
- If you operate an affiliate program, then internally redirect all incoming affiliate links to consolidate your landing page URLs. Although incoming affiliate links don’t bring much link juice, every little helps when it lands on a page which doesn’t credit search engine traffic to an affiliate.
- Search engines are pretty smart when it comes to identifying affiliate links. (Thin) affiliate sites suffer from decreasing search engine traffic. Fortunately, the engines respect robots.txt, that means they usually don’t follow links via blocked subdirectories. When you link to your merchants within the content, using URLs that don’t smell like affiliate links, it’s harder to detect the intention of those links algorithmically. Of course that doesn’t prevent you from smart algos trained to spot other patterns, and this method will not pass reviews by humans, but it’s worth a try.
- If you’ve pages which change their contents often by featuring for example a product of the day, you might have a redirect candidate. Instead of duplicating a daily changing product page, you can do a dynamic soft redirect to the product pages. Whether a 302 or a 307 redirect is the best choice depends on the individual circumstances. However, you can promote the hell out of the redirecting page, so that it gains all the search engine love without passing on PageRank etc. to product pages which phase out after a while. (If the product page is hosted by the merchant you must use a 307 response code. Otherwise make sure the 302′ing URL ist listed in your XML sitemap with a high priority. If you can, send a 302 with most HTTP/1.0 requests, and a 307 responding to HTTP/1.1 requests. See the 302/307 sections for more information.)
- If an URL comes with a session-ID or another tracking variable in its query string, you must 301-redirect search engine crawlers to an URI without such randomly generated noise. There’s no need to redirect a human visitor, but search engines hate tracking variables so just don’t let them fetch such URLs.
- There are other use cases involving creative redirects which I’m not willing to discuss here.
Of course both lists above aren’t complete.
Choosing the best redirect response code (301, 302, or 307)
I’m sick of articles like “search engine friendly 301 redirects” propagating that only permanant redirects work with search engines. That’s a lie. I read those misleading headlines daily on the webmaster boards, in my feed reader, at Sphinn, and elsewhere … and I’m not amused. Lemmings. Amateurish copycats. Clueless plagiarists. [Insert a few lines of somewhat offensive language and swearing
]
Of course most redirects out there return the wrong response code. That’s because the default HTTP response code for all redirects is 302, and many code monkeys forget to send a status-line providing the 301 Moved Permanantly when an URL was actually moved or the requested URI is not the canonical URL. When a clueless coder or hosting service invokes a Location: http://example.com/ header statement without a previous HTTP/1.1 301 Moved Permanantly status-line, the redirect becomes a soft 302 Found. That does not mean that 302 or 307 redirects aren’t search engine friendly at all. All HTTP redirects can be safely used with regard to search engines. The point is that one must choose the correct response code based on the actual circumstances and goals. Blindly 301′ing everything is counterproductive sometimes.
301 - Moved Permanently
The message of a 301 reponse code to the requestor is: “The requested URI has vanished. It’s gone forever and perhaps it never existed. I will never supply any contents under this URI (again). Request the URL given in location, and replace the outdated respectively wrong URL in your bookmarks/records by the new one for future requests. Don’t bother me again. Farewell.”
Lets start with the definition of a 301 redirect quoted from the HTTP/1.1 specifications:
The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs [(1)]. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.
The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s). […]
Read a polite “SHOULD” as “must”.
(1) Although technically you could provide more than one location, you must not do that because it irritates too many user agents, search engine crawlers included.
Make use of the 301 redirect when a requested Web resource was moved to another location, or when a user agent requests an URI which is definitely wrong and you’re able to tell the correct URI with no doubt. For URL canonicalization purposes (more info here) the 301 redirect is your one and only friend.
You must not recycle any 301′ing URLs, that means once an URL responds with 301 you must stick with it, you can’t reuse this URL for other purposes next year or so.
Also, you must maintain the 301 response and a location corresponding to the redirecting URL forever. That does not mean that the location can’t be changed. Say you’ve moved a contact page /contact.html to a CMS where it resides under /cms/contact.php. If a user agent requests /contact.html it does a 301 redirect pointing to /cms/contact.php. Two years later you change your software again, and the contact page moves to /blog/contact/. In this case you must change the initial redirect, and create a new one:
/contact.html 301-redirects to /blog/contact/, and
/cms/co