<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; .htaccess</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Why storing URLs with truncated trailing slashes is utter idiocy</title>
		<link>http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/</link>
		<comments>http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/#comments</comments>
		<pubDate>Wed, 06 Feb 2008 08:03:44 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Blogging]]></category>

		<category><![CDATA[Usability]]></category>

		<category><![CDATA[Technorati]]></category>

		<category><![CDATA[Duplicate Content]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Anchor Text]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/</guid>
		<description><![CDATA[With some Web services URL canonicalization has a downside. What works great for major search engines like Google can backfire when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog&#8217;s reputation at Technorati.
In fact the problem is not your URL canonicalization, e.g. 301 redirects [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/yahoo-steals-my-trailing-slashes.png" width="200" height="337" style="margin-left:4px;" align="right" alt="Yahoo steals my trailing slashes" title="The Yahoo mob steals my trailing slashes :(" />With some Web services URL canonicalization has a downside. What works great for major search engines like Google can backfire when a Web service like <a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#invisible-server-redirects">Yahoo thinks circumcising URLs is cool</a>. Proper URL canonicalization might, for example, screw your blog&#8217;s reputation at Technorati.</p>
<p>In fact <b>the problem is</b> not your URL canonicalization, e.g. 301 redirects from http://example.com to http://example.com/ or from http://example.com/directory to http://example.com/directory/, but <b>crappy software that removes trailing forward slashes from your URLs</b>.</p>
<p>Dear Web developers, if you really think that home page locations or directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that &#8220;http://example.com&#8221; as anchor text makes any sense when the URL in HREF points to &#8220;http://example.com/&#8221;). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes. Many uses of invalid URLs are capable of damaging the search engine rankings of the link destinations. You can&#8217;t control that, hence don&#8217;t screw our URLs. Never. Period.</p>
<p>Folks who don&#8217;t agree with the above, read on.</p>
<ul><b style="margin-left:-20px;">TOC:</b>
<li><a href="#what-is-a-trailing-slash">What is a trailing slash?</a> About URLs, directory URIs, default documents, directory indexes, &#8230;</li>
<li><a href="#how-to-rescue-trailing-slashes">How to rescue stolen trailing slashes</a> About Apache&#8217;s handling of directory requests, and rewriting or redirecting invalid directory URIs in .htaccess as well as in PHP scripts.</li>
<li><a href="#why-stealing-slashes-is-plain-robbery">Why stealing trailing slashes is not cool</a> Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.</li>
<li><a href="#url-canonicalization-irritates-technorati">How URL canonicalization irritates Technorati</a> 301 redirects that &#8220;add&#8221; a trailing slash to directory URLs, or to virtual URIs that mimic directories, seem to irritate Technorati so much that it can&#8217;t compute reputation, recent post lists, and so on.</li>
</ul>
<h3 id="what-is-a-trailing-slash">What is a trailing slash?</h3>
<p>The Web&#8217;s standards say (<a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#invisible-server-redirects">links and full quotes</a>): The trailing path segment delimiter &#8220;/&#8221; represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite &#8220;should&#8221; as &#8220;must&#8221;.)</p>
<p>To understand that, let&#8217;s look at the most common URL components:<br />
<code style="font-size:110%; font-weight:bold;"><abbr title="http:// | https://">scheme://</abbr> <abbr title="Possibly subdomains '.' domain: the 'example' or 'www.example' part of (www.)example.com">server-name</abbr>.<abbr title="Top Level Domain like '.com'">tld</abbr> <abbr title="Path = optional directory or directory hierarchy + optional file name"><b style="color:red;">/path</b></abbr> <abbr title="Variable=value pairs delimited by &amp;">?query-string</abbr> <abbr title="On the page link to a DOM-ID">#fragment</abbr></code><br />
The (red) <b>path</b> part begins with a forward slash &#8220;/&#8221; and must consist of at least one byte (the trailing slash itself in case of the home page URL http://example.com<b>/</b>). </p>
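<p>To see these components in practice, here is a minimal Python sketch that splits a URL with the standard library. The helper name <code>describe</code> and the sample URLs are assumptions for illustration only. Note that the parser reports an empty path for the slashless home page URL, and a one byte path for the canonical one.</p>

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into the components discussed above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# The canonical home page URL carries the one byte path "/".
with_slash = describe("http://example.com/")

# The truncated variant has no path at all.
without_slash = describe("http://example.com")

# A deeper URL with query string and fragment.
full = describe("http://example.com/dir/page.html?a=1#top")
```

<p>Comparing <code>with_slash["path"]</code> and <code>without_slash["path"]</code> makes the missing byte visible: the two strings are not the same URL, even though many servers paper over the difference.</p>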
<p>If a URL ends with a slash, it points to a directory&#8217;s default document, or, if there&#8217;s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because &#8220;/&#8221; after the TLD (.com|net|org|&#8230;) stands for the root directory. </p>
<p>Automated directory indexes (a list of links to all files) should be forbidden; use <code>Options -Indexes</code> in .htaccess to send such requests to your 403-Forbidden page. </p>
<p>In order to set default file names and their search sequence for your directories use <code>DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php</code>. In this example: on request of http://example.com/directory/ Apache will first look for /directory/index.html, then if that doesn&#8217;t exist for /directory/index.htm, then /directory/index.php, and if all that fails, it will serve an error page (that should log such requests so that the Webmaster can upload the missing default document to /directory/).</p>
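<p>The search sequence described above can be modelled in a few lines of Python. This is a simplified sketch of the lookup order only, not Apache&#8217;s actual implementation; the function name <code>resolve_directory_index</code> and the set of existing paths are assumptions made for the example.</p>

```python
def resolve_directory_index(directory, existing_paths, candidates, fallback):
    """Return the first candidate that exists in the directory, else the fallback."""
    for name in candidates:
        path = directory + name
        if path in existing_paths:
            return path
    # No default document found: serve the error handler instead.
    return fallback


CANDIDATES = ["index.html", "index.htm", "index.php"]
FALLBACK = "/error_handler/missing_directory_index_doc.php"

# Only index.php exists, so the first two candidates are skipped.
served = resolve_directory_index(
    "/directory/", {"/directory/index.php"}, CANDIDATES, FALLBACK
)

# Nothing exists, so the error handler answers the request.
served_when_empty = resolve_directory_index(
    "/directory/", set(), CANDIDATES, FALLBACK
)
```

<p>The order of the candidates matters: with both index.html and index.php present, the one listed first wins.</p>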
<p>The URL http://example.com (without the trailing slash) is invalid, and no specification gives a reason why a Web server should respond to it with meaningful contents. Actually, the location http://example.com points to <em>Null</em>&nbsp; (nil, zilch, nada, zip, nothing), hence the correct response is &#8220;404 - we haven&#8217;t got &#8216;nothing to serve&#8217; yet&#8221;.</p>
<p>The same goes for sub-directories. If there&#8217;s no <b>file</b> named &#8220;/dir&#8221;, the URL http://example.com/dir points to Null too. If you&#8217;ve a <b>directory</b> named &#8220;/dir&#8221;, the canonical URL http://example.com/dir/ either points to a directory index page (an autogenerated list of all files) or the directory&#8217;s default document &#8220;index.(html|htm|shtml|php|&#8230;)&#8221;. A request of http://example.com/dir &#8211;without the trailing slash that tells the Web server that the request is for a directory&#8217;s index&#8211; resolves to &#8220;not found&#8221;.   </p>
<p>You must not reference a default document by its name! If you&#8217;ve links like http://example.com/index.html you can&#8217;t change the underlying technology without serious hassles. Say you&#8217;ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you&#8217;ll realize that static stuff sucks, hence you&#8217;ll develop a dynamic site with PHP. You&#8217;ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you&#8217;ve coded your internal links as http://example.com/contact/ etc. they&#8217;ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from &#8220;&#8230; index.html &#8230; index.php &#8230;&#8221; to &#8220;&#8230; index.php &#8230; index.html &#8230;&#8221;. (Of course you can configure Apache to parse .html files for PHP code, but that&#8217;s another story.)</p>
<p>It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of &#8220;index&#8221; files (intentionally or not). I&#8217;d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes. </p>
<h3 id="how-to-rescue-trailing-slashes">How to rescue stolen trailing slashes</h3>
<p>Since Web servers know that users are faulty by design, they jump through a couple of resource burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values &#8230;) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar. </p>
<p>With Apache, <code>DirectorySlash On</code> enables this behavior (<a href="http://www.seoconsultants.com/tools/headers/">check</a> whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules: <code><br />
RewriteEngine  on<br />
RewriteBase    /content/<br />
RewriteRule    ^dir1$  http://example.com/content/dir1/  [R=301,L]<br />
RewriteRule    ^dir2$  http://example.com/content/dir2/  [R=301,L]  </code></p>
<p>With content management systems (CMS) that generate virtual URLs on the fly, often there&#8217;s no other chance than hacking the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, you&#8217;ll perform permanent redirects (<a href="http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#301-moved-permanently">301</a>).</p>
<p>Here is a WordPress (header.php) example: <code><b><br />
$requestUri                = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$queryString               = $_SERVER[&quot;QUERY_STRING&quot;];<br />
$doRedirect                = FALSE;<br />
$fileExtensions            = array(&quot;.html&quot;, &quot;.htm&quot;, &quot;.php&quot;);<br />
$serverName                = $_SERVER[&quot;SERVER_NAME&quot;];<br />
$canonicalServerName       = $serverName;<br />
&nbsp;<br />
// if you prefer http://example.com/* URLs remove the &quot;www.&quot;:<br />
$srvArr                    = explode(&quot;.&quot;, $serverName);<br />
$canonicalServerName       = $srvArr[count($srvArr) - 2]                             .&quot;.&quot; .$srvArr[count($srvArr) - 1];<br />
&nbsp;<br />
$url                       = parse_url (&quot;http://&quot; .$canonicalServerName .$requestUri);<br />
$requestUriPath            = $url[&quot;path&quot;];<br />
if (substr($requestUriPath, -1, 1) != &quot;/&quot;) {<br />
    $isFile                = FALSE;<br />
    foreach($fileExtensions as $fileExtension) {<br />
        if ( strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension) ) {<br />
            $isFile = TRUE;<br />
        }<br />
    }<br />
    if (!$isFile) {<br />
        $requestUriPath .= &quot;/&quot;;<br />
        $doRedirect        = TRUE;<br />
    }<br />
}<br />
$canonicalUrl              = &quot;http://&quot; .$canonicalServerName .$requestUriPath;<br />
if ($queryString) {<br />
    $canonicalUrl         .= &quot;?&quot; . $queryString;<br />
}<br />
if (!empty($url[&quot;fragment&quot;])) {<br />
    $canonicalUrl         .= &quot;#&quot; . $url[&quot;fragment&quot;];<br />
}<br />
if ($doRedirect) {<br />
    @header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    @header(&quot;Location: $canonicalUrl&quot;);<br />
    exit;<br />
}  </b></code><br />
Check your permalink settings and edit the values of <code>$fileExtensions</code> and <code>$canonicalServerName</code> accordingly. For other CMSs adapt the code; you may need to change the handling of query strings and fragments. The code above will not run under IIS, because it has no REQUEST_URI variable.</p>
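<p>For readers who don&#8217;t run PHP, the core decision of the routine above can be sketched in Python. This is an illustrative port under assumptions, not the code this blog uses: the function name <code>canonical_redirect_target</code> is made up, the host is passed in as is (no &#8220;www.&#8221; stripping), and the caller is expected to send the 301 itself.</p>

```python
from urllib.parse import urlsplit, urlunsplit

# Extensions that mark a request as a file rather than a directory.
FILE_EXTENSIONS = (".html", ".htm", ".php")


def canonical_redirect_target(host, request_uri, file_extensions=FILE_EXTENSIONS):
    """Return the canonical URL to redirect to, or None if no redirect is needed."""
    parts = urlsplit("http://" + host + request_uri)
    path = parts.path
    if path.endswith("/"):
        return None  # already canonical
    lowered_extensions = tuple(ext.lower() for ext in file_extensions)
    if path.lower().endswith(lowered_extensions):
        return None  # a file request, leave it alone
    # Append the missing slash and keep query string and fragment untouched.
    return urlunsplit(
        (parts.scheme, parts.netloc, path + "/", parts.query, parts.fragment)
    )


target = canonical_redirect_target("example.com", "/about")
target_with_query = canonical_redirect_target("example.com", "/about?x=1")
no_redirect = canonical_redirect_target("example.com", "/about/")
```

<p>A return value of <code>None</code> means the request is served as is; any other value is the Location for a permanent redirect.</p>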
<h3 id="why-stealing-slashes-is-plain-robbery">Why stealing trailing slashes is not cool</h3>
<p>This section expressed in one sentence: <b>Cool URLs don&#8217;t change, hence changing other people&#8217;s URLs is not cool.</b></p>
<p>Folks should understand the &#8220;U&#8221; in URL as <b>unique</b>. Each URL addresses one and only one particular resource. Technically speaking, if you change one single character of a URL, the altered URL points to a different resource, or nowhere. </p>
<p>Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.</p>
<p><img src="http://sebastians-pamphlets.com/img/posts/fools-at-yahoo.jpg" width="120" height="134" style="margin-right:4px;" align="left" alt="Yahoo'ish fools steal our trailing slashes" title="Yahoo'ish fools steal our trailing slashes :(" />Only a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren&#8217;t much different. If somebody hands out an URL (with trailing slash), then use it as is, or don&#8217;t use it at all. Don&#8217;t &#8220;prettify&#8221; it, because any change destroys its serviceability.</p>
<p>If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (browser, screen reader, bot) with a redirect header telling that one must use a trailing slash, then the user agent has to re-issue the request in the formally correct way. From a Webmaster&#8217;s perspective, burning resources that thoughtlessly is plain theft. From a user&#8217;s perspective, things will often work without the slash, but they&#8217;ll be quicker with it. &#8220;<b>Often</b>&#8221; doesn&#8217;t equal &#8220;always&#8221;:
<ul>
<li>Some Web servers will serve the 404 page.</li>
<li>Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.</li>
<li>Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.</li>
<li>Some Web servers will serve identical contents for /dir and /dir/, which leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to all known URL variants, instead of accumulated rankings to both URLs (which would be the right thing to do, but is technically, well, challenging).</li>
<li>Some user agents can&#8217;t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect&#8217;s &#8220;error message&#8221;, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang up or process garbage.</li>
</ul>
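<p>The server behaviors listed above can be told apart from nothing more than the status code and the Location header. The following Python sketch classifies a response to a slashless directory request; the function name <code>classify_slashless_response</code> and the wording of the labels are assumptions for illustration, and the response values are passed in by hand rather than fetched over the network.</p>

```python
def classify_slashless_response(requested_url, status, location=None):
    """Label the response a server gave to a directory request without trailing slash."""
    canonical_url = requested_url + "/"
    if status == 301 and location == canonical_url:
        return "permanent redirect to the canonical URL"
    if status in (302, 303, 307) and location == canonical_url:
        return "temporary redirect, the slashless URL may get indexed"
    if status == 404:
        return "not found"
    if status == 200:
        # Either a duplicate of /dir/ or unrelated content living at /dir.
        return "content served under the slashless URL"
    return "unexpected response"


good = classify_slashless_response(
    "http://example.com/dir", 301, "http://example.com/dir/"
)
sloppy = classify_slashless_response(
    "http://example.com/dir", 302, "http://example.com/dir/"
)
duplicate = classify_slashless_response("http://example.com/dir", 200)
```

<p>Only the first label describes a server that cleans up after the slash thieves without side effects; every other outcome costs rankings, bandwidth, or visitors.</p>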
<p>Does it really make sense to maliciously manipulate URLs just because some clueless developers say &#8220;dude, without the slash it looks way cooler&#8221;? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach. </p>
<p>KISS (keep it simple, stupid) is a great principle. &#8220;Cosmetic corrections&#8221; like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don&#8217;t change their inputs. As long as the input is processible, they accept it, otherwise they skip it. </p>
<h3>Exceptions</h3>
<p style="border: 1px solid red; padding:2px; margin:2px; margin-bottom:25px;">URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - &#8220;domain.co.uk&#8221; in print and &#8220;domain dot co dot uk&#8221; on radio is enough. The necessary redirect is cheaper than a visitor who doesn&#8217;t type in the canonical URL including scheme, www-prefix, and trailing slash.</p>
<h3 id="url-canonicalization-irritates-technorati">How URL canonicalization seems to irritate Technorati</h3>
<p>Because Technorati&#8217;s user support is not exactly responsive (or rather swamped), parts of this section should be interpreted as educated speculation. Also, I didn&#8217;t research enough cases to come to a working theory. So here is just the story &#8220;how Technorati fails to deal with my blog&#8221;.</p>
<p>When I moved my blog from <a href="http://sebastians-pamphlets.com/about/sebastianx-blogspot-com/">blogspot</a> to this domain, <a href="http://sebastians-pamphlets.com/how-to-seo-sanitize-a-wordpress-theme/">I enhanced the faulty WordPress URL canonicalization</a>. If any user agent requests http://sebastians-pamphlets.com it gets redirected to http://sebastians-pamphlets.com/. Invalid post/page URLs like http://sebastians-pamphlets.com/about redirect to http://sebastians-pamphlets.com/about/. All redirects are permanent, returning the HTTP response code &#8220;301&#8221;.</p>
<p>I&#8217;ve claimed my blog as http://sebastians-pamphlets.com<b>/</b>, but <a href="http://www.technorati.com/people/technorati/SebastianX">Technorati shows its URL without the trailing slash</a>. <code><small><br />
&#8230;&lt;div class=&quot;url&quot;&gt;&lt;a href=&quot;http://sebastians-pamphlets.com&quot;&gt;http://sebastians-pamphlets.com&lt;/a&gt; &lt;/div&gt;  &lt;a class=&quot;image-link&quot; href=&quot;/blogs/sebastians-pamphlets.com&quot;&gt;&lt;img &#8230;</small></code></p>
<p>By the way, they forgot dozens of fans (folks who &#8220;fave&#8217;d&#8221; either my old blogspot outlet or this site) too.<br />
<img src="http://sebastians-pamphlets.com/img/posts/technorati-claimed-blogs.png" width="498" height="233" alt="Blogs claimed at Technorati" title="My blog at Technorati" /></p>
<p>I&#8217;ve added a description and tons of tags, neither of which shows up on public pages. It seems my tags were deleted; at least they aren&#8217;t visible in edit mode any more.<br />
<img src="http://sebastians-pamphlets.com/img/posts/technorati-edit-claimed-blog.png" width="498" height="121" alt="Edit blog settings at Technorati" title="Editing my blog's settings at Technorati" /></p>
<p>Shortly after the submission, Technorati stopped adjusting the reputation score for newly discovered inbound links. Furthermore, the <a href="http://www.technorati.com/blogs/sebastians-pamphlets.com?posts">list of my recent posts</a> became stale, although I&#8217;ve pinged Technorati with every update, and Technorati received my update notifications via ping services too. And yes, I&#8217;ve tried <a href="http://www.technorati.com/ping/http://sebastians-pamphlets.com">manual pings</a> to no avail.</p>
<p>I&#8217;ve gained lots of fresh inbound links, but the authority score didn&#8217;t change. So I asked Technorati&#8217;s support for help. A few weeks later, in December 2007, I got an answer:</p>
<blockquote><p>I&#8217;ve taken a look at the issue regarding picking up your pings for &#8220;sebastians-pamphlets.com&#8221;.  After making a small adjustment, I&#8217;ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.</p>
<p>Please let us know if you experience any problems in the future.  Do not hesitate to contact us if you have any other questions.</p>
</blockquote>
<p>Indeed, Technorati updated the reputation score from &#8220;56&#8221; to &#8220;191&#8221;, and refreshed the list of posts including the most recent one.</p>
<p>Of course the &#8220;small adjustment&#8221; didn&#8217;t persist (I assume that a batch process stole the trailing slash that the friendly support person had added). I&#8217;ve sent a follow-up email asking whether that&#8217;s a slash issue or not, but haven&#8217;t received a reply yet. I&#8217;m quite sure that Technorati doesn&#8217;t follow 301-redirects, so that&#8217;s a plausible cause for this bug at least.</p>
<p>Since December 2007 Technorati hasn&#8217;t updated my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the <a href="http://www.technorati.com/blogs/sebastians-pamphlets.com?reactions">reactions page</a> - by the way these numbers are often unrealistic and change in the range of hundreds from day to day).<br />
<img src="http://sebastians-pamphlets.com/img/posts/technorati-blog-reactions-with-authority-scoring.png" width="498" height="225" alt="Blog reactions and authority scoring at Technorati" title="My inbound links and stale authority scoring at Technorati" /></p>
<p>It seems Technorati hasn&#8217;t indexed my posts since then (December/18/2007), so probably my outgoing links don&#8217;t count for their destinations.<br />
<img src="http://sebastians-pamphlets.com/img/posts/technorati-latest-posts-stale-since-2-months.png" width="498" height="196" alt="Stale list of recent posts at Technorati" title="No new posts indexed by Technorati since December/18/2007" /></p>
<p>(All screenshots were taken on February/05/2008. When you click the Technorati links today, it <strike>could</strike> hopefully will look differently.)</p>
<p>I&#8217;m not amused. I&#8217;m curious what would happen when I add <code><br />
if (!preg_match(&quot;/Technorati/i&quot;, &quot;$userAgent&quot;)) {/* redirect code */} </code><br />
to my canonicalization routine, but I can resist special-casing particular Web robots. My URL canonicalization should be identical both for visitors and crawlers. <a href="http://www.technorati.com/tag/technorati+bug+report" rel="tag nofollow">Technorati</a> should be able to fix this bug without code changes at my end or weekly support requests. Wishful thinking? Maybe.</p>
<p><b>Update 2008-03-06:</b> Technorati crawls my blog again. The 301 redirects weren&#8217;t the issue. I&#8217;ll explain that in a follow-up post soon.</p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The hacker tool MSN-LiveSearch is responsible for brute force attacks</title>
		<link>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/</link>
		<comments>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 15:36:08 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Testing]]></category>

		<category><![CDATA[MSN]]></category>

		<category><![CDATA[Search Quality]]></category>

		<category><![CDATA[Crap]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/</guid>
		<description><![CDATA[A while ago I staged a public SEO contest, asking whether the 401 HTTP response code prevents search engine indexing or not. 
Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 [...]]]></description>
			<content:encoded><![CDATA[<p><img  src="http://sebastians-pamphlets.com/img/posts/401-private-property-keep-out.png" width="200" height="133" align="right" style="margin-left:4px;" alt="401 = Private Property, keep out!" title="401 = Private Property, keep out!" />A while ago I staged a public <a href="http://sebastians-pamphlets.com/seo-test-do-search-engines-index-password-protected-urls/">SEO contest</a>, asking whether the 401 HTTP response code prevents search engine indexing or not. </p>
<p>Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 HTTP response code that translates to a polite &#8220;Authorization Required&#8221;, meaning &#8220;Forbidden unless you provide valid authorization&#8221;. </p>
<p>Experience of life and common sense tell search engines that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters/site owners do not point their users to protected content. </p>
<p>Also, that makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly &#8220;Enter user and password in order to access [title of the protected area]&#8221;, that resolves to a harsh error message because the searcher can&#8217;t provide such information, and usually can&#8217;t even sign up from the 401 error page<sup><a href="#401-error-document-footnote">1</a></sup>. </p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/evil-use-of-search-results.png" width="200" height="255" align="right" style="margin-left:4px;" alt="Evil use of search results" title="The evil variant of search results " />Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example <code>/members/</code>. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web. </p>
<p>Developing a script that searches for a string like <code>/members/</code> in URLs and then &#8220;tests&#8221; the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker&#8217;s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DoS-like server attacks.</p>
<p>Ok, let&#8217;s look at the test results. All search engines have cast their votes now. <b>Here are the winners:</b> </p>
<h3>Google <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Once my test was out, <a href="http://mattcutts.com/blog/">Matt Cutts</a> from <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40207">Google</a> researched the question and told me:</p>
<blockquote><p>My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won&#8217;t be indexed even as a reference, so .htacess password-protected directories shouldn&#8217;t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn&#8217;t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.</p>
</blockquote>
<p>Well, that&#8217;s exactly the expected behavior, and I wasn&#8217;t surprised that my test results confirm Matt&#8217;s statement. Thanks to Google&#8217;s BlitzIndexing&trade;, Ms. Googlebot spotted the 401 so fast that the URL never showed up on Google&#8217;s SERPs. Google reports the <a href="http://sebastians-pamphlets.com/porn/">protected URL</a> in my <a href="http://google.com/webmasters/tols/">Webmaster Console</a> account for this blog as not indexable.</p>
<h3>Yahoo <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Yahoo&#8217;s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that&#8217;s going to change if <a href="http://searchengineland.com/080201-064343.php">M$ buys Yahoo</a>. </p>
<h3>Ask <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </h3>
<p>Ask&#8217;s crawler isn&#8217;t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.</p>
<p><b style="font-size:110%;">And here is the ultimate loser:</b></p>
<h3>MSN LiveSearch <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </h3>
<p>Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker&#8217;s toolbox:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/msn-indexes-401-protected-urls.png" width="467" height="223" align="center" style="" alt="MSN LiveSearch indexes password protected URLs" title="MSN LiveSearch indexes password protected URLs" /></p>
<p>As if indexing references to password protected URLs weren&#8217;t crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolutely no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The <a href="http://sitemaps.org/">sitemap protocol</a>, which MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:</p>
<p><img  src="http://sebastians-pamphlets.com/img/posts/msn-lists-unlinked-porn-sitemap-file-2008-01.png" width="460" height="54" align="center" style="" alt="MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)" title="MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)" /><br />
<img  src="http://sebastians-pamphlets.com/img/posts/msn-indexes-unlinked-porn-sitemap-file-2008-01.png" width="460" height="58" align="center" style="" alt="MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)" title="MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)" /></p>
<p>All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask have indexed the sitemap file (they never index submitted sitemaps that have no inbound links by the way) or its protected URL.</p>
<h3>Summary</h3>
<p><b style="font-size:110%;">All major search engines except MSN respect the 401 barrier.</b></p>
<p>Since <a href="http://sebastians-pamphlets.com/msn-admits-clueless-and-ineffective-spamming/">MSN LiveSearch is well known for spamming</a>, it&#8217;s not a big surprise that they support hackers, scrapers and other content thieves. </p>
<p>Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can&#8217;t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn&#8217;t exist in most niches), and so on. Once pointed to such crap, they don&#8217;t even fix the simplest bugs in a timely manner. I mean, not complying with the HTTP 1.1 protocol from the last century is evidence of incapacity, and that&#8217;s just one example.</p>
<p>&nbsp;</p>
<p><b>Update Feb/06/2008:</b> Last night I received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, <a href="http://nathanbuggia.com/">Nathan Buggia</a>! I&#8217;m still curious how MSN Live Search will handle sitemap files in the future.</p>
<p>&nbsp;</p>
<hr width="128" color="silver" align="center" />
<p id="401-error-document-footnote"><sup>1</sup>&nbsp;<small>Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there&#8217;s no 401 error document. Please note that you shouldn&#8217;t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a <a href="http://sebastians-pamphlets.com/error401.html">static 401 error page</a> that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you <b>must not</b> use absolute URLs (http://&#8230; or https://&#8230;) in the ErrorDocument 401 directive, and that on the error page you <b>must</b> use absolute URLs for CSS, images, links and whatnot because relative URIs don&#8217;t work there!</small></p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/all-search-engines-except-msn-live-search-respect-the-401-barrier/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Upgrading from IIS/ASP to Apache/PHP</title>
		<link>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/</link>
		<comments>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/#comments</comments>
		<pubDate>Tue, 11 Dec 2007 20:47:25 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[404grabber]]></category>

		<category><![CDATA[Duplicate Content]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Copy+Paste-Penalties]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[IIS]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/</guid>
		<description><![CDATA[Once you&#8217;re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won&#8217;t work, and you can&#8217;t run MS-Access &#8220;databases&#8221; and such stuff under Apache. 
Here is my idea of a smooth migration from IIS/ASP to [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/upgrade-from-iis-asp-to-apache-php.png" width="250" height="227" align="right" style="margin-left:4px;" alt="Upgrade from Windows/IIS/ASP to Unix/Apache/PHP" title="Get the most out of your Web site - throw away Windows/IIS/ASP!"  />Once you&#8217;re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won&#8217;t work, and you can&#8217;t run MS-Access &#8220;databases&#8221; and such stuff under Apache. </p>
<p>Here is my idea of a smooth migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster&#8217;s portfolio and start over.</p>
<p>(Recently I got a tiny IIS/ASP site about <a href="http://link-condom.com/">uses &amp; abuses of link condoms</a> and moved it to an Apache server. I&#8217;m well known for brutal IIS rants, but so far I haven&#8217;t discussed a way out of such a dilemma, so I thought blogging this move could be a good idea.) </p>
<p>I don&#8217;t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer&#8217;s article <a href="http://dev.mysql.com/tech-resources/articles/migrating-from-microsoft.html">Migrating from Microsoft Access/MS-SQL to MySQL</a>, and try tools like <a href="http://asp2php.naken.cc/docs.php">ASP to PHP</a>. (With my tiny <a href="http://link-condom.com/about.asp">link condom</a> site I overwrote the ASP code with PHP statements in my primitive text editor.)</p>
<p><b>From an SEO perspective such an upgrade comes with pitfalls:</b>
<ul>
<li>Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.</li>
<li>Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.</li>
<li>Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.</li>
<li>IIS-URIs are not case sensitive, that means that /Default.asp will 404 on Apache when the filename is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.</li>
<li>Most probably search engines have URL variants in their indexes, so we want to adapt their URL canonicalization, at least where possible.</li>
<li>HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.</li>
<li>If the navigation makes use of relative links, we need to convert those to absolute URLs.</li>
<li>Error handling isn&#8217;t convertible. Improper error handling can cause decreasing search engine traffic.</li>
</ul>
<h3>Running /default.asp, /home.asp etc. as PHP scripts</h3>
<p>When you upload an .asp file to an Apache Web server, most user agents can&#8217;t handle it. Browsers treat them as unknown file types and force downloads instead of rendering them. Next those files aren&#8217;t parsed for PHP statements, provided you&#8217;ve rewritten the ASP code already.</p>
<p>To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root: <code><b><br />
AddType text/html .asp<br />
AddHandler application/x-httpd-php .asp </b></code><br />
The first line says that .asp files shall be treated as HTML documents, and should force the server to send a <code>Content-Type: text/html</code> HTTP header. The second line tells Apache that it must parse .asp files for PHP code. </p>
<p>Just in case the AddType statement above doesn&#8217;t produce a <code>Content-Type: text/html</code> header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you&#8217;ve mod_headers available, you can accomplish that with this .htaccess code: <code><b><br />
&lt;IfModule mod_headers.c&gt;<br />
SetEnvIf Request_URI \.asp is_asp=is_asp<br />
Header set &quot;Content-type&quot; &quot;text/html&quot; env=is_asp<br />
Header set imagetoolbar &quot;no&quot;<br />
&lt;/IfModule&gt; </b></code><br />
(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)<br />
If for some reason mod_headers doesn&#8217;t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top: <code><b><br />
@header(&quot;Content-type: text/html&quot;, TRUE);  </b></code><br />
Instead of &#8220;text/html&#8221; alone, you can define the character set too: &#8220;text/html; charset=UTF-8&#8221;.</p>
<h3>Sanitizing the home page URL by eliminating &#8220;default.asp&#8221;</h3>
<p>Instead of slowing down Apache by defining just another default document name (<code>DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp</code>), we get rid of &#8220;/default.asp&#8221; with this &#8220;/index.php&#8221; script: <code><b><br />
&lt;?php<br />
@require(&quot;default.asp&quot;);<br />
?&gt; </b></code><br />
Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.</p>
<p>Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess: <code><b><br />
Redirect 301 /default.asp http://example.com/<br />
Redirect 301 /Default.asp http://example.com/ </b></code></p>
<h3>Converting the ASP code for server name canonicalization</h3>
<p>If you find ASP canonicalization routines like <code><b><br />
&lt;%@ Language=VBScript %&gt;<br />
&lt;%<br />
if strcomp(Request.ServerVariables(&quot;SERVER_NAME&quot;), &quot;www.example.com&quot;, vbCompareText) = 0 then<br />
   Response.Clear<br />
   Response.Status = &quot;301 Moved Permanently&quot;<br />
   strNewUrl = Request.ServerVariables(&quot;URL&quot;)<br />
   if instr(1,strNewUrl, &quot;/default.asp&quot;, vbCompareText) &gt; 0 then<br />
     strNewUrl = replace(strNewUrl, &quot;/Default.asp&quot;, &quot;/&quot;)<br />
     strNewUrl = replace(strNewUrl, &quot;/default.asp&quot;, &quot;/&quot;)<br />
   end if<br />
   if Request.QueryString &lt;&gt; &quot;&quot; then<br />
       Response.AddHeader &quot;Location&quot;,&quot;http://example.com&quot; &amp; strNewUrl &amp; &quot;?&quot; &amp; Request.QueryString<br />
   else<br />
       Response.AddHeader &quot;Location&quot;,&quot;http://example.com&quot; &amp; strNewUrl<br />
   end if<br />
   Response.End<br />
end if<br />
%&gt;  </b></code><br />
(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too: <code><b><br />
RewriteEngine On<br />
RewriteCond %{HTTP_HOST} !^example\.com [NC]<br />
RewriteRule (.*) http://example.com/$1 [R=301,L] </b></code><br />
(you need mod_rewrite, that&#8217;s usually enabled with the default configuration of Apache Web servers). </p>
<h3>Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value</h3>
<p>Probably a M$ developer didn&#8217;t read more than the scheme and server name chapter of the URL/URI standards, at least I&#8217;ve no better explanation for the fact that these clowns made the path and query string segment of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)</p>
<p>Just because &#8211;contrary to Web standards&#8211; M$ finds it funny to serve the same contents on request of /Home.asp as well as /home.ASP, such crap doesn&#8217;t fly on the World Wide Web. Search engines &#8211;and other Web services which store URLs&#8211; treat them as different URLs, and consider everything except one version duplicate content.</p>
<p>Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like &#8220;/Script.asp&#8221;, although the file itself is stored with an all-lowercase name, and the FTP client uploads &#8220;/script.asp&#8221; to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won&#8217;t get 404-Not found responses. </p>
<p>Don&#8217;t misunderstand me, you&#8217;re free to camel-case file names for improved readability, but then make sure that the file system&#8217;s notation matches the URIs in HREF/SRC values. (Of course hyphened file names like &#8220;buy-cheap-viagra.asp&#8221; top the CamelCased version &#8220;BuyCheapViagra.asp&#8221; when it comes to search engine rankings, but don&#8217;t freak out about keywords in URLs, that&#8217;s ranking factor #202 or so.)</p>
<p>Technically speaking, converting all file names as well as variable names and values to all-lowercase is the simplest solution. This way it&#8217;s quite easy to 301-redirect all invalid requests to the canonical URLs. </p>
<p>However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (<a href="http://sphinn.com/story/16345">MSN Live Search</a> for example doesn&#8217;t follow permanent redirects and doesn&#8217;t pass the reputation earned by the old URL over to the new URL). So if you&#8217;ve good SERP positions for &#8220;misspelled&#8221; URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site&#8217;s server stats to identify all URLs you don&#8217;t want to redirect. By the way, the link reports in <a href="http://www.google.com/webmasters/tools/">Google&#8217;s Webmaster Console</a> and <a href="http://siteexplorer.search.yahoo.com/">Yahoo&#8217;s Site Explorer</a> reveal invalid URLs with (internal as well as external) inbound links too.</p>
<p>Whatever strategy fits your needs best, you&#8217;ve to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive: <code><b><br />
ErrorDocument 404 /404handler.php </b></code><br />
That&#8217;s safe with static URLs without parameters and should work with dynamic URIs too. When you &#8211;in some cases&#8211; deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier: <code><b><br />
&lt;IfModule mod_rewrite.c&gt;<br />
RewriteEngine On<br />
RewriteBase /<br />
RewriteCond %{REQUEST_FILENAME} !-f<br />
RewriteCond %{REQUEST_FILENAME} !-d<br />
RewriteRule . /404handler.php [L]<br />
&lt;/IfModule&gt; </b></code><br />
In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn&#8217;t exist.</p>
<p>And here is the PHP script /404handler.php:<br />
<code id="php-code-404-handler"><b><br />
&lt;?php // 404handler.php<br />
      // called from .htaccess if the requested path doesn&#8217;t exist<br />
&nbsp;<br />
$thisFileName    = &quot;404handler.php&quot;;  // change this<br />
$canonicalScheme = &quot;http://&quot;;<br />
$canonicalServer = &quot;example.com&quot;; // change this<br />
$errorPageUri    = &quot;/error.asp&quot;;  // change this<br />
$documentRoot    = $_SERVER[&quot;DOCUMENT_ROOT&quot;];<br />
$requestUri      = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$canonicalUri    = &quot;&quot;;<br />
$requestedUrl    = $canonicalScheme .$canonicalServer .$requestUri;<br />
$canonicalUrl    = &quot;&quot;;<br />
$url             = parse_url($requestedUrl);<br />
$requestPath     = $url[&quot;path&quot;];<br />
$includeScript   = &quot;&quot;;<br />
$queryString     = isset($url[&quot;query&quot;]) ? $url[&quot;query&quot;] : &quot;&quot;;<br />
&nbsp;<br />
// keep misspelled URIs with nice search engine rankings<br />
if (&quot;$requestPath&quot; == &quot;/Sample.asp&quot;) {  // change this<br />
   $includeScript = $documentRoot .&quot;/sample.asp&quot;;  // change this<br />
}<br />
// &#8230;<br />
if (!empty($includeScript)) {<br />
   @header(&quot;HTTP/1.1 200 OK&quot;, TRUE, 200);<br />
   @include($includeScript);<br />
   exit;<br />
}<br />
&nbsp;<br />
// if the lowercase version exists, redirect to it<br />
$lcPath = strtolower($url[&quot;path&quot;]);<br />
$lcFile = $documentRoot .$lcPath;<br />
if (file_exists($lcFile) &#038;&#038; !stristr($requestUri,$thisFileName)) {<br />
    $canonicalUrl = $canonicalScheme .$canonicalServer .$lcPath;<br />
    if ($queryString) {<br />
        $canonicalUrl .= &quot;?&quot; .$queryString;<br />
    }<br />
    if (!empty($url[&quot;fragment&quot;])) {<br />
        $canonicalUrl .= &quot;#&quot; .$url[&quot;fragment&quot;];<br />
    }<br />
}<br />
if (!empty($canonicalUrl)) {<br />
    @header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    @header(&quot;Location: $canonicalUrl&quot;);<br />
    exit;<br />
}<br />
&nbsp;<br />
// serve the 404 error page<br />
@header(&quot;HTTP/1.1 404 Not found&quot;, TRUE, 404);<br />
@include($documentRoot .$errorPageUri);<br />
exit;<br />
?&gt;   </b></code><br />
(Edit the values in all lines marked with &#8220;// change this&#8221;.)</p>
<p>This script doesn&#8217;t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you&#8217;ve more than a dozen or so. </p>
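<p>As a starting point for such a site-specific routine, here is a minimal sketch. It assumes that your scripts treat parameter names and values as case-insensitive, which is by no means true for every site. The function name canonicalQueryString is made up for this example, and it is not part of the /404handler.php script above.</p>

```php
<?php
// Hypothetical helper (not part of the original 404handler.php):
// lowercases parameter names and values and sorts them by name, so that
// "ID=Value&Cat=SEO" and "cat=seo&id=value" end up as the same string.
function canonicalQueryString($queryString) {
    $params = array();
    parse_str($queryString, $params);
    $canonical = array();
    foreach ($params as $name => $value) {
        if (is_array($value)) {
            continue; // array parameters (a[]=1) are skipped in this sketch
        }
        $canonical[strtolower($name)] = strtolower($value);
    }
    ksort($canonical);
    // http_build_query() URL-encodes names and values again
    return http_build_query($canonical);
}
```

<p>When the result differs from the requested query string, the handler could 301-redirect to the canonical version, the same way it does for the path. Don&#8217;t lowercase values that are case sensitive in your application (session IDs, encoded tokens and such).</p>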
<p>Let&#8217;s see what the /404handler.php script does with requests of non-existing files. </p>
<p>First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don&#8217;t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn&#8217;t rank for anything but illustrates the functionality): both <a href="http://link-condom.com/sample.asp">/sample.asp</a> as well as <a href="http://link-condom.com/Sample.asp">/Sample.asp</a> deliver the same content, although there&#8217;s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordingly.</p>
<p>Next we look up the all-lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: <a href="http://link-condom.com/About.asp">/About.asp</a> 301-redirects to <a href="http://link-condom.com/about.asp">/about.asp</a>, which is the file that exists.</p>
<p>Finally, if everything we tried to find a suitable URI for the actual request failed, we send the client a 404 error code and output the error page. Example: <a href="http://link-condom.com/gimme404.asp" rel="nofollow crap">/gimme404.asp</a> doesn&#8217;t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp, but <a href="http://link-condom.com/error.asp">/error.asp</a> directly requested responds with a 200-OK.</p>
<p>You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project&#8217;s needs. </p>
<h3>Tweaking code for future maintenance</h3>
<p>Legacy code comes with repetition, redundancy and duplication caused by developers who love copy+paste or copy+paste+modify, or Web design software that generates static files from templates. Even when you&#8217;re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, which gives you the opportunity to encapsulate all templated page areas. </p>
<p>Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <code>&lt;?php @include("header.php"); ?&gt;</code>, <code>&lt;?php @include("sidebar.php"); ?&gt;</code>, <code>&lt;?php @include("footer.php"); ?&gt;</code> and so on. Note that PHP parses an included file in HTML mode from its first line, so plain HTML code works as is, and any PHP statements in the include file need their own <code>&lt;?php ?&gt;</code> tags. Also, leading spaces, empty lines and such, which don&#8217;t hurt in HTML, can result in errors with PHP statements like header(), because those fail when the server has sent anything to the user agent (even a single space, new line or tab is too much).</p>
<p>It&#8217;s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<code>&lt;?php @include("top.php"); ?&gt;</code> [&#8230;] <code>&lt;?php @include("bottom.php"); ?&gt;</code>) to all scripts. Tomorrow you&#8217;ll write a generic routine that you must have in all scripts, and you&#8217;ll happily do that in top.php. The day after tomorrow you&#8217;ll paste the GoogleAnalytics tracking code into bottom.php. With complex sites you need more hooks. </p>
<h3>Using absolute URLs on different systems</h3>
<p>Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn&#8217;t work with absolute URLs. Crap. Of course it works. The first statement in top.php is <code><b><br />
@require($_SERVER[&quot;SERVER_NAME&quot;] .&quot;.php&quot;); </b></code><br />
This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you&#8217;ve a &#8220;dev.example.com.php&#8221; include file, on the production system example.com the file name resolves to &#8220;example.com.php&#8221;: <code><b><br />
&lt;?php<br />
$baseUrl = &quot;http://example.com&quot;;<br />
?&gt;  </b></code><br />
Then the menu in sidebar.php looks like: <code><b><br />
&lt;?php<br />
$classVMenu = &quot;vmenu&quot;;<br />
print &quot;<br />
&lt;img src=\&quot;$baseUrl/vmenuheader.png\&quot; width=\&quot;128\&quot; height=\&quot;16\&quot; alt=\&quot;MENU\&quot; /&gt;<br />
&lt;ul&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/\&quot;&gt;Home&lt;/a&gt;&lt;/li&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/contact.asp\&quot;&gt;Contact&lt;/a&gt;&lt;/li&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/sitemap.asp\&quot;&gt;Sitemap&lt;/a&gt;&lt;/li&gt;<br />
&#8230;<br />
&lt;/ul&gt;<br />
&quot;;<br />
?&gt; </b></code><br />
Mixing X/HTML with server sided scripting languages is error-prone and makes maintenance a nightmare. Don&#8217;t make the same mistake as WordPress. Avoid crap like this: <code><br />
&lt;li&gt;&lt;a class=&quot;&lt;?php print $classVMenu; ?&gt;&quot; href=&quot;&lt;?php print $baseUrl; ?&gt;/contact.asp&quot;&gt;&lt;/a&gt;&lt;/li&gt; </code></p>
<h3>Error handling</h3>
<p>I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root&#8217;s .htaccess file: <code><b><br />
ErrorDocument 401 /get-the-fuck-outta-here.asp<br />
ErrorDocument 403 /get-the-fudge-outta-here.asp<br />
ErrorDocument 404 /404handler.php<br />
ErrorDocument 410 /410-gone-forever.asp<br />
ErrorDocument 503 /503-down-for-maintenance.asp<br />
# &#8230;<br />
Options -Indexes </b></code><br />
Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script: <code><b><br />
ErrorDocument 401 /error.php?errno=401<br />
ErrorDocument 403 /error.php?errno=403<br />
ErrorDocument 404 /404handler.php<br />
ErrorDocument 410 /error.php?errno=410<br />
ErrorDocument 503 /error.php?errno=503<br />
# &#8230;<br />
Options -Indexes </b></code><br />
Note that relative URLs in pages or scripts called by ErrorDocument directives don&#8217;t work. <b>Don&#8217;t use absolute URLs in the ErrorDocument directives themselves, because this way you get 302 response codes for 404 errors and crap like that.</b> If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that&#8217;s bad enough.) For more information please read my pamphlet <a href="http://sebastians-pamphlets.com/why-proper-error-handling-is-important/">Why error handling is important</a>. </p>
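<p>The single script /error.php referenced in the second variant could look roughly like this. It&#8217;s a sketch only: the function name errorStatus and the output are made up for this example, and it assumes that the errno parameter from the ErrorDocument lines arrives in $_GET on your server.</p>

```php
<?php
// error.php: hypothetical handler for ErrorDocument lines like
// "ErrorDocument 410 /error.php?errno=410".

// Map the errno parameter to a status code and reason phrase;
// anything unexpected is treated as a 404.
function errorStatus($errno) {
    $reasons = array(
        401 => "Unauthorized",
        403 => "Forbidden",
        404 => "Not Found",
        410 => "Gone",
        503 => "Service Unavailable",
    );
    $code = (int) $errno;
    if (!isset($reasons[$code])) {
        $code = 404;
    }
    return array($code, $reasons[$code]);
}

// Assumption: the errno value set in .htaccess is available in $_GET
$errno = isset($_GET["errno"]) ? $_GET["errno"] : 404;
list($code, $reason) = errorStatus($errno);

// Send the status line before any output
@header("HTTP/1.1 $code $reason", TRUE, $code);
print "<h1>$code $reason</h1>\n";
// ... explain the error and offer alternatives here,
// using absolute URLs for links, images and CSS
```

<p>Keeping the mapping in a function makes it easy to add more response codes later, and the fallback to 404 keeps a direct request of /error.php from answering with a 200-OK.</p>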
<p>Last but not least create a robots.txt file in the root. If you&#8217;ve nothing to hide from search engine crawlers, this one will suffice: <code><b><br />
User-agent: *<br />
Disallow:<br />
Allow: /<br />
</b></code></p>
<p>I&#8217;m aware that this tiny guide can&#8217;t cover everything. It should give you an idea of the pitfalls and possible solutions. If you&#8217;re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don&#8217;t view the source code of <a href="http://link-condom.com/">link-condom.com</a> pages where I didn&#8217;t implement all tips from this tutorial. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise</title>
		<link>http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/</link>
		<comments>http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/#comments</comments>
		<pubDate>Tue, 09 Oct 2007 14:57:53 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Cloaking]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/the-anatomy-of-http-redirects-301-302-307/</guid>
		<description><![CDATA[We find redirects on every Web site out there. They&#8217;re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, but seldom perfect from a SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictory, misleading and plain false advice on redirects. If you for example read &#8220;for [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/http-redirects.png" width="200" height="150" alt="HTTP Redirects" title="HTTP Redirects" style="margin-left:3px;" align="right"  />We find redirects on every Web site out there. They&#8217;re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, but seldom perfect from a SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictory, misleading and plain false advice on redirects. If you for example read &#8220;for SEO purposes you must make use of 301 redirects only&#8221; then better close the browser window/tab to protect yourself from crappy advice. A 302 or 307 redirect can be search engine friendly too.</p>
<p>With this post I do plan to bore you to death. So lean back, grab some popcorn, and stay tuned for a longish piece explaining the Interweb&#8217;s forwarding requests as dull as dust. Or, if you know everything about redirects, then please digg, sphinn and stumble this post before you surf away. Thanks.</p>
<ul id="redirect-jump-station" style="margin-bottom:25px;"><b>Jump Station</b>
<li class="toc-h3"><a href="#post-203">The anatomy of a server sided redirect</a></li>
<li class="toc-h4"><a href="#http-redirect-def">Redirects are defined in the HTTP protocol, not in search engine guidelines</a></li>
<li class="toc-h3"><a href="#whats-a-ss-redirect">What is a server sided redirect?</a></li>
<li class="toc-h4"><a href="#exec-ss-redirect">Execution of server sided redirects</a></li>
<li class="toc-h3"><a href="#http-redirect-header">What is an HTTP redirect header?</a></li>
<li class="toc-h4"><a href="#http-status-line">The redirect response code in a HTTP status line</a></li>
<li class="toc-h4"><a href="#http-header-location">The redirect header&#8217;s &#8220;location&#8221; directive</a></li>
<li class="toc-h3"><a href="#how-to-implement-ss-redirect">How to implement a server sided redirect?</a></li>
<li class="toc-h4"><a href="#redirect-server-config">Redirects in server configuration files</a></li>
<li class="toc-h4"><a href="#redirect-dir-files-htaccess">Redirecting directories and files with .htaccess</a></li>
<li class="toc-h4"><a href="#redirect-in-scripts">Redirects in server sided scripts</a></li>
<li class="toc-h3"><a href="#invisible-server-redirects">Redirects done by the Web server itself</a></li>
<li class="toc-h3"><a href="#redirect-or-not">Redirect or not? A few use cases&#8230;</a></li>
<li class="toc-h3"><a href="#choosing-a-redirect-response-code">Choosing the best redirect response code (301, 302, or 307)</a></li>
<li class="toc-h4"><a href="#301-moved-permanently">301 - Moved Permanently</a></li>
<li class="toc-h4"><a href="#moving-sites-301">Moving sites with 301 redirects</a></li>
<li class="toc-h4"><a href="#302-found-elsewhere">302 - Found [Elsewhere]</a></li>
<li class="toc-h4"><a href="#307-temporary-redirect">307 - Temporary Redirect</a></li>
<li class="toc-h3"><a href="#redirect-recap">Recap</a></li>
</ul>
<h4 id="http-redirect-def">Redirects are defined in the HTTP protocol, not in search engine guidelines</h4>
<p>For the moment please forget everything you&#8217;ve heard about redirects and their SEO implications, clear your mind, and follow me to the very basics defined in the HTTP protocol. Of course search engines interpret some redirects in a non-standard way, but understanding the norm as well as its use and abuse is necessary to deal with server sided redirects. I don&#8217;t bother with outdated HTTP 1.0 stuff, although some search engines still apply it every once in a while, hence I&#8217;ll discuss the 307 redirect introduced in <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">HTTP 1.1</a> too. For information on client sided redirects please refer to <a href="http://sebastians-pamphlets.com/google-and-yahoo-treat-undelayed-meta-refresh-as-301-redirect/">Meta Refresh - the poor man&#8217;s 301 redirect</a> or read my other <a href="http://sebastians-pamphlets.com/links/categories/?cat=redirects">pamphlets on redirects</a>, and stay away from <a href="http://sebastians-pamphlets.com/links/categories/?cat=javascript-redirects">JavaScript URL manipulations</a>.</p>
<h3 id="whats-a-ss-redirect">What is a server sided redirect?</h3>
<p>Think about an HTTP redirect as a forwarding request. Although redirects work slightly differently from snail mail forwarding requests, this analogy perfectly fits the <em>procedure</em>. Whilst with <a href="https://moversguide.usps.com/?referral=USPS">US Mail forwarding requests</a> a clerk or postman writes the new address on the envelope before it bounces in front of a no longer valid or temporarily abandoned letter-box or pigeon hole, on the Web the request&#8217;s location (that is the Web server responding to the <em>server name</em> part of the URL) provides the requestor with the new location (absolute URL). </p>
<p>A server sided redirect tells the user agent (browser, Web robot, &#8230;) that it has to perform another request for the URL given in the HTTP header&#8217;s &#8220;location&#8221; line in order to fetch the requested contents. The type of the redirect (301, 302 or 307) also instructs the user agent how to perform future requests of the Web resource. Because search engine crawlers/indexers try to emulate human traffic with their content requests, it&#8217;s important to choose the right redirect type both for humans and robots. That does not mean that a 301-redirect is always the best choice, and it certainly does not mean that you always must return the same HTTP response code to crawlers and browsers. More on that later.</p>
<h4 id="exec-ss-redirect">Execution of server sided redirects</h4>
<p>Server sided redirects are executed <b>before</b> your server delivers any content. In other words, your server ignores everything it <b>could</b> deliver (be it a static HTML file, a script output, an image or whatever) when it runs into a redirect condition. Some redirects are done by the server itself (see <a href="http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/#incomplete-uri">handling incomplete URIs</a>), and there are several places where you can set (conditional) redirect directives: Apache&#8217;s <a href="http://httpd.apache.org/docs/2.2/configuring.html">httpd.conf</a>, <a href="http://httpd.apache.org/docs/2.2/howto/htaccess.html">.htaccess</a>, or in application layers for example in <a href="http://www.php.net/manual/en/function.header.php">PHP scripts</a>. (If you suffer from IIS/ASP maladies, <a href="http://www.cumbrowski.com/CarstenC/seo_301redirect_aspsrc.asp">this post</a> is for you.) <b>Examples:</b></p>
<table cellpadding="1" cellspacing="0" border="1" bordercolor="gray">
<tr>
<th><b>Browser Request:</b></th>
<th><code><b>ww.site.com<br />/page.php?id=1</b></code></th>
<th><code><b>site.com<br />/page.php?id=1</b></code></th>
<th><code><b>www.site.com<br />/page.php?id=1</b></code></th>
<th><code><b>www.site.com<br />/page.php?id=2</b></code></th>
</tr>
<tr>
<td><b>Apache:</b></td>
<td>301 header:<br /><code>www.site.com<br />/page.php?id=1</code></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td><b>.htaccess:</b></td>
<td>&nbsp;</td>
<td>301 header:<br /><code>www.site.com<br />/page.php?id=1</code></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td><b>/page.php:</b></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td valign="top">301 header:<br /><code>www.site.com<br />/page.php?id=2</code></td>
<td valign="top">200 header:<br /><code>(Info like content length...)</code><br />
<hr />Content:<br />Article #2</td>
</tr>
</table>
<p>The 301 header may or may not be followed by a hyperlink pointing to the new location, solely added for user agents which can&#8217;t handle redirects. Besides that link, there&#8217;s no content sent to the client <b>after</b> the redirect header.</p>
<p>More important, you must not send a single byte to the client <b>before</b> the HTTP header. If you for example code <code>[space(s)|tab|new-line|HTML code]&lt;?php ...</code> in a script that shall perform a redirect or is supposed to return a 404 header (or any HTTP header different from the server&#8217;s default instructions), you&#8217;ll produce a runtime error. The redirection fails, leaving the visitor with an ugly page full of cryptic error messages but no link to the new location.</p>
<p>That means in each and every page or script which possibly has to deal with the HTTP header, put the logic testing those conditions at the very top. <strong>Always send the header status code and optional further information like a new location to the client before you process the contents.</strong> </p>
<p>After the last redirect header line terminate execution with the &#8220;L&#8221; parameter in .htaccess, PHP&#8217;s <code>exit;</code> statement, or whatever.</p>
<h3 id="http-redirect-header">What is an HTTP redirect header?</h3>
<p>An HTTP redirect, regardless of its type, consists of two lines in the HTTP header. In this example I&#8217;ve requested http://www.sebastians-pamphlets.com/about/, which is an invalid URI because my server name lacks the www-thingy, hence my canonicalization routine outputs this HTTP header:<code><br />
<b>HTTP/1.1 301 Moved Permanently</b><br />
<span style="color:gray;">Date: Mon, 01 Oct 2007 17:45:55 GMT<br />
Server: Apache/1.3.37 (Unix) PHP/4.4.4</span><br />
<b>Location: http://sebastians-pamphlets.com/about/</b><br />
<span style="color:gray;">Connection: close<br />
Transfer-Encoding: chunked<br />
Content-Type: text/html; charset=iso-8859-1</span></code></p>
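<p>To make the two relevant lines tangible, here is a minimal Python sketch (not part of my setup, just an illustration) that picks the status line and the location field out of a raw response header. The function name <code>parse_redirect</code> is made up for this example, and the header text is a shortened copy of the one above.</p>

```python
def parse_redirect(raw_header: str):
    """Return (status_code, reason_phrase, location) from a raw HTTP response header."""
    lines = raw_header.strip().splitlines()
    # Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase
    version, code, reason = lines[0].split(" ", 2)
    location = None
    for line in lines[1:]:
        # Header field names are not case sensitive
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return int(code), reason, location


raw = """HTTP/1.1 301 Moved Permanently
Date: Mon, 01 Oct 2007 17:45:55 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
Location: http://sebastians-pamphlets.com/about/
Connection: close
"""

print(parse_redirect(raw))
```

<p>Everything else in the header (date, server, connection info) is irrelevant for the redirect itself.</p>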
<h4 id="http-status-line">The redirect response code in an HTTP status line</h4>
<p>The first line of the header defines the protocol version, the response code, and provides a human readable reason phrase. Here is a shortened and slightly modified excerpt quoted from the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6">HTTP/1.1 protocol definition</a>:<br />
<blockquote><b>Status-Line</b></p>
<p>The first line of a <em>Response message</em> is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by <acronym title="Space, blank, ASCII 0x20">SP</acronym> (space) characters. No <acronym title="Carriage Return, ASCII 0x0D">CR</acronym> or <acronym title="Line Feed, ASCII 0x0A">LF</acronym> is allowed except in the final <acronym title="New Line, CR followed by LF">CRLF</acronym> sequence.</p>
<p>Status-Line = HTTP-Version <i>SP</i> Status-Code <i>SP</i> Reason-Phrase <i>CRLF</i><br />
[e.g. &#8220;HTTP/1.1 301 Moved Permanently&#8221; + CRLF]</p>
<p><b>Status Code and Reason Phrase</b></p>
<p>The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. [&#8230;] The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.</p>
<p>The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. [&#8230;]:<br />
[&#8230;]<br />
- <b>3xx</b>: Redirection - Further action must be taken in order to complete the request<br />
[&#8230;]</p>
<p>The individual values of the numeric status codes defined for HTTP/1.1, and an example set of corresponding Reason-Phrases, are presented below. The reason phrases listed here are only recommendations &#8212; they MAY be replaced by local equivalents without affecting the protocol [that means you could translate and/or rephrase them].<br />
[&#8230;]<br />
<span style="color:gray;">300: Multiple Choices</span><br />
<b>301: Moved Permanently</b><br />
<b>302: Found [Elsewhere]</b><br />
<span style="color:gray;">303: See Other<br />
304: Not Modified<br />
305: Use Proxy</span><br />
<b>307: Temporary Redirect</b><br />
[&#8230;]</p></blockquote>
<p>In terms of SEO the understanding of 301/302-redirects is important. 307-redirects, introduced with HTTP/1.1, can still confuse some search engines, even major players like Google when Ms. Googlebot for some reason thinks she <em>must</em> do HTTP/1.0 requests, usually caused by weird or ancient server configurations (or possibly testing newly discovered sites under certain circumstances). You should not perform 307 redirects in response to most HTTP/1.0 requests, use 302/301 &#8211;whatever fits best&#8211; instead. More info on this issue below in the 302/307 sections.</p>
<p>Please note that the default response code of all redirects is 302. That means when you send an HTTP header with a location directive but without an explicit response code, your server will return a 302-Found status line. That&#8217;s kinda crappy, because in most cases you want to avoid the 302 code like the plague. Do no nay never rely on default response codes! <strong>Always prepare a server sided redirect with a status line telling an actual response code (301, 302 or 307)!</strong> In server sided scripts (PHP, Perl, ColdFusion, JSP/Java, ASP/VB-Script&#8230;) always send a complete status line, and in .htaccess or httpd.conf add a <code>[R=301|302|307<span style="color:gray;">,L</span>]</code> parameter to statements like <code>RewriteRule</code>: <code><br />
RewriteRule (.*) http://www.site.com/$1 [R=301,L]</code></p>
<h4 id="http-header-location">The redirect header&#8217;s &#8220;location&#8221; field</h4>
<p>The next element you need in every redirect header is the <b>location</b> directive. Here is the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.30">official syntax</a>:<br />
<blockquote>
<b>Location</b></p>
<p>The Location response-header field is used to redirect the recipient to a location other than the Request-URI for completion of the request or identification of a new resource. [&#8230;] For 3xx responses, the location SHOULD indicate the server&#8217;s preferred URI for automatic redirection to the resource. The field value consists of a single absolute URI.</p>
<p>Location = &#8220;Location&#8221; &#8220;:&#8221; absoluteURI [+ CRLF]</p>
<p>An example is:<br />
<code><br />
Location: http://sebastians-pamphlets.com/about/</code></p></blockquote>
<p><img src="http://sebastians-pamphlets.com/img/posts/redirect-to-an-absolute-url.png" width="200" height="150" alt="Redirect to absolute URLs only" title="A redirect's location is ALWAYS an absolute URL!" style="margin-left:3px;" align="right"  />Please note that the value of the location field must be an <b>absolute URL</b>, that is a fully qualified URL with scheme (http|https), server name (domain|subdomain), and path (directory/file name) plus the optional query string (&#8221;?&#8221; followed by variable/value pairs like <code>?id=1&amp;page=2...</code>), no longer than 2047 bytes (better 255 bytes because most scripts out there don&#8217;t process longer URLs <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.1">for historical reasons</a>). A relative URL like <code>../page.php</code> <em>might</em> work in (X)HTML (although you better plan a spectacular suicide than any use of relative URIs!), but <strong>you must not use relative URLs in HTTP response headers</strong>!</p>
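<p>If you want to verify a redirect target before it goes into a header, a check along these lines might do. This is a rough Python sketch using the standard library&#8217;s <code>urllib.parse</code>; the helper names <code>is_absolute_url</code> and <code>make_absolute</code> are assumptions of this example, not anything your server provides.</p>

```python
from urllib.parse import urljoin, urlsplit


def is_absolute_url(value: str) -> bool:
    """True if value has a scheme (http|https) and a server name."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def make_absolute(request_url: str, target: str) -> str:
    """Resolve a relative target against the URL that was requested."""
    return urljoin(request_url, target)


print(is_absolute_url("http://sebastians-pamphlets.com/about/"))
print(is_absolute_url("../page.php"))
print(make_absolute("http://example.com/misc/contact.html", "/cms/contact.php"))
```

<p>The idea is to resolve relative targets on the server, so that only fully qualified URLs ever reach the location field.</p>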
<h3 id="how-to-implement-ss-redirect">How to implement a server sided redirect?</h3>
<p>You can perform HTTP redirects with statements in your Web server&#8217;s configuration, and in server sided scripts, e.g. PHP or Perl. JavaScript is a client sided language and therefore lacks a mechanism to do HTTP redirects. That means a JS redirect comes without a redirect response code of its own, hence all JS redirects count as a 302-Found response.</p>
<p>Bear in mind that when you redirect, you possibly leave tracks of outdated structures in your HTML code, not to speak of incoming links. You must change each and every internal link to the new location, as well as all external links you control or where you can ask for a URL update. If you leave any outdated links, visitors probably won&#8217;t spot them (although every redirect slows things down), but search engine spiders continue to follow them, which eventually ends in <a href="http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/">redirect chains</a>. Chained redirects often are the cause of deindexing pages, site areas or even complete sites by search engines, hence do no more than one redirect in a row and consider two redirects in a row risky. You don&#8217;t control offsite redirects; in some cases a search engine has already counted one or two redirects before it requests your redirecting URL (caused by redirecting traffic counters etcetera). <b>Always redirect to the final destination to avoid useless hops which kill your search engine traffic.</b> (<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40132">Google recommends</a> &#8220;that you use fewer than five redirects for each request&#8221;, but don&#8217;t try to max out such limits because other services might be less BS-tolerant.)</p>
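<p>Redirecting to the final destination can be checked mechanically when you keep your redirects in a lookup table. The following Python sketch assumes such a table as a plain dictionary (old path mapped to new path); the paths and the function names <code>resolve</code> and <code>flatten</code> are invented for this illustration.</p>

```python
def resolve(url, redirects, max_hops=5):
    """Follow url through the redirect table; return (final_url, hops)."""
    hops = 0
    seen = {url}
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError("redirect loop or too many hops")
        seen.add(url)
    return url, hops


def flatten(redirects):
    """Point every source directly at its final destination (one hop each)."""
    return {src: resolve(src, redirects)[0] for src in redirects}


# Hypothetical table that grew over two revamps
redirects = {
    "/old.html": "/2006/page.html",
    "/2006/page.html": "/page/",
}

print(resolve("/old.html", redirects))
print(flatten(redirects))
```

<p>Running something like <code>flatten</code> after each revamp leaves every outdated URL exactly one hop away from its current location.</p>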
<p>Like conventional forwarding requests, redirects do expire. Even a permanent 301-redirect&#8217;s source URL will be requested by search engines every now and then because they can&#8217;t trust you. As long as there is one single link pointing to an outdated and redirecting URL out there, it&#8217;s not forgotten. It will stay alive in search engine indexes and address books of crawling engines even when the last link pointing to it was changed or removed. You can&#8217;t control that, and you can&#8217;t find all inbound links a search engine knows, despite their better reporting nowadays (neither <a href="https://siteexplorer.search.yahoo.com/">Yahoo&#8217;s site explorer</a> nor <a href="https://www.google.com/webmasters/tools/siteoverview">Google&#8217;s link stats</a> show you all links!). That means <b>you must maintain your redirects forever, and you must not remove (permanent) redirects</b>. Maintenance of redirects includes hosting abandoned domains, and updates of location directives whenever you change the final structure. <b>With each and every revamp that comes with URL changes check for incoming redirects and make sure that you eliminate unnecessary hops.</b></p>
<p>Often you&#8217;ve many choices where and how to implement a particular redirect. You can do it in scripts and even static HTML files, CMS software, or in the server configuration. There&#8217;s no such thing as a general best practice, just a few hints to bear in mind.
<ul>
<li><img src="http://sebastians-pamphlets.com/img/posts/redirects-are-dynamite-so-blast-carefully.png" width="150" height="164" alt="Redirects are dynamite, so blast carefully" title="SEO wise, the best redirect is no redirect at all!" style="margin-left:3px;" align="right"  /><b>Doubt</b>: Don&#8217;t believe Web designers and developers when they say that a particular task can&#8217;t be done without redirects. Do your own research, or ask an SEO expert. When you for example plan to make a static site dynamic by pulling the contents from a database with PHP scripts, you don&#8217;t need to change your file extensions from *.html to *.php. Apache can parse .html files for PHP, just enable that in your root&#8217;s .htaccess: <code><br />
AddType application/x-httpd-php .html <span style="color:gray;">.htm .shtml .txt .rss .xml .css</span></code><br />
Then generate tiny PHP scripts calling the CMS to replace the outdated .html files. That&#8217;s not perfect but way better than URL changes, provided your developers can manage the outdated links in the CMS&#8217; navigation. Another pretty popular abuse of redirects is click tracking. You don&#8217;t need a redirect script to count clicks in your database, <a href="http://sebastians-pamphlets.com/how-to-turn-click-tracking-into-miserable-failure/">make use of the onclick event instead</a>. </li>
<li><b>Transparency</b>: When the shit hits the fan and you need to track down a redirect with no more than the HTTP header&#8217;s information in your hands, you&#8217;ll begin to believe that performance and elegant coding are not everything. Reading and understanding a large httpd.conf file, several complex .htaccess files, and searching redirect routines in a conglomerate of a couple generations of scripts and include files is not exactly fun. You could add a custom field identifying the piece of redirecting code to the HTTP header. In .htaccess that would be achieved with <code><br />
Header add X-Redirect-Src &quot;/content/img/.htaccess&quot;</code><br />
and in PHP with <code><br />
header(&quot;X-Redirect-Src: /scripts/inc/header.php&quot;, TRUE);</code><br />
(Whether or not you should encode or at least obfuscate code locations in headers depends on your security requirements.) </li>
<li><b>Encapsulation</b>: When you must implement redirects in more than one script or include file, then encapsulate all redirects including all the logic (redirect conditions, determining new locations, &#8230;). You can do that in an include file with a meaningful file name for example. Also, instead of plastering the root&#8217;s .htaccess file with tons of directory/file specific redirect statements, you can gather all requests for redirect candidates and call a script which tests the REQUEST_URI to execute the suitable redirect. In .htaccess put something like:<code><br />
RewriteEngine On<br />
RewriteBase /old-stuff<br />
RewriteRule ^(.*)\.html$ do-redirects.php</code><br />
This code calls /old-stuff/do-redirects.php for each request of an .html file in /old-stuff/. The PHP script: <code><br />
$requestUri = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$location = &quot;&quot;; // stays empty when no redirect condition matches<br />
if (stristr($requestUri, &quot;/contact.html&quot;)) {<br />
    $location = &quot;http://example.com/new-stuff/contact.htm&quot;;<br />
}<br />
...<br />
if ($location) {<br />
    @header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    @header(&quot;X-Redirect-Src: /old-stuff/do-redirects.php&quot;, TRUE);<br />
    @header(&quot;Location: $location&quot;);<br />
    exit;<br />
}<br />
else {<br />
    [output the requested file or whatever]<br />
}</code><br />
(This is also an example of a redirect include file which you could insert at the top of a header.php include or so. In fact, you can include this script in some files <em>and</em> call it from .htaccess without modifications.) This method will not work with ASP on IIS because amateurish wannabe Web servers don&#8217;t provide the REQUEST_URI variable.</li>
<li><b>Documentation</b>: When you design or update an information architecture, your documentation should contain a redirect chapter. Also comment all redirects in the source code (your ingenious regular expressions might lack readability when someone else looks at your code). It&#8217;s a good idea to have a documentation file explaining all redirects on the Web server (you might work with other developers when you change your site&#8217;s underlying technology in a few years).</li>
<li><b>Maintenance</b>: Debugging legacy code is a nightmare. And yes, what you write today becomes legacy code in a few years. Thus keep it simple and stupid, implement redirects transparently rather than elegantly, and don&#8217;t forget that you must change your ancient redirects when you revamp a site area which is the target of redirects.</li>
<li><b>Performance</b>: Even when performance is an issue, you can&#8217;t do everything in httpd.conf. When you for example move a large site changing the URL structure, the redirect logic becomes too complex in most cases. You can&#8217;t do database lookups and stuff like that in server configuration files. However, some redirects like for example server name canonicalization should be performed there, because they&#8217;re simple and not likely to change. If you can&#8217;t change httpd.conf, .htaccess files are for you. They&#8217;re slower than cached config files but still faster than application scripts.</li>
</ul>
<h4 id="redirect-server-config">Redirects in server configuration files</h4>
<p>Here is an example of a canonicalization redirect in the root&#8217;s .htaccess file: <code><br />
RewriteEngine On<br />
RewriteCond %{HTTP_HOST} !^sebastians-pamphlets\.com [NC]<br />
RewriteRule (.*) http://sebastians-pamphlets.com/$1 [R=301,L]</code>
<ol>
<li>The first line switches on the rewriting engine of Apache&#8217;s mod_rewrite module. Make sure the module is available on your box before you copy, paste and modify the code above.</li>
<li>
<p>The second line checks the server name in the HTTP request header (received from a browser, robot, &#8230;). The &#8220;NC&#8221; parameter ensures that the test of the server name (which is, like the scheme part of the URI, not case sensitive by <a href="http://www.ietf.org/rfc/rfc2396.txt">definition</a>) is done as intended. Without this parameter a request of http://SEBASTIANS-PAMPHLETS.COM/ would run into an unnecessary redirect. The rewrite condition returns TRUE when the server name is <b>not</b> sebastians-pamphlets.com. There&#8217;s an important detail: <b>not</b> &#8220;!&#8221; </p>
<p>Most Webmasters do it the other way round. They check if the server name equals an unwanted server name, for example with <code>RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]</code>. That&#8217;s not exactly efficient, and error-prone. It&#8217;s not efficient because one needs to add a rewrite condition for each and every server name a user could type in and the Web server would respond to. On most machines that&#8217;s a huge list like &#8220;w.example.com, ww.example.com, w-w-w.example.com, &#8230;&#8221; because the default server configuration catches all not explicitly defined subdomains. </p>
<p>Of course next to nobody puts that many rewrite conditions into the .htaccess file, hence this method is error-prone and not suitable to fix canonicalization issues. In combination with thoughtless usage of relative links (bullcrap that most designers and developers love out of laziness and lack of creativity or at least fantasy), one single link to an existing page on a non-existing subdomain not redirected in such an .htaccess file could result in search engines crawling and possibly even indexing a complete site under the unwanted server name. When a <a href="http://fantomaster.com/fantomNews/archives/2007/07/20/negative-seo-inverse-seo-or-black-seo-whats-it-to-be/">savvy competitor</a> spots this exploit you can say good bye to a fair amount of your search engine traffic.</p>
<p>Another advantage of my single line of code is that you can point all domains you&#8217;ve registered to catch type-in traffic or whatever to the same Web space. Every new domain runs into the canonicalization redirect, 100% error-free.</p>
</li>
<li>The third line performs the 301 redirect to the requested URI using the canonical server name. That means when the request URI was http://www.sebastians-pamphlets.com/about/, the user agent gets redirected to http://sebastians-pamphlets.com/about/. The &#8220;R&#8221; parameter sets the response code, and the &#8220;L&#8221; (last) parameter means <em>leave when the rule matches</em> (=exit), that is the statements following the redirect execution, like other rewrite rules and such stuff, will not be parsed.</li>
</ol>
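<p>The &#8220;redirect everything that is <b>not</b> the canonical server name&#8221; logic is easy to port to a script. Here is a minimal Python sketch of the same idea; the function name <code>canonical_redirect</code> is an assumption of this example. Unlike the rewrite condition above it compares the complete server name (after stripping an optional port) instead of testing a prefix pattern.</p>

```python
CANONICAL_HOST = "sebastians-pamphlets.com"


def canonical_redirect(host, request_uri):
    """Return (301, location) for any non-canonical host, else None."""
    # Server names are not case sensitive (the [NC] flag in .htaccess);
    # drop an optional ":port" suffix before comparing.
    name = host.split(":", 1)[0].lower()
    if name == CANONICAL_HOST:
        return None
    return 301, "http://" + CANONICAL_HOST + request_uri


print(canonical_redirect("www.sebastians-pamphlets.com", "/about/"))
print(canonical_redirect("SEBASTIANS-PAMPHLETS.COM", "/"))
```

<p>Every server name you did not think of (ww, w-w-w, a newly pointed domain) ends up in the redirect branch without further conditions.</p>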
<p>If you&#8217;ve access to your server&#8217;s httpd.conf file (which most hosting services don&#8217;t allow), then better do such redirects there. The reason for this recommendation is that Apache must look for .htaccess directives in the current directory and all its upper levels for each and every requested file. If the request is for a page with lots of embedded images or other objects, that sums up to hundreds of hard disk accesses slowing down the page loading time. The server configuration on the other hand is cached and therefore way faster. Learn more about <a href="http://httpd.apache.org/docs/2.2/howto/htaccess.html#when">.htaccess disadvantages</a>. However, since most Webmasters can&#8217;t modify their server configuration, I provide .htaccess examples only. If you can, then you know how to put it in httpd.conf. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<h4 id="redirect-dir-files-htaccess">Redirecting directories and files with .htaccess</h4>
<p>When you need to redirect chunks of static pages to another location, the easiest way to do that is Apache&#8217;s <a href="http://httpd.apache.org/docs/2.2/mod/mod_alias.html#redirect">redirect directive</a>. The basic syntax is <code>Redirect [301|302|307] Path URL</code>, e.g. <code>Redirect 307 /blog/feed http://feedburner.com/myfeed</code> or <code>Redirect 301 /contact.htm /blog/contact/</code>. <code>Path</code> is always a URL path beginning with a slash, relative to the Web space&#8217;s root (not a file system path). <code>URL</code> is either a fully qualified URL (on another machine) like http://feedburner.com/myfeed, or a relative URL on the same server like /blog/contact/ (Apache adds scheme and server in this case, so that the HTTP header is built with an absolute URL in the location field; however, omitting the scheme+server part of the target URL is not recommended, see the warning below).</p>
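<p>When you for example want to consolidate a blog on its own subdomain and a corporate Web site at example.com, then put <code><br />
 Redirect 301 / http://example.com/blog/</code><br />
in the .htaccess file of blog.example.com. When you then request http://blog.example.com/category/post.html you&#8217;re redirected to http://example.com/blog/category/post.html. (The target needs its trailing slash here because the redirected path is the slash itself: Apache appends whatever follows the matched path to the target, so <code>http://example.com/blog</code> would produce http://example.com/blogcategory/post.html.)</p>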
<p>Say you&#8217;ve moved your product pages from /products/*.htm to /shop/products/*.htm then put <code><br />
Redirect 301 /products http://example.com/shop/products</code></p>
<p>Omit the trailing slashes when you redirect directories. To redirect particular files on the other hand you must fully qualify the locations: <code><br />
Redirect 302 /misc/contact.html http://example.com/cms/contact.php</code><br />
or, when the new location resides on the same server: <code><br />
Redirect 301 /misc/contact.html /cms/contact.php</code></p>
<p><b style="color:red;">Warning:</b> Although Apache allows local redirects like <code>Redirect 301 /misc/contact.html /cms/contact.php</code>, with some server configurations this will result in 500 server errors on all requests. Therefore I recommend the use of fully qualified URLs as redirect target, e.g. <code>Redirect 301 /misc/contact.html <b>http://example.com</b>/cms/contact.php</code>!</p>
<p>Maybe you found a reliable and unbeatably cheap hosting service to host your images. Copy all image files from example.com to image-example.com and keep the directory structures as well as all file names. Then add to example.com&#8217;s .htaccess <code><br />
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://www.image-example.com$1.$2</code><br />
The regex should match e.g. <code>/img/nav/arrow-left.png</code> so that the user agent is forced to request http://www.image-example.com<b>/img/nav/arrow-left.png</b>. Say you&#8217;ve converted your GIFs and JPGs to the PNG format during this move, simply change the redirect statement to <code><br />
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://www.image-example.com$1.png</code><br />
With regular expressions and <a href="http://httpd.apache.org/docs/2.2/mod/mod_alias.html#redirectmatch">RedirectMatch</a> you can perform very creative redirects.</p>
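<p>Before you put such a regex live, you can try it offline. The following Python sketch applies the same pattern with the standard <code>re</code> module; the function name <code>image_redirect</code> and its <code>keep_extension</code> switch are made up for this example, and Python&#8217;s regex flavor is only assumed to behave like Apache&#8217;s for a simple pattern like this one.</p>

```python
import re

# Same pattern as in the RedirectMatch statements above
IMAGE_PATTERN = re.compile(r"(.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$")


def image_redirect(path, keep_extension=True):
    """Return the new image URL for path, or None if path is not an image."""
    match = IMAGE_PATTERN.search(path)
    if match is None:
        return None
    # $1 is the path without extension, $2 the original extension
    extension = match.group(2) if keep_extension else "png"
    return "http://www.image-example.com" + match.group(1) + "." + extension


print(image_redirect("/img/nav/arrow-left.png"))
print(image_redirect("/img/photo.JPG", keep_extension=False))
print(image_redirect("/about/"))
```

<p>Feeding a handful of real paths from your logs through such a test catches most regex slips before a crawler does.</p>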
<p>Please note that the response codes used in the code examples above most probably do not fit the type of redirect you&#8217;d do in real life with similar scenarios. I&#8217;ll discuss use cases for all redirect response codes (301|302|307) later on.</p>
<h4 id="redirect-in-scripts">Redirects in server sided scripts</h4>
<p>You can do HTTP redirects only with server sided programming languages like PHP, ASP, Perl etcetera. Scripts in those languages generate the output before anything is sent to the user agent. It should be a no-brainer, but these PHP examples don&#8217;t count as server sided redirects: <code><br />
print &quot;&lt;META HTTP-EQUIV=Refresh CONTENT=\&quot;0; URL=http://example.com/\&quot;&gt;\n&quot;;<br />
print &quot;&lt;script type=\&quot;text/javascript\&quot;&gt;window.location = \&quot;http://example.com/\&quot;;&lt;/script&gt;\n&quot;;</code><br />
Just because you can output a redirect with a server sided language that does not make the redirect an HTTP redirect. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>In PHP you perform HTTP redirects with the <a href="http://www.php.net/manual/en/function.header.php">header() function</a>: <code><br />
$newLocation = &quot;http://example.com/&quot;;<br />
@header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
@header(&quot;Location: $newLocation&quot;);<br />
exit;</code><br />
The first input parameter of header() is the complete header line, in the first line of code above that&#8217;s the status-line. The second parameter tells whether a previously sent header line shall be replaced (default behavior) or not. The third parameter sets the HTTP status code, don&#8217;t use it more than once. If you use an ancient PHP version (prior to 4.3.0) you can&#8217;t use the 2nd and 3rd input parameter. The &#8220;@&#8221; suppresses PHP warnings and error messages.</p>
<p>With ColdFusion you code <code><br />
&lt;CFHEADER statuscode=&quot;307&quot; statustext=&quot;Temporary Redirect&quot;&gt;<br />
&lt;CFHEADER name=&quot;Location&quot; value=&quot;http://example.com/&quot;&gt; </code></p>
<p>A redirecting Perl script begins with <code><br />
#!/usr/bin/perl -w<br />
use strict;<br />
print &quot;Status: 302 Found Elsewhere\r\n&quot;, &quot;Location: http://example.com/\r\n\r\n&quot;;<br />
exit; </code></p>
<p>Even with ASP you can do server sided redirects. VBScript: <code><br />
Dim newLocation<br />
newLocation = &quot;http://example.com/&quot;<br />
Response.Status = &quot;301 Moved Permanently&quot;<br />
Response.AddHeader &quot;Location&quot;, newLocation<br />
Response.End </code><br />
JScript: <code><br />
function RedirectPermanent(newLocation) {<br />
Response.Clear();<br />
Response.Status = &quot;301 Moved Permanently&quot;;<br />
Response.AddHeader(&quot;Location&quot;, newLocation);<br />
Response.Flush();<br />
Response.End();<br />
}<br />
...<br />
Response.Buffer = true;<br />
...<br />
RedirectPermanent (&quot;http://example.com/&quot;); </code><br />
Again, if you suffer from IIS/ASP maladies: <a href="http://www.cumbrowski.com/CarstenC/seo_301redirect_aspsrc.asp">here you go</a>.</p>
<p><b><a href="#exec-ss-redirect">Remember</a>: Don&#8217;t output anything before the redirect header, and nothing after the redirect header!</b></p>
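<p>For the sake of completeness, the same pattern in Python: a minimal WSGI sketch that sends an explicit status line plus the location field and nothing else. The target URL and the names <code>redirect_app</code> and <code>fake_start_response</code> are assumptions of this example; the fake callback only stands in for a real server so that the script runs on its own.</p>

```python
def redirect_app(environ, start_response):
    """Minimal WSGI application that answers every request with a 301."""
    new_location = "http://example.com/"  # hypothetical target
    # Explicit status line first, then the Location field; no content follows
    start_response(
        "301 Moved Permanently",
        [("Location", new_location), ("Content-Length", "0")],
    )
    return [b""]


captured = {}


def fake_start_response(status, headers, exc_info=None):
    """Stand-in for the server's callback; records what the app sends."""
    captured["status"] = status
    captured["headers"] = dict(headers)


body = redirect_app({}, fake_start_response)
print(captured["status"], captured["headers"]["Location"])
```

<p>As in the PHP example, the response code is stated explicitly instead of relying on a default.</p>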
<h3 id="invisible-server-redirects">Redirects done by the Web server itself</h3>
<p>When you read your raw server logs, you&#8217;ll find a few 302 and/or 301 redirects Apache has performed without an explicit redirect statement in the server configuration, .htaccess, or a script. Most of these automatic redirects are the result of a very popular bullshit practice: removing trailing slashes. Although the standard defines that an URI like <code>/directory</code> is not a file name by default, therefore equals <code>/directory/</code> if there&#8217;s no file named <code>/directory</code>, choosing the version without the trailing slash is lazy at least, and creates lots of troubles (404s in some cases, otherwise external redirects, but always duplicate content issues you should fix with URL canonicalization routines). </p>
<p>For example Yahoo is a big fan of truncated URLs. They might save a few terabytes in their indexes by storing URLs without the trailing slash, but they send every user&#8217;s browser twice to those locations. Web servers must do a 302 or 301 redirect on each Yahoo-referrer requesting a directory or pseudo-directory, because they can&#8217;t serve the default document of an omitted path segment (the path component of an URI begins with a slash, the slash is its segment delimiter, and a trailing slash stands for the last (or only) segment representing a default document like index.html). From the Web server&#8217;s perspective <code>/directory</code> does not equal <code>/directory/</code>, only <code>/directory/</code> addresses <code>/directory/index.(htm|html|shtml|php|...)</code>, whereby the file name of the default document must be omitted (among other things to preserve the URL structure when the underlying technology changes). Also, the requested URI without its trailing slash <em>may</em> address a file or an on the fly output (if you make use of mod_rewrite to mask ugly URLs you better test what happens with screwed URIs of yours). </p>
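<p>To make the server&#8217;s decision tangible, here is a minimal Python sketch of a trailing slash canonicalization check. It is an illustration under assumptions, not what Apache does internally: the <code>DIRECTORIES</code> set and the function name are made up, and a real site would look up its directories and pseudo-directories in the file system or a database.</p>

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of directory (or pseudo-directory) paths in canonical form.
DIRECTORIES = {"/directory/", "/blog/contact/"}

def trailing_slash_redirect(url):
    """Return the canonical URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path is normalized to "/"
    if not path.endswith("/") and path + "/" in DIRECTORIES:
        path += "/"                   # restore the truncated trailing slash
    if path == parts.path:
        return None                   # URL is already canonical
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

<p>A request for a truncated directory URL gets the canonical location back, which the caller then sends with a 301 status. Paths which aren&#8217;t known directories are left alone, because without the trailing slash they may address a file.</p>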
<p>Yahoo wastes even their own resources. Their crawler persistently requests the shortened URL, which bounces with a redirect to the canonical URL. Here is an example from my raw logs: <code style="font-size:90%;"><br />
74.6.20.165 - - [05/Oct/2007:01:13:04 -0400] &quot;GET <b>/directory</b> HTTP/1.0&quot; 301 26 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)&quot;<br />
74.6.20.165 - - [05/Oct/2007:01:13:06 -0400] &quot;GET <b>/directory/</b> HTTP/1.0&quot; 200 8642 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)&quot;<br />
[I&#8217;ve replaced a rather long path with &#8220;directory&#8221;]</code><br />
If you persistently redirect Yahoo to the canonical URLs (with trailing slash), they&#8217;ll use your canonical URLs on the SERPs eventually (but their crawler still requests Yahoo-generated crap). Having many good inbound links as well as clean internal links &#8211;all with the trailing slash&#8211; helps too, but is not a guarantee for canonical URL normalization at Yahoo. </p>
<p>Here is an example. This URL responds with 200-OK, regardless of whether it&#8217;s requested with or without the canonical trailing slash:<br />
<a href="http://www.jlh-design.com/2007/06/im-confused/">http://www.jlh-design.com/2007/06/im-confused<b>/</b></a><br />
(That&#8217;s the default (mis)behavior of everybody&#8217;s darling <a href="http://codex.wordpress.org/WordPress" rel="tag">WordPress</a> with permalinks by the way. Here is some <a href="http://sebastians-pamphlets.com/how-to-seo-sanitize-a-wordpress-theme/">PHP canonicalization code</a> to fix this flaw.) All internal links use the canonical URL. I didn&#8217;t find a serious inbound link pointing to a truncated version of this  URL. Yahoo&#8217;s Site Explorer lists the URL without the trailing slash: <a href="http://rds.yahoo.com/_ylt=AvoWb6YsAONIB5JuHEpx7hbal8kF/SIG=11iu4jphv/**http%3A//www.jlh-design.com/2007/06/im-confused">[&#8230;]/im-confused</a>, and the same happens on Yahoo&#8217;s SERPs: <a href="http://rds.yahoo.com/_ylt=A0geu7H1igdHzvUAJrxXNyoA;_ylu=X3oDMTFicXZvZGs5BHNlYwNzcgRwb3MDMQRjb2xvA2FjMgR2dGlkA01BUDAxNF8xMDAEbANXUzE-/SIG=121md4r95/EXP=1191763061/**http%3a//www.jlh-design.com/2007/06/im-confused">[&#8230;]/im-confused</a>. Even when a server responds 200-OK to two different URLs, a serious search engine should normalize according to the internal links as well as an entry in the XML sitemap, therefore choose the URL with the trailing slash as canonical URL. </p>
<p>Fucking up links on search result pages is evil enough, although fortunately this crap doesn&#8217;t influence discovery crawling directly because those aren&#8217;t crawled by other search engines (but scraped or syndicated search results <b>are</b> crawlable). Actually, that&#8217;s not the whole horror story. Other Yahoo properties remove the trailing slashes from directory and home page links too (look at the &#8220;What Readers Viewed&#8221; column in your MBL stats for example), and some of those services provide crawlable pages carrying invalid links (pulled from the search index or screwed otherwise). That means other search engines pick those incomplete URLs from Yahoo&#8217;s pages (or other pages with links copied from Yahoo pages), crawl them, and end up with search indexes blown up with duplicate content. Maybe Yahoo does all that only to burn Google&#8217;s resources by keeping their canonicalization routines and duplicate content filters busy, but it&#8217;s not exactly gentlemanlike that such cat fights affect all Webmasters across the globe. Yahoo directly as well as indirectly burns our resources with unnecessary requests of screwed URLs, and we must implement <a href="http://sebastians-pamphlets.com/how-to-seo-sanitize-a-wordpress-theme/">sanitizing redirects for software like WordPress</a> &#8211;which doesn&#8217;t care enough about URL canonicalization&#8211;, just because Yahoo manipulates our URLs to peeve Google. Doh!</p>
<p>If somebody from Yahoo (or MSN, or any other site manipulating URLs this way) reads my rant, I highly recommend this quote from <a href="http://gbiv.com/protocols/uri/rfc/rfc3986.html#rfc.section.6">Tim Berners-Lee</a> (January 2005):<br />
<blockquote><a href="http://gbiv.com/protocols/uri/rfc/rfc3986.html#rfc.section.6.2.3"><b>Scheme-Based Normalization</b></a><br />
[&#8230;] the following [&#8230;] URIs are equivalent:<br />
   http://example.com<br />
   http://example.com/<br />
In general, a URI that uses the generic syntax for authority with an empty path should be normalized to a path of &#8220;/&#8221;.<br />
[&#8230;]<br />
<b>Normalization should not remove delimiters</b> [&#8220;/&#8221; or &#8220;?&#8221;] <b>when their associated component is empty</b> unless licensed to do so by the scheme specification. [emphasis mine]</p></blockquote>
<p>In my book sentences like &#8220;Note that the absolute path cannot be empty; if none is present in the original URI, it MUST be given as &#8216;/&#8217; [&#8230;]&#8221; in the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html">HTTP specification</a> as well as Section 3.3 of the <a href="http://www.w3.org/TR/urispace" title="Section 3.3">URI&#8217;s Path Segment specs</a> do not sound like a licence to screw URLs. Omitting the path segment delimiter &#8220;/&#8221; representing an empty last path segment <em>might</em> sound legal if the specs are interpreted without applying common sense, but <em>knowing</em> that Web servers can&#8217;t respond to requests of those incomplete URIs and <em>nevertheless</em> truncating trailing slashes is a brain dead approach (actually, such crap deserves a couple unprintable adjectives). </p>
<p>Frequently scanning the raw logs for 302/301 redirects is a good idea. Also, <b>implement documented canonicalization redirects when a piece of software responds to different versions of URLs</b>. It&#8217;s the Webmaster&#8217;s responsibility to ensure that each piece of content is available under one and only one URL. You cannot rely on any search engine&#8217;s URL canonicalization, because shit happens, even with <a href="http://googlewebmastercentral.blogspot.com/2007/09/google-duplicate-content-caused-by-url.html">highly sophisticated algos</a>:<br />
<blockquote>When search engines crawl identical content through varied URLs, there may be several negative effects:</p>
<p>1. Having multiple URLs can dilute link popularity. For example, in the <a href="http://googlewebmastercentral.blogspot.com/2007/09/google-duplicate-content-caused-by-url.html#BLOGGER_PHOTO_ID_5109718653929535858">diagram above</a> [example in Google&#8217;s blog post], rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs.</p>
<p>2. Search results may display user-unfriendly URLs [&#8230;]</p></blockquote>
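<p>Scanning raw logs for redirects doesn&#8217;t take much. The following Python sketch assumes the common &#8220;combined&#8221; log format shown in the Slurp example above; the regular expression and the function name are my assumptions, so adapt them to your server&#8217;s log format.</p>

```python
import re
from collections import Counter

# Picks the request path, status code and user agent out of a combined-format log line.
LOG_LINE = re.compile(r'"(?:GET|HEAD|POST) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

def count_redirects(lines):
    """Count 301/302 responses per (status, path, user agent)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group(2) in ("301", "302"):
            path, status, agent = match.groups()
            hits[(status, path, agent)] += 1
    return hits
```

<p>Feed it an open log file, then look at <code>hits.most_common()</code>. Paths which show up with a 301 or 302 although you never configured a redirect for them are candidates for truncated trailing slashes or other screwed URLs.</p>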
<h3 id="redirect-or-not">Redirect or not? A few use cases.</h3>
<p>Before I blather about the three redirect response codes you can choose from, I&#8217;d like to talk about a few situations where you shall not redirect, and cases where you probably don&#8217;t redirect but should do so. </p>
<p>Unfortunately, it&#8217;s a common practice to replace various sorts of clean links with redirects. Whilst legions of Webmasters don&#8217;t obfuscate their affiliate links, they hide their valuable outgoing links in fear of PageRank leaks and other myths, or react to search engine <acronym title="Fear, Uncertainty &amp; Doubt">FUD</acronym> with castrated links.</p>
<p>With very few exceptions, the <a href="http://www.smart-it-consulting.com/article.htm?node=155">A Element a.k.a. Hyperlink</a> is the best method to transport link juice (PageRank, topical relevancy, trust, reputation &#8230;) as well as human traffic. Don&#8217;t abuse my beloved A Element: <code><br />
&lt;a onclick=&quot;window.location = &apos;http://example.com/&apos;; return false;&quot; title=&quot;http://example.com&quot;&gt;bad example&lt;/a&gt;</code><br />
Such a &#8220;link&#8221; will transport some visitors, but does not work when JavaScript is disabled or the user agent is a Web robot. This &#8220;link&#8221; is not an iota better: <code><br />
&lt;a href=&quot;http://example.com/blocked-directory/redirect.php?url=http://another-example.com/&quot; title=&quot;Another bad example&quot;&gt;example&lt;/a&gt;</code></p>
<p>Simplicity pays. You don&#8217;t need the complexity of HREF values changed to ugly URLs of redirect scripts with parameters, located in an uncrawlable path, just because you don&#8217;t want search engines to count the links. Not to speak of cases where redirecting links is unfair or even risky, for example click tracking scripts which do a redirect.
<ul>
<li>If you need to track outgoing traffic, then by all means do it in a search engine friendly way with clean URLs which benefit the link destination and don&#8217;t do you any harm, <a href="http://sebastians-pamphlets.com/how-to-turn-click-tracking-into-miserable-failure/">here is a proven method</a>.</li>
<li>If you really can&#8217;t vouch for a link, for example because you link out to a so called bad neighborhood (whatever that means), or to a link broker, or to someone who paid for the link and Google can detect it or a competitor can turn you in, then add <a href="http://www.smart-it-consulting.com/article.htm?node=155&#038;page=90#a-rel">rel=&#8221;nofollow&#8221;</a> to the link. Yeah, <a href="http://sebastians-pamphlets.com/links/categories/?cat=nofollow">rel-nofollow</a> is <a href="http://sebastians-pamphlets.com/links/categories/?cat=crap">crap</a> &#8230; but it&#8217;s there, it works, we won&#8217;t get something better, and it&#8217;s less complex than redirects, so just apply it to your fishy links as well as to unmoderated user input.</li>
<li>If you decide that an outgoing link adds value for your visitors, and you personally think that the linked page is a great resource, then almost certainly search engines will endorse the link (regardless whether it shows a toolbar PR or not). There&#8217;s way too much FUD and crappy advice out there.</li>
<li>You really don&#8217;t lose PageRank when you link out. Honestly gained PageRank sticks to your pages. You only lower the amount of PageRank you can pass to your internal links a little. That&#8217;s not a bad thing, because linking out to great stuff can bring in more PageRank in the form of natural inbound links (there are other advantages too). Also, Google dislikes PageRank hoarding and the unnatural link patterns you create with practices like that.</li>
<li>Every redirect slows things down, and chances are that a user agent messes with the redirect, which can result in rendering nil, scrambled stuff, or something completely unrelated. I admit that&#8217;s not a very common problem, but it happens with some outdated though still used browsers. <b>Avoid redirects where you can.</b></li>
</ul>
<p>In some cases you should perform redirects for sheer search engine compliance, in other words selfish SEO purposes. For example don&#8217;t let search engines handle your affiliate links.
<ul>
<li>If you operate an affiliate program, then internally redirect all incoming affiliate links to <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/">consolidate your landing page URLs</a>. Although incoming affiliate links don&#8217;t bring much link juice, every little helps when it lands on a page which doesn&#8217;t credit search engine traffic to an affiliate.</li>
<li>Search engines are pretty smart when it comes to identifying affiliate links. (Thin) affiliate sites suffer from decreasing search engine traffic. Fortunately, the engines respect <a href="http://sebastians-pamphlets.com/links/categories/?cat=robotstxt">robots.txt</a>, that means they usually don&#8217;t follow links via blocked subdirectories. When you link to your merchants within the content, using URLs that don&#8217;t smell like affiliate links, it&#8217;s harder to detect the intention of those links algorithmically. Of course that doesn&#8217;t protect you from smart algos trained to spot other patterns, and this method will not pass reviews by humans, but it&#8217;s <a href="http://sebastians-pamphlets.com/google-recommends-screwing-affiliates-in-exchange-for-better-serp-positioning/" title="Read the comments too!">worth a try</a>.</li>
<li>If you&#8217;ve pages which change their contents often by featuring for example a product of the day, you might have a redirect candidate. Instead of duplicating a daily changing product page, you can do a dynamic soft redirect to the product pages. Whether a 302 or a 307 redirect is the best choice depends on the individual circumstances. However, you can promote the hell out of the redirecting page, so that it gains all the search engine love without passing on PageRank etc. to product pages which phase out after a while. (If the product page is hosted by the merchant you must use a 307 response code. Otherwise make sure the 302&#8217;ing URL is listed in your XML sitemap with a high priority. If you can, send a 302 with most HTTP/1.0 requests, and a 307 responding to HTTP/1.1 requests. See the 302/307 sections for more information.)</li>
<li>If an URL comes with a session-ID or another tracking variable in its query string, you must 301-redirect search engine crawlers to an URI without such randomly generated noise. There&#8217;s no need to redirect a human visitor, but <a href="http://www.smart-it-consulting.com/article.htm?node=148&#038;page=103">search engines hate tracking variables</a> so just don&#8217;t let them fetch such URLs. </li>
<li>There are other use cases involving creative redirects which I&#8217;m not willing to discuss here.</li>
</ul>
<p>Of course both lists above aren&#8217;t complete.</p>
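<p>The session-ID case from the second list can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the parameter names in <code>NOISE_PARAMS</code>, the user agent tokens in <code>CRAWLER_TOKENS</code> (a far from complete and easily spoofed list), and the function itself.</p>

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical names of session and tracking parameters; adapt to your site.
NOISE_PARAMS = {"sid", "phpsessid", "sessionid", "ref"}
# Simplistic crawler detection by user agent substring (an assumption, not a complete list).
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")

def clean_url_for_crawler(url, user_agent):
    """Return a 301 target without tracking noise for crawlers, else None."""
    if not any(token in user_agent.lower() for token in CRAWLER_TOKENS):
        return None                      # human visitors keep their session
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k.lower() not in NOISE_PARAMS]
    if len(kept) == len(pairs):
        return None                      # nothing to strip, no redirect
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))
```

<p>When the function returns a URL, answer the crawler&#8217;s request with a 301 pointing to it. When it returns nothing, serve the page as usual.</p>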
<h3 id="choosing-a-redirect-response-code">Choosing the best redirect response code (301, 302, or 307)</h3>
<p><img src="http://sebastians-pamphlets.com/img/posts/choosing-a-redirect-response-code-301-302-307.png" width="200" height="150" alt="Choosing a redirect response code" title="Which HTTP redirect response code fits my needs?" style="margin-left:3px;" align="right"  />I&#8217;m sick of articles like &#8220;search engine friendly 301 redirects&#8221; propagating that only permanent redirects work with search engines. That&#8217;s a lie. I read those misleading headlines daily on the webmaster boards, in my feed reader, at Sphinn, and elsewhere &#8230; and I&#8217;m not amused. Lemmings. Amateurish copycats. Clueless plagiarists. [Insert a few lines of somewhat offensive language and swearing <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> ]</p>
<p>Of course most redirects out there return the wrong response code. That&#8217;s because the default HTTP response code for all redirects is 302, and many code monkeys forget to send a status-line providing the <b>301 Moved Permanently</b> when an URL was actually moved or the requested URI is not the canonical URL. When a clueless coder or hosting service invokes a <code>Location: http://example.com/</code> header statement without a previous <code>HTTP/1.1 301 Moved Permanently</code> status-line, the redirect becomes a soft <code>302 Found</code>. That does not mean that 302 or 307 redirects aren&#8217;t search engine friendly at all. All HTTP redirects can be safely used with regard to search engines. The point is that one must choose the correct response code based on the actual circumstances and goals. Blindly 301&#8217;ing everything is counterproductive sometimes.</p>
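<p>The cure is to always state the response code explicitly. Here is a hedged Python sketch which builds CGI-style headers like the Perl example above does (the <code>Status:</code> line is the CGI convention). The function name and its parameters are made up, and the 302-for-HTTP/1.0 versus 307-for-HTTP/1.1 split is the rule of thumb from the use cases above, not a law.</p>

```python
REASONS = {301: "Moved Permanently", 302: "Found", 307: "Temporary Redirect"}

def redirect_response(location, permanent, protocol="HTTP/1.1"):
    """Build CGI-style redirect headers with an explicit status."""
    if permanent:
        code = 301      # moved for good, or the request was not the canonical URL
    elif protocol == "HTTP/1.0":
        code = 302      # 307 was introduced with HTTP/1.1, so older clients get 302
    else:
        code = 307
    return "Status: %d %s\r\nLocation: %s\r\n\r\n" % (code, REASONS[code], location)
```

<p>Because the caller has to pass <code>permanent</code> with every call, nobody ends up with an accidental 302 just because the status was forgotten.</p>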
<h4 id="301-moved-permanently">301 - Moved Permanently</h4>
<p><img src="http://sebastians-pamphlets.com/img/posts/301-moved-permanently.png" width="200" height="101" alt="301 Moved Permanently" title="301 Moved Permanently" style="margin-left:3px;" align="right"  />The message of a 301 response code to the requestor is: &#8220;The requested URI has vanished. It&#8217;s gone forever and perhaps it never existed. I will <b>never</b> supply any contents under this URI (again). Request the URL given in location, and replace the outdated or wrong URL in your bookmarks/records by the new one for future requests. Don&#8217;t bother me again. Farewell.&#8221;</p>
<p>Let&#8217;s start with the definition of a 301 redirect quoted from the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.2">HTTP/1.1 specifications</a>:<br />
<blockquote>The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs [<a href="#301-uris">(1)</a>]. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.</p>
<p>The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s). [&#8230;]</p></blockquote>
<p>Read a polite &#8220;SHOULD&#8221; as &#8220;must&#8221;.</p>
<p><span id="301-uris">(1)</span> Although technically you <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.14">could</a> provide more than one location, you must not do that because it irritates too many user agents, search engine crawlers included. </p>
<p>Make use of the 301 redirect when a requested Web resource was moved to another location, or when a user agent requests an URI which is definitely wrong and you&#8217;re able to tell the correct URI with no doubt. For URL canonicalization purposes (<a href="http://sebastians-pamphlets.com/how-to-seo-sanitize-a-wordpress-theme/">more</a> <a href="http://www.mattcutts.com/blog/seo-advice-url-canonicalization/">info</a> <a href="http://www.mattcutts.com/blog/canonicalization-update/">here</a>) the 301 redirect is your one and only friend.</p>
<p>You must not recycle any 301&#8217;ing URLs. Once a URL responds with 301 you must stick with it; you can&#8217;t reuse this URL for other purposes next year or so. </p>
<p>Also, you must maintain the 301 response and a location corresponding to the redirecting URL forever. That does not mean that the location can&#8217;t be changed. Say you&#8217;ve moved a contact page <code>/contact.html</code> to a CMS where it resides under <code>/cms/contact.php</code>. If a user agent requests <code>/contact.html</code> it does a 301 redirect pointing to <code>/cms/contact.php</code>. Two years later you change your software again, and the contact page moves to <code>/blog/contact/</code>. In this case you must change the initial redirect, and create a new one:<br />
<code>/contact.html</code> 301-redirects to <code>/blog/contact/</code>, and<br />
<code>/cms/contact.php</code> 301-redirects to <code>/blog/contact/</code>.<br />
If you keep the initial redirect <code>/contact.html</code> to <code>/cms/contact.php</code>, and redirect <code>/cms/contact.php</code> to <code>/blog/contact/</code>, you create a <a href="http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/">redirect chain which can deindex your content at search engines</a>. Well, two redirects before a crawler reaches the final URL shouldn&#8217;t be a big deal, but add a canonicalization redirect fixing a www vs. non-www issue to the chain, and imagine a crawler comes from a directory or links list which counts clicks with a redirect script: now you&#8217;ve four redirects in a row. That&#8217;s too many; most probably no search engine will index such an unreliable Web resource.</p>
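<p>If your redirects live in a mapping table, you can flatten the chains mechanically whenever a target moves again. A minimal Python sketch, assuming a plain dictionary of old path to new path (the function name is made up):</p>

```python
def flatten_redirects(redirects):
    """Point every old URL directly at its final target, so no chains remain."""
    flat = {}
    for source in redirects:
        target = redirects[source]
        seen = {source}
        while target in redirects:        # the target itself redirects: follow it
            if target in seen:
                raise ValueError("redirect loop at " + target)
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat
```

<p>Applied to the contact page example, both <code>/contact.html</code> and <code>/cms/contact.php</code> end up pointing straight to <code>/blog/contact/</code>. The loop check is there because a redirect loop is even worse than a chain.</p>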
<p>301 redirects transfer search engine love like PageRank gathered by the redirecting URL to the new location, but the search engines keep the old URL in their indexes, and revisit it every now and then to check whether the 301 redirect is stable or not. If the redirect is gone on the next crawl, the new URL loses the reputation earned from the redirect&#8217;s inbound links. It&#8217;s impossible to get all inbound links changed, hence don&#8217;t delete redirects after a move.</p>
<p>It&#8217;s a good idea to check your 404 logs weekly or so, because search engine crawlers pick up malformed links from URL drops and such. Even when the link is invalid, for example because a crappy forum software has shortened the URL, it&#8217;s an asset you should not waste with a 404 or even 410 response. Find the best matching existing URL and do a 301 redirect.</p>
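<p>Finding the best matching existing URL for a shortened or otherwise malformed request can be partly automated. The sketch below uses fuzzy string matching from Python&#8217;s standard <code>difflib</code> module. That technique is my pick for illustration, the function name and the cutoff value are assumptions, and the suggestions deserve a human review before you turn them into 301 redirects.</p>

```python
import difflib

def best_redirect_target(requested_path, existing_paths, cutoff=0.8):
    """Return the closest existing path for a malformed request, or None."""
    matches = difflib.get_close_matches(requested_path, existing_paths, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

<p>A higher cutoff yields fewer but safer suggestions. Requests which match nothing closely enough still deserve the 404.</p>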
<p>Here is what Google says about 301 redirects:<br />
<blockquote>
<p>[<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40151">Source</a>] 301 (Moved permanently) [&#8230;] You should use this code to let Googlebot know that a page or site has permanently moved to a new location. [&#8230;]</p>
<p>[<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=66359">Source</a> &#8230;] If you&#8217;ve restructured your site, use 301 redirects (&#8220;RedirectPermanent&#8221;) in your .htaccess file to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.) [&#8230;]</p>
<p>[<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=34464">Source</a> &#8230;] If your old URLs redirect to your new site using HTTP 301 (permanent) redirects, our crawler will discover the new URLs. [&#8230;] Google listings are based in part on our ability to find you from links on other sites. To preserve your rank, you&#8217;ll want to tell others who link to you of your change of address. [&#8230;]</p>
<p>[<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=34481">Source</a> &#8230;] If your site [or page] is appearing as two different listings in our search results, we suggest consolidating these listings so we can more accurately determine your site&#8217;s [page&#8217;s] PageRank. The easiest way to do so [on site level] is to <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=44232">set the preferred domain using our webmaster tools</a>. You can also redirect one version [page] to the other [canonical URL] using a <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40151">301 redirect</a>. This should resolve the situation after our crawler discovers the change. [&#8230;]</p>
</blockquote>
<p>That&#8217;s exactly what the HTTP standard wants a search engine to do. Yahoo handles 301 redirects a little differently:<br />
<blockquote>
<p>[<a href="http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-11.html">Source</a> &#8230;] When one web page redirects to another web page, Yahoo! Web Search sometimes indexes the page content under the URL of the entry or &#8220;source&#8221; page, and sometimes index it under the URL of the final, destination, or &#8220;target&#8221; page. [&#8230;]</p>
<p>When a page in one domain redirects to a page in another domain, Yahoo! records the &#8220;target&#8221; URL. [&#8230;]</p>
<p>When a top-level page [http://example.com/] in a domain presents a permanent redirect to a page deep within the same domain, Yahoo! indexes the &#8220;source&#8221; URL. [&#8230;]</p>
<p>When a page deep within a domain presents a permanent redirect to a page deep within the same domain, Yahoo! indexes the &#8220;target&#8221; URL. [&#8230;]</p>
<p>Because of mapping algorithms directing content extraction, Yahoo! Web Search is not always able to discard URLs that have been seen as 301s, so web servers might still see crawler traffic to the pages that have been permanently redirected. [&#8230;]</p>
</blockquote>
<p>As for the non-standard procedure to handle redirecting root index pages, that&#8217;s not a big deal, because in most cases a site owner promotes the top level page anyway. Actually, that&#8217;s a smart way to &#8220;break the rules&#8221; for the better. The far too frequent requests for permanently redirecting pages are more annoying.</p>
<h4 id="moving-sites-301">Moving sites with 301 redirects</h4>
<p>When you restructure a site, consolidate sites or separate sections, move to another domain, flee from a free host, or do other structural changes, then in theory you can install page-by-page 301 redirects and you&#8217;re done. Actually, that works but comes with disadvantages like a total loss of all search engine traffic for a while. The larger the site, the longer the while. With a large site highly dependent on SERP referrers this procedure can be the first phase of a filing for bankruptcy plan, because <em>no</em> search engine sends (much) traffic during the move.</p>
<p>Let&#8217;s look at the process from a search engine&#8217;s perspective. The crawling of old.com all of a sudden bounces at 301 redirects to new.com. None of the redirect targets is known to the search engine. The crawlers report back redirect responses and the new URLs as well. The indexers spotting the redirects block the redirecting URLs for the query engine, but can&#8217;t pass the properties (PageRank, contextual signals and so on) of the redirecting resources to the new URLs, because those aren&#8217;t crawled yet. </p>
<p>The crawl scheduler initiates the handshake with the newly discovered server to estimate its robustness, and most probably makes a conservative guess of the crawl frequency this server can sustain. The queue of uncrawled URLs belonging to the new server grows way faster than the crawlers actually deliver the first contents fetched from the new server. </p>
<p>Each and every URL fetched from the old server vanishes from the SERPs in no time, whilst the new URLs aren&#8217;t crawled yet, or are still waiting for an idle indexer able to assign them the properties of the old URLs, doing heuristic checks on the stored contents from both URLs and whatnot. </p>
<p>Slowly, sometimes weeks after the beginning of the move, the first URLs from the new server populate the SERPs. They don&#8217;t rank very well, because the search engine has not yet discovered the new site&#8217;s structure and linkage completely, so that a couple of ranking factors stay temporarily unconsidered. Some of the new URLs may appear as URL-only listings, solely indexed based on off-page factors, hence lacking the ability to trigger search query relevance for their contents. </p>
<p>Many of the new URLs can&#8217;t regain their former PageRank in the first reindexing cycle, because without a complete survey of the &#8220;new&#8221; site&#8217;s linkage there&#8217;s only the PageRank from external inbound links passed by the redirects available (internal links no longer count for PageRank when the search engine discovers that the source of internally distributed PageRank does a redirect), so that they land in a secondary index. </p>
<p>Next, the suddenly lower PageRank results in a lower crawling frequency for the URLs in question. Also, the process removing redirecting URLs still runs way faster than the reindexing of moved contents from the new server. The more URLs are involved in a move, the longer the reindexing and reranking lasts. Replace Google&#8217;s very own PageRank with any term and you&#8217;ve a somewhat usable description of a site move handled by Yahoo, MSN, or Ask. There are only so many ways to handle such a challenge.</p>
<p>That&#8217;s a horror scenario, isn&#8217;t it? Well, at Google the recently changed infrastructure has greatly improved this process, and other search engines evolve too, but moves as well as significant structural changes will always result in periods of decreased SERP referrers, or even no search engine traffic at all.</p>
<p>Does that mean that big moves are too risky, or even not doable? Not at all. You just need deep pockets. If you lack a budget to feed the site with PPC or other bought traffic to compensate for an estimated loss of organic traffic lasting at least a few weeks, but perhaps months, then don&#8217;t move. And when you move, then set up a professionally managed project, and hire experts for this task.</p>
<p>Here are some guidelines. I don&#8217;t provide a timeline, because that&#8217;s impossible without detailed knowledge of the individual circumstances. Adapt the procedure to fit your needs, nothing&#8217;s set in stone.
<ul>
<li>Set up the site on the new Web server (new.com). In robots.txt block everything except a temporary page telling that this server is the new home of your site. Link to this page to get search engines familiar with the new server, but make sure there are no links to blocked content yet.</li>
<li>Create mapping tables &#8220;old URL to new URL&#8221; (respectively algos) to prepare the 301 redirects etcetera. You could consolidate multiple pages under one redirect target and so on, but you better wait with changes like that. Do them after the move. When you keep the old site&#8217;s structure on the new server, you make the job easier for search engines.</li>
<li>If you plan to do structural changes after the move, then develop the redirects in a way that you can easily change the redirect targets on the old site, and prepare the internal redirects on the new site as well. In any case, your redirect routines must be able to redirect or not depending on parameters like site area, user agent / requestor IP and such stuff, and you need a flexible control panel as well as URL specific crawler auditing on both servers.</li>
<li>On old.com develop a server-side procedure which can add links to the new location on every page on your old domain. Identify your URLs with the lowest crawling frequency. Work out a time table for the move which considers page importance (with regard to search engine traffic), and crawl frequency.</li>
<li>Remove the <code>Disallow:</code> statements in the new server&#8217;s robots.txt. Create one or more XML sitemap(s) for the new server and make sure that you set crawl-priority and change-frequency accurately, and that last-modified gets populated with the scheduled start of the move (IOW the day the first search engine crawler can access the sitemap). Feed the engines with sitemap files listing the important URLs first. Add sitemap-autodiscovery statements to robots.txt, and manually submit the sitemaps to Google and Yahoo.</li>
<li>Fire up the scripts creating visible &#8220;this page will move to [new location] soon&#8221; links on the old pages. Monitor the crawlers on the new server. Don&#8217;t worry about duplicate content issues in this phase; &#8220;move&#8221; in the anchor text is a magic word. Do nothing until the crawlers have fetched at least the first and second link level on the new server, as well as most of the important pages.</li>
<li>Briefly explain your redirect strategy in robots.txt comments on both servers. If you can, add corresponding HTML comments to the HEAD section of all pages on the old server. You will cloak for a while, and things like that can help you pass reviews by humans who might get an alert from an algo or a spam report. It&#8217;s more or less impossible to redirect human traffic in chunks, because that results in annoying surfing experiences, inconsistent database updates, and other disadvantages. Search engines aren&#8217;t cruel and understand that.</li>
<li>301 redirect all human traffic to the new server. Serve search engines the first chunk of redirecting pages. Start with a small chunk of not more than 1,000 pages or so, and bundle related pages to preserve most of the internal links within each chunk.</li>
<li>Closely monitor the crawling and indexing process of the first chunk, and don&#8217;t release the next one before it has (nearly) finished. You will probably need to handle each crawler individually.</li>
<li>While you release chunk after chunk of redirects to the engines, adjusting the intervals based on your experience, contact all sites linking to you and ask for URL updates (remember to delay these requests for inbound links pointing to URLs you&#8217;ll change after the move for other reasons). It helps when you offer an incentive; best let your marketing dept. handle this task (having a valid reason to get in touch with those Webmasters might open some opportunities).</li>
<li>Support the discovery crawling based on redirects and updated inbound links by releasing more and more XML sitemaps on the new server. Enabling sitemap-based crawling should roughly correlate with your release of redirect chunks. Both discovery crawling and submission-based crawling share the bandwidth, or rather the amount of daily fetches the crawling engine has de