<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Sebastian's Pamphlets &#187; 404grabber</title>
	<link>http://sebastians-pamphlets.com</link>
	<description>If you've read my articles somewhere on the Internet, expect something different here.</description>
	<pubDate>Mon, 30 Jun 2008 20:12:40 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Upgrading from IIS/ASP to Apache/PHP</title>
		<link>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/</link>
		<comments>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/#comments</comments>
		<pubDate>Tue, 11 Dec 2007 20:47:25 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[404grabber]]></category>

		<category><![CDATA[Duplicate Content]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[Copy+Paste-Penalties]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[IIS]]></category>

		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/</guid>
		<description><![CDATA[Once you&#8217;re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won&#8217;t work, and you can&#8217;t run MS-Access &#8220;databases&#8221; and such stuff under Apache. 
Here is my idea of a smooth migration from IIS/ASP to [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/upgrade-from-iis-asp-to-apache-php.png" width="250" height="227" align="right" style="margin-left:4px;" alt="Upgrade from Windows/IIS/ASP to Unix/Apache/PHP" title="Get the most out of your Web site - throw away Windows/IIS/ASP!"  />Once you&#8217;re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won&#8217;t work, and you can&#8217;t run MS-Access &#8220;databases&#8221; and such stuff under Apache. </p>
<p>Here is my idea of a smooth migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster&#8217;s portfolio and start over.</p>
<p>(Recently I got a tiny IIS/ASP site about <a href="http://link-condom.com/">uses &amp; abuses of link condoms</a> and moved it to an Apache server. I&#8217;m well known for brutal IIS rants, but so far I didn&#8217;t discuss a way out of such a dilemma, so I thought blogging this move could be a good idea.) </p>
<p>I don&#8217;t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer&#8217;s article <a href="http://dev.mysql.com/tech-resources/articles/migrating-from-microsoft.html">Migrating from Microsoft Access/MS-SQL to MySQL</a>, and try tools like <a href="http://asp2php.naken.cc/docs.php">ASP to PHP</a>. (With my tiny <a href="http://link-condom.com/about.asp">link condom</a> site I overwrote the ASP code with PHP statements in my primitive text editor.)</p>
<p><b>From an SEO perspective such an upgrade comes with pitfalls:</b>
<ul>
<li>Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.</li>
<li>Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.</li>
<li>Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.</li>
<li>IIS-URIs are not case sensitive, that means that /Default.asp will 404 on Apache when the filename is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.</li>
<li>Most probably search engines have URL variants in their indexes, so we want to adapt their URL canonicalization, at least where possible.</li>
<li>HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.</li>
<li>If the navigation makes use of relative links, we need to convert those to absolute URLs.</li>
<li>Error handling isn&#8217;t convertible. Improper error handling can cause decreasing search engine traffic.</li>
</ul>
<h3>Running /default.asp, /home.asp etc. as PHP scripts</h3>
<p>When you upload an .asp file to an Apache Web server, most user agents can&#8217;t handle it. Browsers treat them as unknown file types and force downloads instead of rendering them. Next those files aren&#8217;t parsed for PHP statements, provided you&#8217;ve rewritten the ASP code already.</p>
<p>To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root: <code><b><br />
AddType text/html .asp<br />
AddHandler application/x-httpd-php .asp </b></code><br />
The first line says that .asp files shall be treated as HTML documents, and should force the server to send a <code>Content-Type: text/html</code> HTTP header. The second line tells Apache that it must parse .asp files for PHP code. </p>
<p>Just in case the AddType statement above doesn&#8217;t produce a <code>Content-Type: text/html</code> header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you&#8217;ve mod_headers available, you can accomplish that with this .htaccess code: <code><b><br />
&lt;IfModule mod_headers.c&gt;<br />
SetEnvIf Request_URI \.asp is_asp=is_asp<br />
Header set &quot;Content-type&quot; &quot;text/html&quot; env=is_asp<br />
Header set imagetoolbar &quot;no&quot;<br />
&lt;/IfModule&gt; </b></code><br />
(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)<br />
If for some reason mod_headers doesn&#8217;t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top: <code><b><br />
@header(&quot;Content-type: text/html&quot;, TRUE);  </b></code><br />
Instead of &#8220;text/html&#8221; alone, you can define the character set too: &#8220;text/html; charset=UTF-8&#8221;</p>
<h3>Sanitizing the home page URL by eliminating &#8220;default.asp&#8221;</h3>
<p>Instead of slowing down Apache by defining just another default document name (<code>DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp</code>), we get rid of &#8220;/default.asp&#8221; with this &#8220;/index.php&#8221; script: <code><b><br />
&lt;?php<br />
@require(&quot;default.asp&quot;);<br />
?&gt; </b></code><br />
Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.</p>
<p>Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess: <code><b><br />
Redirect 301 /default.asp http://example.com/<br />
Redirect 301 /Default.asp http://example.com/ </b></code></p>
<h3>Converting the ASP code for server name canonicalization</h3>
<p>If you find ASP canonicalization routines like <code><b><br />
&lt;%@ Language=VBScript %&gt;<br />
&lt;%<br />
if strcomp(Request.ServerVariables(&quot;SERVER_NAME&quot;), &quot;www.example.com&quot;, vbCompareText) = 0 then<br />
   Response.Clear<br />
   Response.Status = &quot;301 Moved Permanently&quot;<br />
   strNewUrl = Request.ServerVariables(&quot;URL&quot;)<br />
   if instr(1,strNewUrl, &quot;/default.asp&quot;, vbCompareText) &gt; 0 then<br />
     strNewUrl = replace(strNewUrl, &quot;/Default.asp&quot;, &quot;/&quot;)<br />
     strNewUrl = replace(strNewUrl, &quot;/default.asp&quot;, &quot;/&quot;)<br />
   end if<br />
   if Request.QueryString &lt;&gt; &quot;&quot; then<br />
       Response.AddHeader &quot;Location&quot;,&quot;http://example.com&quot; &amp; strNewUrl &amp; &quot;?&quot; &amp; Request.QueryString<br />
   else<br />
       Response.AddHeader &quot;Location&quot;,&quot;http://example.com&quot; &amp; strNewUrl<br />
   end if<br />
   Response.End<br />
end if<br />
%&gt;  </b></code><br />
(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too: <code><b><br />
RewriteEngine On<br />
RewriteCond %{HTTP_HOST} !^example\.com [NC]<br />
RewriteRule (.*) http://example.com/$1 [R=301,L] </b></code><br />
(you need mod_rewrite, that&#8217;s usually enabled with the default configuration of Apache Web servers). </p>
<h3>Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value</h3>
<p>Probably a M$ developer didn&#8217;t read more than the scheme and server name chapter of the URL/URI standards, at least I&#8217;ve no better explanation for the fact that these clowns made the path and query string segment of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)</p>
<p>Just because &#8211;contrary to Web standards&#8211; M$ finds it funny to serve the same contents on request of /Home.asp as well as /home.ASP, such crap doesn&#8217;t fly on the World Wide Web. Search engines &#8211;and other Web services which store URLs&#8211; treat them as different URLs, and consider everything except one version duplicate content.</p>
<p>Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like &#8220;/Script.asp&#8221;, although the file itself is stored with an all-lowercase name, and the FTP client uploads &#8220;/script.asp&#8221; to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won&#8217;t get 404-Not found responses. </p>
<p>Don&#8217;t misunderstand me, you&#8217;re free to camel-case file names for improved readability, but then make sure that the file system&#8217;s notation matches the URIs in HREF/SRC values. (Of course hyphened file names like &#8220;buy-cheap-viagra.asp&#8221; top the CamelCased version &#8220;BuyCheapViagra.asp&#8221; when it comes to search engine rankings, but don&#8217;t freak out about keywords in URLs, that&#8217;s ranking factor #202 or so.)</p>
<p>Technically speaking, converting all file names, variable names and values to all-lowercase is the simplest solution. This way it&#8217;s quite easy to 301-redirect all invalid requests to the canonical URLs. </p>
<p>However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (<a href="http://sphinn.com/story/16345">MSN Live Search</a> for example doesn&#8217;t follow permanent redirects and doesn&#8217;t pass the reputation earned by the old URL over to the new URL). So if you&#8217;ve good SERP positions for &#8220;misspelled&#8221; URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site&#8217;s server stats to identify all URLs you don&#8217;t want to redirect. By the way, the link reports in <a href="http://www.google.com/webmasters/tools/">Google&#8217;s Webmaster Console</a> and <a href="http://siteexplorer.search.yahoo.com/">Yahoo&#8217;s Site Explorer</a> reveal invalid URLs with (internal as well as external) inbound links too.</p>
<p>Whatever strategy fits your needs best, you&#8217;ve to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive: <code><b><br />
ErrorDocument 404 /404handler.php </b></code><br />
That&#8217;s safe with static URLs without parameters and should work with dynamic URIs too. When you &#8211;in some cases&#8211; deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier: <code><b><br />
&lt;IfModule mod_rewrite.c&gt;<br />
RewriteEngine On<br />
RewriteBase /<br />
RewriteCond %{REQUEST_FILENAME} !-f<br />
RewriteCond %{REQUEST_FILENAME} !-d<br />
RewriteRule . /404handler.php [L]<br />
&lt;/IfModule&gt; </b></code><br />
In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn&#8217;t exist.</p>
<p>And here is the PHP script /404handler.php:<br />
<code id="php-code-404-handler"><b><br />
&lt;?php // 404handler.php<br />
      // called from .htaccess if the requested path doesn&#8217;t exist<br />
&nbsp;<br />
$thisFileName    = &quot;404handler.php&quot;;  // change this<br />
$canonicalScheme = &quot;http://&quot;;<br />
$canonicalServer = &quot;example.com&quot;; // change this<br />
$errorPageUri    = &quot;/error.asp&quot;;  // change this<br />
$documentRoot    = $_SERVER[&quot;DOCUMENT_ROOT&quot;];<br />
$requestUri      = $_SERVER[&quot;REQUEST_URI&quot;];<br />
$canonicalUri    = &quot;&quot;;<br />
$requestedUrl    = $canonicalScheme .$canonicalServer .$requestUri;<br />
$canonicalUrl    = &quot;&quot;;<br />
$url             = parse_url($requestedUrl);<br />
$requestPath     = $url[&quot;path&quot;];<br />
$includeScript   = &quot;&quot;;<br />
$queryString     = isset($url[&quot;query&quot;]) ? $url[&quot;query&quot;] : &quot;&quot;;  // empty when the request has no query string<br />
&nbsp;<br />
// keep misspelled URIs with nice search engine rankings<br />
if (&quot;$requestPath&quot; == &quot;/Sample.asp&quot;) {  // change this<br />
   $includeScript = $documentRoot .&quot;/sample.asp&quot;;  // change this<br />
}<br />
// &#8230;<br />
if (!empty($includeScript)) {<br />
   @header(&quot;HTTP/1.1 200 OK&quot;, TRUE, 200);<br />
   @include($includeScript);<br />
   exit;<br />
}<br />
&nbsp;<br />
// if the lowercase version exists, redirect to it<br />
$lcPath = strtolower($url[&quot;path&quot;]);<br />
$lcFile = $documentRoot .$lcPath;<br />
if (file_exists($lcFile) &amp;&amp; !stristr($requestUri,$thisFileName)) {<br />
    $canonicalUrl = $canonicalScheme .$canonicalServer .$lcPath;<br />
    if ($queryString) {<br />
        $canonicalUrl .= &quot;?&quot; .$queryString;<br />
    }<br />
    if (!empty($url[&quot;fragment&quot;])) {<br />
        $canonicalUrl .= &quot;#&quot; .$url[&quot;fragment&quot;];<br />
    }<br />
}<br />
if (!empty($canonicalUrl)) {<br />
    @header(&quot;HTTP/1.1 301 Moved Permanently&quot;, TRUE, 301);<br />
    @header(&quot;Location: $canonicalUrl&quot;);<br />
    exit;<br />
}<br />
&nbsp;<br />
// serve the 404 error page<br />
@header(&quot;HTTP/1.1 404 Not found&quot;, TRUE, 404);<br />
@include($documentRoot .$errorPageUri);<br />
exit;<br />
?&gt;   </b></code><br />
(Edit the values in all lines marked with &#8220;// change this&#8221;.)</p>
<p>This script doesn&#8217;t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you&#8217;ve more than a dozen or so. </p>
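<p>As a starting point for such a site-specific routine, here is a minimal sketch of query string canonicalization. It is not part of /404handler.php: the function name <code>canonicalizeQueryString()</code> and its second parameter are made up for illustration, and the code assumes PHP 7 or later. It lowercases all variable names, and lowercases values only for the variables you list, because values like search phrases or tokens may well be case-sensitive.</p>

```php
<?php
// Hypothetical helper, not part of the original 404handler.php.
// Lowercases all variable names; lowercases values only for the
// variable names listed in $lowercaseValuesFor.
function canonicalizeQueryString(string $query, array $lowercaseValuesFor = []): string {
    $pairs = [];
    foreach (explode("&", $query) as $pair) {
        if ($pair === "") {
            continue; // skip empty segments such as "a=1&&b=2"
        }
        $parts = explode("=", $pair, 2);
        $name  = strtolower($parts[0]);
        if (count($parts) === 1) {
            $pairs[] = $name; // variable without a value, e.g. "?print"
            continue;
        }
        $value = $parts[1];
        if (in_array($name, $lowercaseValuesFor, true)) {
            $value = strtolower($value);
        }
        $pairs[] = $name . "=" . $value;
    }
    return implode("&", $pairs);
}
```

<p>With this sketch, <code>canonicalizeQueryString(&quot;ID=Value&amp;Page=2&quot;, [&quot;id&quot;])</code> returns <code>id=value&amp;page=2</code>. Compare the result with the requested query string and 301-redirect only when they differ, otherwise you build a redirect loop.</p>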
<p>Let&#8217;s see what the /404handler.php script does with requests for non-existing files. </p>
<p>First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don&#8217;t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn&#8217;t rank for anything but illustrates the functionality): both <a href="http://link-condom.com/sample.asp">/sample.asp</a> and <a href="http://link-condom.com/Sample.asp">/Sample.asp</a> deliver the same content, although there&#8217;s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordingly.</p>
<p>Next we look up the all-lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: <a href="http://link-condom.com/About.asp">/About.asp</a> 301-redirects to <a href="http://link-condom.com/about.asp">/about.asp</a>, which is the file that exists.</p>
<p>Finally, if everything we tried to find a suitable URI for the actual request failed, we send the client a 404 error code and output the error page. Example: <a href="http://link-condom.com/gimme404.asp" rel="nofollow crap">/gimme404.asp</a> doesn&#8217;t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp, but <a href="http://link-condom.com/error.asp">/error.asp</a> directly requested responds with a 200-OK.</p>
<p>You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project&#8217;s needs. </p>
<h3>Tweaking code for future maintenance</h3>
<p>Legacy code comes with repetition, redundancy and duplication caused by developers who love copy+paste or copy+paste+modify, or by Web design software that generates static files from templates. Even when you&#8217;re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, which gives you the opportunity to encapsulate all templated page areas. </p>
<p>Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <code>&lt;?php @include("header.php"); ?&gt;</code>, <code>&lt;?php @include("sidebar.php"); ?&gt;</code>, <code>&lt;?php @include("footer.php"); ?&gt;</code> and so on. Note that PHP parses an included file in HTML mode from its first line, so plain HTML in an include file is output as-is, and any PHP statements in it must be wrapped in <code>&lt;?php ?&gt;</code> tags. Also, leading spaces, empty lines and such, which don&#8217;t hurt in HTML, can result in errors with PHP statements like header(), because those fail when the server has already sent anything to the user agent (even a single space, new line or tab is too much).</p>
<p>It&#8217;s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<code>&lt;?php @include("top.php"); ?&gt;</code> [&#8230;] <code>&lt;?php @include("bottom.php"); ?&gt;</code>) to all scripts. Tomorrow you&#8217;ll write a generic routine that you must have in all scripts, and you&#8217;ll happily do that in top.php. The day after tomorrow you&#8217;ll paste the GoogleAnalytics tracking code into bottom.php. With complex sites you need more hooks. </p>
<h3>Using absolute URLs on different systems</h3>
<p>Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn&#8217;t work with absolute URLs. Crap. Of course it works. The first statement in top.php is <code><b><br />
@require($_SERVER[&quot;SERVER_NAME&quot;] .&quot;.php&quot;); </b></code><br />
This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you&#8217;ve a &#8220;dev.example.com.php&#8221; include file, on the production system example.com the file name resolves to &#8220;example.com.php&#8221;: <code><b><br />
&lt;?php<br />
$baseUrl = &quot;http://example.com&quot;;<br />
?&gt;  </b></code><br />
Then the menu in sidebar.php looks like: <code><b><br />
&lt;?php<br />
$classVMenu = &quot;vmenu&quot;;<br />
print &quot;<br />
&lt;img src=\&quot;$baseUrl/vmenuheader.png\&quot; width=\&quot;128\&quot; height=\&quot;16\&quot; alt=\&quot;MENU\&quot; /&gt;<br />
&lt;ul&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/\&quot;&gt;Home&lt;/a&gt;&lt;/li&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/contact.asp\&quot;&gt;Contact&lt;/a&gt;&lt;/li&gt;<br />
&lt;li&gt;&lt;a class=\&quot;$classVMenu\&quot; href=\&quot;$baseUrl/sitemap.asp\&quot;&gt;Sitemap&lt;/a&gt;&lt;/li&gt;<br />
&#8230;<br />
&lt;/ul&gt;<br />
&quot;;<br />
?&gt; </b></code><br />
Mixing X/HTML with server-side scripting languages is error-prone and makes maintenance a nightmare. Don&#8217;t make the same mistake as WordPress. Avoid crap like that: <code><br />
&lt;li&gt;&lt;a class=&quot;&lt;?php print $classVMenu; ?&gt;&quot; href=&quot;&lt;?php print $baseUrl; ?&gt;/contact.asp&quot;&gt;&lt;/a&gt;&lt;/li&gt; </code></p>
<h3>Error handling</h3>
<p>I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root&#8217;s .htaccess file: <code><b><br />
ErrorDocument 401 /get-the-fuck-outta-here.asp<br />
ErrorDocument 403 /get-the-fudge-outta-here.asp<br />
ErrorDocument 404 /404handler.php<br />
ErrorDocument 410 /410-gone-forever.asp<br />
ErrorDocument 503 /503-down-for-maintenance.asp<br />
# &#8230;<br />
Options -Indexes </b></code><br />
Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script: <code><b><br />
ErrorDocument 401 /error.php?errno=401<br />
ErrorDocument 403 /error.php?errno=403<br />
ErrorDocument 404 /404handler.php<br />
ErrorDocument 410 /error.php?errno=410<br />
ErrorDocument 503 /error.php?errno=503<br />
# &#8230;<br />
Options -Indexes </b></code><br />
Note that relative URLs in pages or scripts called by ErrorDocument directives don&#8217;t work. <b>Don&#8217;t use absolute URLs in the ErrorDocument directives themselves, because this way you get 302 response codes for 404 errors and crap like that.</b> If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that&#8217;s bad enough.) For more information please read my pamphlet <a href="http://sebastians-pamphlets.com/why-proper-error-handling-is-important/">Why error handling is important</a>. </p>
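<p>The single-script variant could be sketched like this. Take it as an illustration under assumptions, not as the code running on my sites: the file name /error.php and the <code>errno</code> parameter come from the directives above, whereas the function name <code>errorPageFor()</code> and the texts are made up, and the code assumes PHP 7 or later. Since a local ErrorDocument target keeps the original response code, the sketch only picks the text to display.</p>

```php
<?php
// error.php helper: hypothetical sketch, not the original site's code.

// Map the errno parameter from the ErrorDocument directives to a title
// and a short explanation. Unknown or missing codes are treated as 404.
function errorPageFor($errno): array {
    $pages = [
        401 => ["Authorization required", "You need to log in to view this page."],
        403 => ["Forbidden", "You are not allowed to view this page."],
        404 => ["Not found", "The page you requested does not exist."],
        410 => ["Gone", "This page has been removed permanently."],
        503 => ["Service unavailable", "The site is down for maintenance."],
    ];
    $code = (int) $errno;
    if (!isset($pages[$code])) {
        $code = 404;
    }
    return [
        "code"  => $code,
        "title" => $pages[$code][0],
        "text"  => $pages[$code][1],
    ];
}

// Example wiring in /error.php (commented out so the function can be tested alone):
// $page = errorPageFor($_GET["errno"] ?? 404);
// print "<h1>" . $page["code"] . " " . htmlspecialchars($page["title"]) . "</h1>";
// print "<p>" . htmlspecialchars($page["text"]) . "</p>";
```

<p>Add links to the home page and the sitemap below the message, with absolute URLs of course.</p>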
<p>Last but not least create a robots.txt file in the root. If you&#8217;ve nothing to hide from search engine crawlers, this one will suffice: <code><b><br />
User-agent: *<br />
Disallow:<br />
Allow: /<br />
</b></code></p>
<p>I&#8217;m aware that this tiny guide can&#8217;t cover everything. It should give you an idea of the pitfalls and possible solutions. If you&#8217;re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don&#8217;t view the source code of <a href="http://link-condom.com/">link-condom.com</a> pages where I didn&#8217;t implement all tips from this tutorial. <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-migrate-a-website-from-iis-asp-to-apache-php/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Shit happens, your redirects hit the fan!</title>
		<link>http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/</link>
		<comments>http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/#comments</comments>
		<pubDate>Wed, 26 Sep 2007 20:35:27 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[404grabber]]></category>

		<category><![CDATA[Redirects]]></category>

		<category><![CDATA[Web development]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/</guid>
		<description><![CDATA[Although robust search engine crawlers are rather fault-tolerant creatures, there is an often overlooked but quite safe procedure to piss off the spiders. Playing redirect ping pong mostly results in unindexed contents. Google reports chained redirects under the initially requested URL as URLs not followed due to redirect errors, and recommends:
Minimize the number of redirects [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sebastians-pamphlets.com/img/posts/confuse-the-hell-out-of-the-spiders.png" border="0" width="200" height="168" align="right" alt="confused spider" title="Don't confuse the spiders! They might puke ..."  />Although robust search engine crawlers are rather fault-tolerant creatures, there is an often overlooked but quite safe procedure to piss off the spiders. Playing redirect ping pong mostly results in unindexed contents. Google reports chained redirects under the initially requested URL as <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35156">URLs not followed</a> due to <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=35157">redirect errors</a>, and recommends:<br />
<blockquote>Minimize the number of redirects needed to follow a link from one page to another.</blockquote></p>
<p>The same goes for other search engines, they can&#8217;t handle longish chains of redirecting URLs. In other words: all search engines consider URLs involved in longish redirect chains unreliable, not trustworthy, low quality &#8230;</p>
<p>What&#8217;s that to you? Well, you might play redirect ping pong with search engine crawlers unknowingly. If you&#8217;ve ever redesigned a site, chances are you&#8217;ve built chained redirects. In most cases those chains aren&#8217;t too complex, but it&#8217;s worth checking. Bear in mind that Apache, .htaccess, scripts or CMS software and whatnot can perform redirects, often without notice and undetectable with a browser. </p>
<p>I made up this example, but I&#8217;ve seen worse redirect chains. Here is the transcript of Ms. Googlebot&#8217;s chat with your Web server:<br />
<img src="http://sebastians-pamphlets.com/img/posts/redirect-chains-crawler-server-dialog.png" border="0" width="498" height="277" align="center" alt="crappy redirect chain" title="Avoid chained redirects!"  /></p>
<p class="question"><span style="text-transform:uppercase; text-decoration:overline;">Googlebot:</span> Now that&#8217;s a nice link I&#8217;ve discovered on this old and trusted page. I can&#8217;t wait to fetch it. Hey port 80 at yourstuff.com, would you please be so kind to serve me <code>/some-page</code>?</p>
<p class="answer"><span style="text-transform:uppercase; text-decoration:overline; font-weight:bold;">.htaccess:</span> Oh silly Googlebot, don&#8217;t you read <a href="http://www.mattcutts.com/blog/seo-advice-url-canonicalization/">Matt&#8217;s blog</a>? He told me that a 301 redirect is the canonical answer when someone requests my stuff without the www-prefix. I didn&#8217;t bother to look up the resource you&#8217;ve asked for, and why should I since your request is wrong, wrong, wrong! Here is the canonical URL: <code>301-Moved permanently, Location: http://www.yourstuff.com/some-page</code>.</p>
<p class="question"  id="incomplete-uri"><span style="text-transform:uppercase; text-decoration:overline;">Googlebot:</span> Dear Web server, of course that&#8217;s the best thing to do. I apologize for my silly request, but please note that I&#8217;m following a link where the Webmaster of the authority site forgot the dup-dup-dup stuff in the HREF attribute&#8217;s value. Can you please serve me <code>/some-page</code> now?</p>
<p class="answer"><span style="text-transform:uppercase; text-decoration:overline; font-weight:bold;">Apache:</span> Doh! Why does this silly bot bother me with an incomplete URI? Since <code>/some-page</code> is not an existing file, probably it&#8217;s a directory and a brain dead Webmaster forgot the trailing slash. Let&#8217;s see &#8230; yep, that&#8217;s it, and it even has a default document <code>index.php</code>. Dear Ms. Googlebot, here is the correct URL: <code>302-Found elsewhere, Location: http://www.yourstuff.com/some-page/index.php</code>.</p>
<p><small>[Most servers will respond with <code>301-Moved permanently, Location: http://www.yourstuff.com/some-page/</code>, but however the server is configured, it will be a redirect.]</small></p>
<p class="question"><span style="text-transform:uppercase; text-decoration:overline;">Googlebot:</span> Oh I see, I&#8217;ve got a really malformed URI. Sorry for bothering you again, could you please deliver the content of <code>http://www.yourstuff.com/some-page/index.php</code>? Thanks in advance!</p>
<p class="answer"><span style="text-transform:uppercase; text-decoration:overline; font-weight:bold;">/some-page/index.php:</span> Uuaahhh &#8230; who&#8217;s knocking? I slept sooo long &#8230; not fully awake &#8230; please be patient while Apache hands me over to the PHP parser. Actually, I&#8217;m totally outdated because my Webmaster discovered some blogging software and moved my contents to a WordPress database. Please wait. Where is the darn post-it with the new address? Ahh, just found it under my pillow. Please don&#8217;t disturb me again, bye bye, and here you go: <code>301-Moved permanently, Location: http://www.yourstuff.com/blog/some-post/</code>.</p>
<p class="question"><span style="text-transform:uppercase; text-decoration:overline;">Googlebot:</span> You&#8217;re kidding! Why didn&#8217;t you give me this URI in the first place? Never mind, I want <code>http://www.yourstuff.com/blog/some-post/</code> now.  </p>
<p class="answer"><span style="text-transform:uppercase; text-decoration:overline; font-weight:bold;">Hack in /blog/index.php:</span> Oh my god, another bot following outdated links. My Webmaster is clever, he read somewhere that category names in URLs boost search engine rankings and changed the permalink structure to category/postname. Wait a moment, I&#8217;m looking up the category for this post, be right back. Hmmm, this category name is not exactly a money term &#8230; the heck with it! Here is the keyword stuffed URL I just made for you: <code>301-Moved permanently, Location: http://www.yourstuff.com/blog/cat/some-post/</code>.</p>
<p class="question"><span style="text-transform:uppercase; text-decoration:overline;">Googlebot:</span> Four crappy redirects in a row, what a bunch of bull! This is my ultimately last try: gimme <code>http://www.yourstuff.com/blog/cat/some-post/</code>!</p>
<p class="answer"><span style="text-transform:uppercase; text-decoration:overline; font-weight:bold;">WordPress:</span> See, that didn&#8217;t hurt, did it? Here you are: <code>200-OK, the content</code></p>
<p class="question"><span style="text-transform:uppercase; text-decoration:overline;">Googlebot:</span> Here&#8217;s your shabby page. Don&#8217;t send me to these assclowns again! I&#8217;ve better things to do than running five times to this misconfigured server only to fetch a keyword laden sales pitch with 34 grammar errors, 67 typos, and a ton of affiliate links. Grrrr! </p>
<p class="answer"><span style="text-transform:uppercase; text-decoration:overline; font-weight:bold;">Crawl scheduler:</span> Sorry dear. I can&#8217;t blacklist them for stuff like that, but I&#8217;ve set the crawl frequency for yourstuff.com to once a year, and I&#8217;ve red-flagged the document ID so that the indexer can downrank it accordingly.</p>
<p>Do you really want to treat Ms. Googlebot so badly? Not to mention the minus points you earn for playing redirect ping pong with a search engine. Maybe most search engines index a page served after four redirects, but I wouldn&#8217;t rely on such a redirect chain. It&#8217;s quite easy to shorten it. Just delete outdated stuff so that all requests run into a 404-Not found, then write up a list in a format like<br />
<table align="center" style="margin-left:20px; margin-bottom:15px;" cellpadding="5" cellspacing="5">
<tr>
<td><code>Old URI 1</code></td>
<td>Delimiter</td>
<td><code>New URI 1</code></td>
<td>\n</td>
</tr>
<tr>
<td><code>Old URI 2</code></td>
<td>Delimiter</td>
<td><code>New URI 2</code></td>
<td>\n</td>
</tr>
<tr>
<td>&nbsp;&nbsp;&#8230;</td>
<td>Delimiter</td>
<td>&nbsp;&nbsp;&#8230;</td>
<td>\n</td>
</tr>
</table>
<p><span style="margin-top:30px;"></span>and write a simple redirect script which reads this file and performs a 301 redirect to <code>New URI</code> when <code>REQUEST_URI == Old URI</code>. If <code>REQUEST_URI</code> doesn&#8217;t match any entry, then send a 404 header and include your actual error page. If you need to change the final URLs later on, you can easily do that in the text file&#8217;s right column with search and replace.</p>
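<p>A minimal sketch of that lookup logic, written in Python rather than PHP to keep it short. The tab delimiter, the function names <code>load_redirect_map</code> and <code>resolve</code>, and the sample rows are assumptions for illustration only; a real script would take <code>REQUEST_URI</code> from the server environment and send the headers and error page itself.</p>

```python
import io

DELIMITER = "\t"  # assumption: a tab separates old URI and new URI


def load_redirect_map(lines):
    """Parse rows of 'old URI, delimiter, new URI' into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or DELIMITER not in line:
            continue  # skip blank or malformed rows
        old_uri, new_uri = line.split(DELIMITER, 1)
        mapping[old_uri.strip()] = new_uri.strip()
    return mapping


def resolve(request_uri, mapping):
    """Return (status, location): 301 plus the new URI, or 404 plus None."""
    new_uri = mapping.get(request_uri)
    if new_uri is not None:
        return 301, new_uri
    return 404, None


# Hypothetical sample data standing in for the flat file on disk.
SAMPLE_FILE = io.StringIO(
    "/some-page\t/blog/cat/some-post/\n"
    "/some-page/index.php\t/blog/cat/some-post/\n"
)
REDIRECTS = load_redirect_map(SAMPLE_FILE)

print(resolve("/some-page", REDIRECTS))
print(resolve("/no-such-page", REDIRECTS))
```

<p>Loading the file into a dictionary keeps each lookup cheap, but the file still gets parsed on every request in this sketch.</p>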
<p>Next point the <code>ErrorDocument 404</code> directive in your root&#8217;s .htaccess file to this script. Done. Not counting possible www/non-www canonicalization redirects, you&#8217;ve reduced the number of redirects to one, regardless of how often you&#8217;ve moved your pages. Don&#8217;t forget to add all outdated URLs to the list when you redesign your stuff again, and cover common 3rd party sins like truncated trailing slashes too. The flat file from the example above would look like:</p>
<table  align="center" style="margin-left:20px; margin-bottom:15px;"  cellpadding="5" cellspacing="5">
<tr>
<td><code>/some-page</code></td>
<td>Delimiter</td>
<td><code>/blog/cat/some-post/</code></td>
<td>\n</td>
</tr>
<tr>
<td><code>/some-page/</code></td>
<td>Delimiter</td>
<td><code>/blog/cat/some-post/</code></td>
<td>\n</td>
</tr>
<tr>
<td><code>/some-page/index.php</code></td>
<td>Delimiter</td>
<td><code>/blog/cat/some-post/</code></td>
<td>\n</td>
</tr>
<tr>
<td><code>/blog/some-post</code></td>
<td>Delimiter</td>
<td><code>/blog/cat/some-post/</code></td>
<td>\n</td>
</tr>
<tr>
<td><code>/blog/some-post/</code></td>
<td>Delimiter</td>
<td><code>/blog/cat/some-post/</code></td>
<td>\n</td>
</tr>
<tr>
<td>&nbsp;&nbsp;&#8230;</td>
<td>Delimiter</td>
<td>&nbsp;&nbsp;&#8230;</td>
<td>\n</td>
</tr>
</table>
<p><span style="margin-top:30px;"></span>With a large site consider a database table instead, because scanning a huge flat file on every 404 error slows down each of those responses. Also, if you&#8217;ve got patterns like <code>/blog/post-name/ ==&gt; /blog/cat/post-name/</code>, don&#8217;t generate and process longish mapping tables but cover these redirects algorithmically.</p>
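<p>The pattern rule above could be sketched like this, again in Python for brevity. The regular expression assumes old permalinks have exactly one path segment after <code>/blog/</code>, and <code>CATEGORY_OF</code> is a hypothetical stand-in for whatever category lookup your blog software provides.</p>

```python
import re

# Assumption: old permalinks look like /blog/post-name or /blog/post-name/
OLD_POST = re.compile(r"^/blog/([^/]+)/?$")

# Hypothetical post-to-category lookup; a real blog would query its database.
CATEGORY_OF = {"some-post": "cat"}


def rewrite_old_permalink(request_uri):
    """Map /blog/post-name[/] to /blog/category/post-name/, else None."""
    match = OLD_POST.match(request_uri)
    if match is None:
        return None  # not an old-style permalink
    post_name = match.group(1)
    category = CATEGORY_OF.get(post_name)
    if category is None:
        return None  # unknown post: let the caller fall through to the 404
    return f"/blog/{category}/{post_name}/"


print(rewrite_old_permalink("/blog/some-post"))
```

<p>One rule like this covers both the slashed and the slash-truncated variant of every old post URL, so neither needs a row in the mapping file.</p>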
<p>To gather URLs worth a 301 redirect use these sources:</p>
<ul>
<li>Your server logs.</li>
<li>404/301/302/&#8230; reports from your server stats.</li>
<li>Google&#8217;s <a href="https://www.google.com/webmasters/tools/webcrawlerrors?siteUrl=http://example.com/&#038;hl=en" rel="example nofollow">Web crawl error reports</a>.</li>
<li>Tools like <a href="http://home.snafu.de/tilman/xenulink.html">XENU&#8217;s Link Sleuth</a> which crawl your site and output broken links as well as all sorts of redirects, and can even check your complete Web space for orphans.</li>
<li>Sitemaps of outdated structures/site areas.</li>
<li><a href="http://www.seoconsultants.com/tools/headers/">Server header checkers</a> which follow all redirects to the final destination.</li>
<li>&#8230;</li>
</ul>
<p><b>Disclaimer:</b> If you suffer from IIS/ASP, free hosts, restrictive hosts like Yahoo or other serious maladies, this post is not for you.</p>
<p><b>I&#8217;m curious, <strike>does</strike> did your site play redirect ping pong with search engine crawlers?</b> </p>
<hr />Copyright &copy; 2008 <strong><a href="http://sebastians-pamphlets.com/">Sebastian`s Pamphlets</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator/feed reader, the site you are looking at is guilty of copyright infringement and will be put down immediately. Please contact sebastians-pamphlets.com so we can take legal action immediately.<br /><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/how-to-avoid-troubles-caused-by-chained-redirects/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Getting the most out of Google&#8217;s 404 stats</title>
		<link>http://sebastians-pamphlets.com/getting-the-most-out-of-googles-404-stats/</link>
		<comments>http://sebastians-pamphlets.com/getting-the-most-out-of-googles-404-stats/#comments</comments>
		<pubDate>Mon, 16 Jul 2007 23:10:00 +0000</pubDate>
		<dc:creator>Sebastian</dc:creator>
		
		<category><![CDATA[Tools]]></category>

		<category><![CDATA[Testing]]></category>

		<category><![CDATA[404grabber]]></category>

		<category><![CDATA[Hotlinking]]></category>

		<category><![CDATA[.htaccess]]></category>

		<category><![CDATA[SEO]]></category>

		<category><![CDATA[Webmaster Central]]></category>

		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://sebastians-pamphlets.com/getting-the-most-out-of-googles-404-stats/</guid>
		<description><![CDATA[The 404 reports in Google&#8217;s Webmaster Central panel are great to debug your site, but they contain URLs generated by invalid &#8211;respectively truncated&#8211; URL drops or typos of other Webmasters too. Are you sick of wasting the link love from invalid inbound links, just because you lack a suitable procedure to 301-redirect all these 404 [...]]]></description>
			<content:encoded><![CDATA[<p>The 404 reports in Google&#8217;s Webmaster Central panel are great for debugging your site, but they also contain URLs generated by invalid (or truncated) URL drops and typos of other Webmasters. Are you sick of wasting the link love from invalid inbound links, just because you lack a suitable procedure to 301-redirect all these 404 errors to canonical URLs? </p>
<p>Your pain ends here. At least when you&#8217;re on a *ix server running Apache with PHP 4+ or 5+ and .htaccess enabled. (If you suffer from IIS <a href="http://www.google.com/search?num=100&#038;hl=en&#038;safe=off&#038;q=funeral+services">go</a> search for another hobby.)</p>
<p>I&#8217;ve developed a tool which grabs all 404 requests, letting you map a canonical URL to each 404 error. The tool captures and records 404s, and you can add invalid URLs from Google&#8217;s 404-reports, if these aren&#8217;t recorded (yet) from requests by Ms. Googlebot.  </p>
<p>It&#8217;s kind of a layer between your standard 404 handling and your error page. If a request results in a 404 error, your .htaccess calls the tool instead of the error page. If you&#8217;ve assigned a canonical URL to an invalid URL, the tool 301-redirects the request to the canonical URL. Otherwise it sends a 404 header and outputs your standard 404 error page. Google&#8217;s 404-probe requests during the Webmaster Tools verification procedure are unredirectable (is this a word?).</p>
<p>Besides 1:1 mappings of invalid URLs to canonical URLs you can assign keywords to canonical URLs. For example you can define that all invalid requests go to <code>/fruit</code> when the requested URI or the HTTP referrer (usually a SERP) contains one of the strings &#8220;apple&#8221;, &#8220;orange&#8221;, &#8220;banana&#8221; or &#8220;strawberry&#8221;. If there&#8217;s no persistent mapping, these requests get 302-redirected to the guessed canonical URL, so you should view the redirect log frequently to find invalid URLs which deserve a persistent 301-redirect.</p>
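<p>A rough sketch of that keyword fallback, in Python for brevity and not the tool&#8217;s actual code. The names <code>KEYWORD_RULES</code> and <code>guess_redirect</code> are made up for illustration, and the case-insensitive substring match is an assumption.</p>

```python
# Hypothetical keyword rules: canonical URL mapped to the trigger strings.
KEYWORD_RULES = {
    "/fruit": ("apple", "orange", "banana", "strawberry"),
}


def guess_redirect(request_uri, referrer, persistent_map):
    """Return (status, location) for a request that ended up as a 404."""
    if request_uri in persistent_map:
        return 301, persistent_map[request_uri]  # persistent 1:1 mapping
    # Assumption: keywords match case-insensitively in URI or referrer.
    haystack = f"{request_uri} {referrer or ''}".lower()
    for canonical_url, keywords in KEYWORD_RULES.items():
        if any(keyword in haystack for keyword in keywords):
            return 302, canonical_url  # only a guess, so keep it temporary
    return 404, None


print(guess_redirect("/aple-pie", "http://search.example/?q=apple+pie", {}))
```

<p>The guessed target gets a 302 because a 301 would tell crawlers that the guess is final.</p>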
<p>Next there are tons of bogus requests from spambots searching for exploits or whatever, or hotlinkers, resulting in 404 errors, where it makes no sense to maintain URL mappings. Just update an ignore list to make sure those get 301-redirected to <code>example.com/goFuckYourself</code> or a cruel and scary image hosted on your domain or a free host of your choice. </p>
<p>Everything not matching a persistent redirect rule or an expression ends up in a 404 response, as before, but logged so that you can define a mapping to a canonical URL. Also, you can use this tool when you plan to change (a lot of) URLs: it can 301-redirect each old URL to the new one without adding rules to your .htaccess file.</p>
<p>I&#8217;ve tested this tool for a while on a couple of smaller sites and I think it can get trained to run smoothly without too many edits once the ignore lists et cetera are up to date, that is, matching the site&#8217;s requirements. A couple of friends got the script and they will provide useful input. Thanks! <a href="http://www.smart-it-consulting.com/contact.htm?cSubject=404Grabber_BETA">If you&#8217;d like to join the BETA test drop me a message</a>. </p>
<p>Disclaimer: All data get stored in flat files. With large sites we&#8217;d need to change that to a database. The UI sucks, I mean it&#8217;s usable but it comes with the browser&#8217;s default fonts and all that. IOW the current version is still in the stage of &#8220;proof of concept&#8221;. But it works just fine <img src='http://sebastians-pamphlets.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /></p>
]]></content:encoded>
			<wfw:commentRss>http://sebastians-pamphlets.com/getting-the-most-out-of-googles-404-stats/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
