IIS

Archived posts from the 'IIS' Category

Upgrading from IIS/ASP to Apache/PHP

Posted on 11 December, 2007

Once you’re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable open source software. On an Apache Web server with PHP your .asp scripts won’t work, and you can’t run MS-Access “databases” and such stuff under Apache.

Here is my idea of a smooth migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster’s portfolio and start over.

(Recently I got a tiny IIS/ASP site about uses & abuses of link condoms and moved it to an Apache server. I’m well known for brutal IIS rants, but so far I didn’t discuss a way out of such a dilemma, so I thought blogging this move could be a good idea.)

I don’t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer’s article Migrating from Microsoft Access/MS-SQL to MySQL, and try tools like ASP to PHP. (With my tiny link condom site I overwrote the ASP code with PHP statements in my primitive text editor.)

From an SEO perspective such an upgrade comes with pitfalls:

Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.
Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.
Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.
IIS URIs are not case sensitive, which means that /Default.asp will 404 on Apache when the file name is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.
Most probably search engines have URL variants in their indexes, so we want to adapt their URL canonicalization, at least where possible.
HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.
If the navigation makes use of relative links, we need to convert those to absolute URLs.
Error handling isn’t convertible. Improper error handling can cause decreasing search engine traffic.

Running /default.asp, /home.asp etc. as PHP scripts

When you upload an .asp file to an Apache Web server, most user agents can’t handle it. Browsers treat .asp files as an unknown file type and force a download instead of rendering them. On top of that, Apache doesn’t parse those files for PHP statements, even when you’ve already rewritten the ASP code in PHP.

To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root:

    AddType text/html .asp
    AddHandler application/x-httpd-php .asp

The first line says that .asp files shall be treated as HTML documents, and should force the server to send a Content-Type: text/html HTTP header. The second line tells Apache that it must parse .asp files for PHP code.

Just in case the AddType statement above doesn’t produce a Content-Type: text/html header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you’ve mod_headers available, you can accomplish that with this .htaccess code:

    <IfModule mod_headers.c>
    SetEnvIf Request_URI \.asp is_asp=is_asp
    Header set "Content-type" "text/html" env=is_asp
    Header set imagetoolbar "no"
    </IfModule>

(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)
If for some reason mod_headers doesn’t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top:

    @header("Content-type: text/html", TRUE);

Instead of “text/html” alone, you can define the character set too: “text/html; charset=UTF-8”

Sanitizing the home page URL by eliminating “default.asp”

Instead of slowing down Apache by defining just another default document name (DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp), we get rid of “/default.asp” with this “/index.php” script:

    <?php @require("default.asp"); ?>

Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.

Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess:

    Redirect 301 /default.asp http://example.com/
    Redirect 301 /Default.asp http://example.com/

Converting the ASP code for server name canonicalization

If you find ASP canonicalization routines like

    <%@ Language=VBScript %>
    <%
    if strcomp(Request.ServerVariables("SERVER_NAME"), "www.example.com", vbCompareText) = 0 then
        Response.Clear
        Response.Status = "301 Moved Permanently"
        strNewUrl = Request.ServerVariables("URL")
        if instr(1,strNewUrl, "/default.asp", vbCompareText) > 0 then
            strNewUrl = replace(strNewUrl, "/Default.asp", "/")
            strNewUrl = replace(strNewUrl, "/default.asp", "/")
        end if
        if Request.QueryString <> "" then
            Response.AddHeader "Location","http://example.com" & strNewUrl & "?" & Request.QueryString
        else
            Response.AddHeader "Location","http://example.com" & strNewUrl
        end if
        Response.End
    end if
    %>

(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too:

    RewriteEngine On
    RewriteCond %{HTTP_HOST} !^example\.com [NC]
    RewriteRule (.*) http://example.com/$1 [R=301,L]

(You need mod_rewrite, which is usually enabled with the default configuration of Apache Web servers.)

Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value

Probably a M$ developer didn’t read more than the scheme and server name chapter of the URL/URI standards, at least I’ve no better explanation for the fact that these clowns made the path and query string segment of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)

Just because -contrary to Web standards- M$ finds it funny to serve the same contents on request of /Home.asp as well as /home.ASP, such crap doesn’t fly on the World Wide Web. Search engines -and other Web services which store URLs- treat them as different URLs, and consider everything except one version duplicate content.

Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like “/Script.asp”, although the file itself is stored with an all-lowercase name, and the FTP client uploads “/script.asp” to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won’t get 404-Not found responses.

Don’t misunderstand me, you’re free to camel-case file names for improved readability, but then make sure that the file system’s notation matches the URIs in HREF/SRC values. (Of course hyphenated file names like “buy-cheap-viagra.asp” top the CamelCased version “BuyCheapViagra.asp” when it comes to search engine rankings, but don’t freak out about keywords in URLs, that’s ranking factor #202 or so.)

Technically speaking, converting all file names, variable names and values to all-lowercase is the simplest solution. This way it’s quite easy to 301-redirect all invalid requests to the canonical URLs.

However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (MSN Live Search for example doesn’t follow permanent redirects and doesn’t pass the reputation earned by the old URL over to the new URL). So if you’ve good SERP positions for “misspelled” URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site’s server stats to identify all URLs you don’t want to redirect. By the way, the link reports in Google’s Webmaster Console and Yahoo’s Site Explorer reveal invalid URLs with (internal as well as external) inbound links too.

Whatever strategy fits your needs best, you’ve to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive:

    ErrorDocument 404 /404handler.php

That’s safe with static URLs without parameters and should work with dynamic URIs too. When you -in some cases- deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier:

    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /404handler.php [L]
    </IfModule>

In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn’t exist.

And here is the PHP script /404handler.php:

    <?php
    // 404handler.php
    // called from .htaccess if the requested path doesn't exist
    $thisFileName = "404handler.php"; // change this
    $canonicalScheme = "http://";
    $canonicalServer = "example.com"; // change this
    $errorPageUri = "/error.asp"; // change this
    $documentRoot = $_SERVER["DOCUMENT_ROOT"];
    $requestUri = $_SERVER["REQUEST_URI"];
    $canonicalUri = "";
    $requestedUrl = $canonicalScheme .$canonicalServer .$requestUri;
    $canonicalUrl = "";
    $url = parse_url($requestedUrl);
    $requestPath = $url["path"];
    $includeScript = "";
    // parse_url() omits the "query" key when the request has no query string
    $queryString = isset($url["query"]) ? $url["query"] : "";

    // keep misspelled URIs with nice search engine rankings
    if ("$requestPath" == "/Sample.asp") { // change this
        $includeScript = $documentRoot ."/sample.asp"; // change this
    }
    // …
    if (!empty($includeScript)) {
        @header("HTTP/1.1 200 OK", TRUE, 200);
        @include($includeScript);
        exit;
    }

    // if the lowercase version exists, redirect to it
    $lcPath = strtolower($url["path"]);
    $lcFile = $documentRoot .$lcPath;
    if (file_exists($lcFile) && !stristr($requestUri,$thisFileName)) {
        $canonicalUrl = $canonicalScheme .$canonicalServer .$lcPath;
        if ($queryString) {
            $canonicalUrl .= "?" .$queryString;
        }
        if (!empty($url["fragment"])) {
            $canonicalUrl .= "#" .$url["fragment"];
        }
    }
    if (!empty($canonicalUrl)) {
        @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
        @header("Location: $canonicalUrl");
        exit;
    }

    // serve the 404 error page
    @header("HTTP/1.1 404 Not found", TRUE, 404);
    @include($documentRoot .$errorPageUri);
    exit;
    ?>

(Edit the values in all lines marked with “// change this”.)

This script doesn’t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you’ve more than a dozen or so.
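
What query string canonicalization looks like depends on the site, but the core of it fits in a few lines. The following is only a sketch, assuming that all parameter names and values of the site are meant to be lowercase and that the order of parameters doesn’t matter. The function name canonicalQueryString is made up for illustration and is not part of the script above.

```php
<?php
// Hypothetical helper, not part of 404handler.php.
// Assumption: every parameter name and value is canonical in lowercase.
function canonicalQueryString($queryString)
{
    $params = array();
    parse_str($queryString, $params);   // "ID=Value&Page=2" becomes an array
    $canonical = array();
    foreach ($params as $name => $value) {
        if (is_array($value)) {
            continue; // nested parameters like a[]=1 are out of scope here
        }
        $canonical[strtolower($name)] = strtolower($value);
    }
    ksort($canonical);                  // fixed parameter order
    return http_build_query($canonical);
}
```

A script included at the top of every page could compare the result with the requested query string and send a 301 when the two differ.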

Let’s see what the /404handler.php script does with requests of non-existing files.

First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don’t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn’t rank for anything but illustrates the functionality): both /sample.asp and /Sample.asp deliver the same content, although there’s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordingly.

Next we look up the all-lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: /About.asp 301-redirects to /about.asp, which is the file that exists.

Finally, if everything we tried to find a suitable URI for the actual request failed, we send the client a 404 error code and output the error page. Example: /gimme404.asp doesn’t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp, but /error.asp directly requested responds with a 200-OK.

You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project’s needs.
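
One such refinement is replacing the if statement for “/Sample.asp” with a lookup table. This is a sketch with made-up paths and a hypothetical function name; once you’ve more than a dozen or so entries, a database table is the better home for the mappings.

```php
<?php
// Hypothetical mapping of well-ranked "misspelled" paths to existing scripts.
// The entries are examples only.
$keepAsIs = array(
    "/Sample.asp"  => "/sample.asp",
    "/Contact.ASP" => "/contact.asp",
);

// Returns the script to include for a kept path, or "" if there is none.
function lookupIncludeScript($requestPath, $map, $documentRoot)
{
    if (isset($map[$requestPath])) {
        return $documentRoot . $map[$requestPath];
    }
    return "";
}
```

In the handler, $includeScript = lookupIncludeScript($requestPath, $keepAsIs, $documentRoot); would then take the place of the hard-coded comparison.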

Tweaking code for future maintenance

Legacy code comes with repetition, redundancy and duplication caused by developers who love copy+paste or copy+paste+modify, or by Web design software that generates static files from templates. Even when you’re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, which gives you the opportunity to encapsulate all templated page areas.

Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <?php @include("header.php"); ?>, <?php @include("sidebar.php"); ?>, <?php @include("footer.php"); ?> and so on. Note that PHP outputs the contents of an include file as plain HTML unless told otherwise, so any PHP statements in an included file must be wrapped in <?php ?> tags. Also, leading spaces, empty lines and the like, which don’t hurt in HTML, can result in errors with PHP statements like header(), because those fail once the server has sent anything to the user agent (even a single space, new line or tab is too much).
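
To track down stray output of that kind, PHP’s headers_sent() can report the file and line where the output started. A small sketch; the wrapper name safeHeader is hypothetical:

```php
<?php
// Hypothetical wrapper: sends a header only if no output has gone out yet,
// otherwise logs where the output started and returns false.
function safeHeader($headerLine)
{
    $file = "";
    $line = 0;
    if (headers_sent($file, $line)) {
        error_log("Cannot send '$headerLine': output started in $file on line $line");
        return false;
    }
    header($headerLine);
    return true;
}
```

Unlike a silenced @header() call, this leaves a trace in the error log that points at the include file with the offending whitespace.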

It’s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<?php @include("top.php"); ?> […] <?php @include("bottom.php"); ?>) to all scripts. Tomorrow you’ll write a generic routine that you must have in all scripts, and you’ll happily do that in top.php. The day after tomorrow you’ll paste the GoogleAnalytics tracking code into bottom.php. With complex sites you need more hooks.

Using absolute URLs on different systems

Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn’t work with absolute URLs. Crap. Of course it works. The first statement in top.php is

    @require($_SERVER["SERVER_NAME"] .".php");

This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you’ve a “dev.example.com.php” include file, on the production system example.com the file name resolves to “www.example.com.php”:

    <?php $baseUrl = "http://example.com"; ?>

Then the menu in sidebar.php looks like:

    <?php
    $classVMenu = "vmenu";
    print "
    <img src=\"$baseUrl/vmenuheader.png\" width=\"128\" height=\"16\" alt=\"MENU\" />
    <ul>
    <li><a class=\"$classVMenu\" href=\"$baseUrl/\">Home</a></li>
    <li><a class=\"$classVMenu\" href=\"$baseUrl/contact.asp\">Contact</a></li>
    <li><a class=\"$classVMenu\" href=\"$baseUrl/sitemap.asp\">Sitemap</a></li>
    …
    </ul>
    ";
    ?>

Mixing X/HTML with server-side scripting languages is error-prone and makes maintenance a nightmare. Don’t make the same mistake as WordPress. Avoid crap like that:

    <li><a class="<?php print $classVMenu; ?>" href="<?php print $baseUrl; ?>/contact.asp"></a></li>
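
If the print statement in sidebar.php grows unwieldy, the list items can be generated from an array instead. This is a sketch with a hypothetical renderMenu() function; the paths are the ones from the menu above, and $baseUrl is assumed to be set by the per-server include file.

```php
<?php
// Hypothetical helper: builds the menu list from an array of path => label.
function renderMenu($baseUrl, $items, $cssClass = "vmenu")
{
    $html = "<ul>\n";
    foreach ($items as $path => $label) {
        $href = htmlspecialchars($baseUrl . $path);
        $text = htmlspecialchars($label);
        $html .= "<li><a class=\"$cssClass\" href=\"$href\">$text</a></li>\n";
    }
    return $html . "</ul>\n";
}

// Example call with the menu entries used above:
// print renderMenu($baseUrl, array(
//     "/"            => "Home",
//     "/contact.asp" => "Contact",
//     "/sitemap.asp" => "Sitemap",
// ));
```

Adding a page to the menu then means adding one array entry, and the escaping happens in one place.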

Error handling

I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root’s .htaccess file:

    ErrorDocument 401 /get-the-fuck-outta-here.asp
    ErrorDocument 403 /get-the-fudge-outta-here.asp
    ErrorDocument 404 /404handler.php
    ErrorDocument 410 /410-gone-forever.asp
    ErrorDocument 503 /503-down-for-maintenance.asp
    # …
    Options -Indexes

Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script:

    ErrorDocument 401 /error.php?errno=401
    ErrorDocument 403 /error.php?errno=403
    ErrorDocument 404 /404handler.php
    ErrorDocument 410 /error.php?errno=410
    ErrorDocument 503 /error.php?errno=503
    # …
    Options -Indexes

Note that relative URLs in pages or scripts called by ErrorDocument directives don’t work. Don’t use absolute URLs in the ErrorDocument directives themselves, because this way you get 302 response codes for 404 errors and crap like that. If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that’s bad enough.) For more information please read my pamphlet Why error handling is important.
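
A single error script along those lines might look like the sketch below. The file name /error.php and the errno parameter come from the directives above; the function name errorStatus, the fallback to 404 for unknown codes and the bare-bones output are assumptions.

```php
<?php
// Sketch of /error.php. Maps the errno parameter to a status line;
// unknown or missing codes fall back to 404.
function errorStatus($errno)
{
    $statuses = array(
        401 => "Unauthorized",
        403 => "Forbidden",
        404 => "Not Found",
        410 => "Gone",
        503 => "Service Unavailable",
    );
    $code = (int) $errno;
    if (!isset($statuses[$code])) {
        $code = 404;
    }
    return array($code, $statuses[$code]);
}

// Only send headers and output when running under a Web server.
if (PHP_SAPI !== "cli" && !headers_sent()) {
    list($code, $text) = errorStatus(isset($_GET["errno"]) ? $_GET["errno"] : 404);
    header("HTTP/1.1 $code $text", true, $code);
    print "<h1>$code $text</h1>"; // replace with a proper error page
}
```

The explicit status header keeps the script from answering with a 200 when somebody requests /error.php directly.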

Last but not least create a robots.txt file in the root. If you’ve nothing to hide from search engine crawlers, this one will suffice:

    User-agent: *
    Disallow:
    Allow: /

I’m aware that this tiny guide can’t cover everything. It should give you an idea of the pitfalls and possible solutions. If you’re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don’t view the source code of link-condom.com pages where I didn’t implement all tips from this tutorial.


22 comments Sebastian | 404grabber, Duplicate Content, Redirects, Web development, Copy+Paste-Penalties, .htaccess, IIS, SEO

Better don’t run a web server under Windows

Posted on 11 April, 2007

IIS defaults can produce serious troubles with search engines. That’s a common problem and not even all .nhs.uk (UK Government National Health Service) admins have spotted it. I’ve alerted the Whipps Cross University Hospital but can’t email all NHS sites suffering from IIS and lazy or uninformed webmasters. So here’s the fix:

Create a Web site for the server name without the subdomain (domain.nhs.uk), then go to the “Home Directory” tab and select the option “Redirection to a URL”. As “Redirect to” enter the destination, for example “http://www.domain.nhs.uk$S$Q”, without a slash after “.uk” because the path ($S placeholder) begins with a slash. The $Q placeholder represents the query string. Next check “Exact URL entered above” and “Permanent redirection for this resource”, and submit. Test the redirection with a suitable tool.

Now when a user enters a URL without the “www” prefix s/he gets the requested page from the canonical server name. Also search engine crawlers following non-canonical links like http://whippsx.nhs.uk/ will transmit the link love to the desired URL, and will index more pages instead of deleting them in their search indexes after a while because the server is not reachable. I’m not joking. Under some circumstances all or many www-URLs of pages referenced by relative links resolving to the non-existent server will get deleted in the search index after a couple of unsuccessful attempts to fetch them without the www-prefix.

Hat tip to Robbo
Tags: Search Engine Optimization (SEO) IIS National Health Service (NHS) UK


Be the first to comment Sebastian | MSN, IIS, Crap, SEO