Archived posts from the '404grabber' Category

How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily understand it as a part of my ongoing holy war on URI shorteners.

Assuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At first sight, searching for a free URI shortener script to implement on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.

So if you’re not running cheap-student-loans-with-debt-consolidation-on-each-payday-is-a-must-have-for-sexual-heroes-desperate-for-a-viagra-overdose-and-extreme-penis-length-enhancement.info and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the ability to do way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example, when you operate a WordPress blog, you could code nearly everything below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

To provide enough context to get the advantages of a fully integrated URI shortener vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:

  • Introduction
  • Block rogue bots
  • Server name canonicalization
  • Deliver static stuff (images …)
  • Execute script (dynamic URI)
  • Resolve shortened URI
  • Excursus: URI shortener components
  • Redirect to destination (invalid request)
  • Guess destination (invalid request)
  • Serve a useful error page

Introduction

There’s a bazillion methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If /requestHandler.php exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "sebastians-pamphlets.com";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path add a slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* customize that; maybe you have to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = $_SERVER["HTTP_REFERER"];
$requestUserAgent = $_SERVER["HTTP_USER_AGENT"];
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $_SERVER["PHP_SELF"];
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas on how to block rogue (crappy, misbehaving, …) Web robots. If you need a top-notch nasty-bot-handler, please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as a search engine crawler: */
if ($isCrawler !== TRUE) {
    $crawlerName = NULL;
    $crawlerHost = NULL;
    $crawlerServer = NULL;
    if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = ".crawl.baidu.com";}
    ...
    if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ".googlebot.com";}
    if ($crawlerName != NULL) {
        /* full DNS verification: reverse lookup, then forward-confirm the host name */
        $reverseIp = @gethostbyaddr($requestIp);
        if (!stristr($reverseIp,$crawlerServer)) {
            $isCrawler = FALSE;
        }
        if ("$reverseIp" == "$requestIp") { /* reverse lookup failed */
            $isCrawler = FALSE;
        }
        if ($isCrawler !== FALSE) {
            $chkIpAddyRev = @gethostbyname($reverseIp);
            if ("$chkIpAddyRev" == "$requestIp") {
                $isCrawler = TRUE;
                $crawlerHost = $reverseIp;
                // store the newly discovered crawler IP
            }
        }
    }
}

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler && "$crawlerName" == "Baiduspider") {
    $isRogueBot = TRUE;
}

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve of such scams, block incoming requests from the IP address range 65.52.0.0 to 65.55.255.255 (131.107.0.0 to 131.107.255.255 …) when the referrer is a Bing SERP. With this method you might occasionally block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.
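A quick sketch of that check, reusing the variables from the init block (the bing.com/search referrer test is just my guess at what a Bing SERP referrer looks like; refine it to your needs):

/* flag Bing's stats-spamming bot: IP in the ranges above plus a Bing SERP referrer */
$ipLong = ip2long($requestIp);
$isBingRange =
    ($ipLong >= ip2long("65.52.0.0")   && $ipLong <= ip2long("65.55.255.255")) ||
    ($ipLong >= ip2long("131.107.0.0") && $ipLong <= ip2long("131.107.255.255"));
if ($isBingRange && stristr($requestReferrer, "bing.com/search")) {
    $isRogueBot = TRUE; /* or route them to the friendly GFY page instead */
}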

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example, some bots spam your referrer stats, just in case you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra while viewing your stats. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from roughly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) || stristr($requestUserAgent,"ThingFetcher")) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {
    header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);
    exit;
}

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page it wants you to click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to http://example.com/go-fuck-yourself, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if ("$serverName" != "$canonicalServerName") {
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
    header("Location: $absoluteUri");
    header("X-Canonical-URI: $absoluteUri"); // experimental
    header("Link: <$absoluteUri>; rel=canonical"); // experimental
    exit;
}

Subdomains are so 1999; also, 2010 is the year of non-www URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = "sebastians-pamphlets.com"; statement. I’ll appreciate the traffic. ;)

When the server name is Ok, you should add some basic URI canonicalization routines here. For example add trailing slashes –if necessary–, and remove clutter from query strings.

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$testForUriClutter = $absoluteUri;
if (!empty($_GET)) {
    foreach ($_GET as $var => $crap) {
        if (stristr($var,"utm_")) {
            /* str_replace(search, replace, subject); assumes unencoded parameter values */
            $testForUriClutter = str_replace("?$var=$crap&", "?", $testForUriClutter);
            $testForUriClutter = str_replace("?$var=$crap", "", $testForUriClutter);
            $testForUriClutter = str_replace("&$var=$crap", "", $testForUriClutter);
            $testForUriClutter = str_replace("&amp;$var=$crap", "", $testForUriClutter);
            unset($_GET[$var]);
        }
    }
    $uriPartsSanitized = parse_url($testForUriClutter);
    $qs = $uriPartsSanitized["query"];
    $qs = str_replace("?", "", $qs);
    if ("$qs" != $uriParts["query"]) {
        /* use the request path, not PHP_SELF (that's the request handler itself) */
        $canonicalUri = $scheme .$canonicalServerName .$uriParts["path"];
        if (!empty($qs)) {
            $canonicalUri .= "?" .$qs;
        }
        if (!empty($uriParts["fragment"])) {
            $canonicalUri .= "#" .$uriParts["fragment"];
        }
        header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
        header("Location: $canonicalUri");
        exit;
    }
}

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.
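A minimal sketch of such a log (MySQL via PDO assumed; the table name, its columns, and the unique key on the bad/good URI pair are my own choices):

/* log a bad/good URI pair; requires a unique key on (badUri, goodUri) */
function logUriCanonicalization($pdo, $badUri, $goodUri, $referrer, $scriptName) {
    $sql = "INSERT INTO uriCanonicalizationLog
                (badUri, goodUri, referrer, scriptName, hits, lastUpdate)
            VALUES (:bad, :good, :ref, :script, 1, NOW())
            ON DUPLICATE KEY UPDATE hits = hits + 1, lastUpdate = NOW()";
    $stmt = $pdo->prepare($sql);
    $stmt->execute(array(":bad" => $badUri, ":good" => $goodUri,
                         ":ref" => $referrer, ":script" => $scriptName));
}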


Before you’re going to serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts["query"]) && empty($uriParts["fragment"]) && file_exists("$rootPath$requestUri")) {
    header("Content-type: " .getContentType("$rootPath$requestUri"), TRUE);
    readfile("$rootPath$requestUri");
    exit;
}
/* getContentType($filename) returns a MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */
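The getContentType() helper can be as simple as a file-extension-to-MIME-type map; here’s a minimal sketch (extend the map to whatever static file types you actually serve):

function getContentType($filename) {
    $mimeTypes = array(
        "jpg"  => "image/jpeg", "jpeg" => "image/jpeg",
        "gif"  => "image/gif",  "png"  => "image/png",
        "ico"  => "image/x-icon",
        "css"  => "text/css",   "js"   => "application/javascript",
        "pdf"  => "application/pdf",
        "txt"  => "text/plain", "xml"  => "application/xml"
    );
    $ext = strtolower(substr(strrchr($filename, "."), 1));
    if (isset($mimeTypes[$ext])) {
        return $mimeTypes[$ext];
    }
    return "application/octet-stream"; /* fallback, never an empty string */
}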

If your dynamic stuff mimics static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.
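For example, a dynamic robots.txt could note verified crawler IPs while serving the usual directives; a rough sketch (the Disallow line and the logging call are placeholders):

if ($uriParts["path"] == "/robots.txt") {
    header("Content-type: text/plain", TRUE);
    if ($isCrawler === TRUE) {
        // e.g. remember the verified crawler IP for your crawler-IP database
        // storeCrawlerIp($requestIp, $crawlerName, $crawlerHost); /* placeholder */
    }
    print "User-agent: *\n";
    print "Disallow: /src/\n"; /* e.g. keep the include path from the init block out of the index */
    exit;
}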

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
if (substr($requestUri, 0, 6) == "/blog/") {
    require("$rootPath/blog/index.php");
    exit;
}

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = "404";
$errSrc = "WordPress";
$errMsg = "The blog couldn't make sense out of this request.";
require("$includePath/err.php");
exit;

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should lookup the shortUri table for a matching $requestUri before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the bells and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update.

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned, build the short URI on the fly: $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites). A sketch of the create and resolve methods follows right after this list.

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus
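To make this a bit more tangible, here’s a minimal sketch of the create and resolve methods described above. It assumes the shortUri table as outlined, accessed via PDO/MySQL; the function bodies and the $pdo handle are mine, not a finished implementation:

function createShortUri($uri, $suriShortcut = NULL) {
    global $pdo; /* your PDO handle, e.g. new PDO("mysql:host=localhost;dbname=...", $user, $pass) */
    /* one short URI per original URI is enough: return an existing shortcut */
    $stmt = $pdo->prepare("SELECT suriShortcut FROM shortUri WHERE suriUri = :uri LIMIT 1");
    $stmt->execute(array(":uri" => $uri));
    $existing = $stmt->fetchColumn();
    if ($existing !== FALSE) {
        return $existing;
    }
    $pdo->prepare("INSERT INTO shortUri (suriUri) VALUES (:uri)")->execute(array(":uri" => $uri));
    $suriId = $pdo->lastInsertId();
    if (empty($suriShortcut)) {
        $suriShortcut = base_convert(intval($suriId), 10, 36);
    }
    $pdo->prepare("UPDATE shortUri SET suriShortcut = :sc WHERE suriId = :id")
        ->execute(array(":sc" => $suriShortcut, ":id" => $suriId));
    return $suriShortcut;
}

function resolveShortUri($requestUri) {
    global $pdo;
    $suriShortcut = trim(parse_url($requestUri, PHP_URL_PATH), "/");
    $stmt = $pdo->prepare("SELECT suriUri FROM shortUri WHERE suriShortcut = :sc LIMIT 1");
    $stmt->execute(array(":sc" => $suriShortcut));
    $suriUri = $stmt->fetchColumn();
    if ($suriUri !== FALSE) {
        return $suriUri;
    }
    /* mutilated shortcut (truncated by a social media site)? try a broader query,
       but use the result only if it's unambiguous */
    $stmt = $pdo->prepare("SELECT suriUri FROM shortUri WHERE suriShortcut LIKE :sc LIMIT 2");
    $stmt->execute(array(":sc" => $suriShortcut ."%"));
    $rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
    return (count($rows) == 1) ? $rows[0] : FALSE;
}

resolveShortUri() matches the call used a bit further below; it returns the original URI, or FALSE when the shortcut can’t be resolved unambiguously.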

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
    $httpResponseCode = "404";
    $errSrc = "sUri";
    $errMsg = "Invalid short URI. Shortcut resolves to more than one result.";
    require("$includePath/err.php");
    exit;
}
if (!empty($suriUri)) {
    if (!stristr($suriUri, $canonicalServerName)) {
        header("HTTP/1.1 301 Here you go", TRUE, 301);
        header("Location: $suriUri");
        exit;
    }
}

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resource from a lookup in this mapping table, then do a 301 redirect to the canonical URI.
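A sketch of that lookup (lookupScrewedUriMap() and the underlying table are placeholders for whatever storage you pick):

/* returns the canonical URI for a known screwed URI, or FALSE */
$mappedUri = lookupScrewedUriMap($requestUri);
if ($mappedUri !== FALSE) {
    header("HTTP/1.1 301 Please fix that link", TRUE, 301);
    header("Location: " .$rootUri .$mappedUri); /* $rootUri from the init block; $mappedUri starts with a slash */
    exit;
}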

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.
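A sketch of such a transient redirect (the /product-of-the-week URI and its current destination are made up):

if ($requestUri == "/product-of-the-week") { /* made-up transient URI */
    $featuredUri = "/blog/hottest-post-right-now/"; /* look this up wherever you store it */
    header("HTTP/1.1 307 Temporary Redirect", TRUE, 307);
    header("Location: " .$rootUri .$featuredUri);
    exit;
}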

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.
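Extracting the search query from the referrer could look like this (the /search?q=… URI of the site search, the operator list, and the logging call are assumptions):

/* pull the search query out of a SERP referrer and hand it to the site search */
$searchQuery = "";
$refParts = @parse_url($requestReferrer);
if (!empty($refParts["query"])) {
    parse_str($refParts["query"], $refVars);
    foreach (array("q", "p", "query") as $queryVar) { /* Google/Bing use q, older Yahoo SERPs used p */
        if (!empty($refVars[$queryVar])) { $searchQuery = $refVars[$queryVar]; break; }
    }
}
if (!empty($searchQuery)) {
    $searchQuery = trim(preg_replace('/\b(site|inurl|intitle|url):\S+/i', "", $searchQuery)); /* strip operators */
    logInvalidRequest($requestUri, $requestReferrer, $searchQuery); /* placeholder: feed your mapping table later */
    header("HTTP/1.1 307 Try the site search", TRUE, 307);
    header("Location: " .$rootUri ."/search?q=" .urlencode($searchQuery)); /* adapt to your site search URI */
    exit;
}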

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example, don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occurred, but your error page’s most prominent message should consist of hints on how the visitor can reach the content they were probably looking for.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has output anything to the user yet. With a previous output of just one single byte you can’t send an HTTP response header any more. Hence prefix the header() statement with a ‘@’ to suppress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
    $httpResponseCode = "404";
}
if (empty($httpResponseCode)) {
    $httpResponseCode = "501"; // log for debugging
}
@header("HTTP/1.1 $httpResponseCode Shit happens", TRUE, intval($httpResponseCode));
logHeaderErr(error_get_last());

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).


All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.




Upgrading from IIS/ASP to Apache/PHP

Once you’re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won’t work, and you can’t run MS-Access “databases” and such stuff under Apache.

Here is my idea of a smooth migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster’s portfolio and start over.

(Recently I got a tiny IIS/ASP site about uses & abuses of link condoms and moved it to an Apache server. I’m well known for brutal IIS rants, but so far I haven’t discussed a way out of such a dilemma, so I thought blogging this move could be a good idea.)

I don’t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer’s article Migrating from Microsoft Access/MS-SQL to MySQL, and try tools like ASP to PHP. (With my tiny link condom site I overwrote the ASP code with PHP statements in my primitive text editor.)

From an SEO perspective such an upgrade comes with pitfalls:

  • Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.
  • Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.
  • Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.
  • IIS-URIs are not case sensitive, that means that /Default.asp will 404 on Apache when the filename is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.
  • Most probably search engines have URL variants in their indexes, so we want to adapt their URL canonicalization, at least where possible.
  • HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.
  • If the navigation makes use of relative links, we need to convert those to absolute URLs.
  • Error handling isn’t convertible. Improper error handling can cause decreasing search engine traffic.

Running /default.asp, /home.asp etc. as PHP scripts

When you upload an .asp file to an Apache Web server, most user agents can’t handle it. Browsers treat .asp as an unknown file type and force downloads instead of rendering the pages. And even if you’ve already rewritten the ASP code in PHP, those files won’t be parsed for PHP statements.

To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root:
AddType text/html .asp
AddHandler application/x-httpd-php .asp

The first line says that .asp files shall be treated as HTML documents, and should force the server to send a Content-Type: text/html HTTP header. The second line tells Apache that it must parse .asp files for PHP code.

Just in case the AddType statement above doesn’t produce a Content-Type: text/html header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you’ve mod_headers available, you can accomplish that with this .htaccess code:
<IfModule mod_headers.c>
SetEnvIf Request_URI \.asp is_asp=is_asp
Header set "Content-type" "text/html" env=is_asp
Header set imagetoolbar "no"
</IfModule>

(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)
If for some reason mod_headers doesn’t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top:
@header("Content-type: text/html", TRUE);

Instead of “text/html” alone, you can define the character set too: “text/html; charset=UTF-8”.

Sanitizing the home page URL by eliminating “default.asp”

Instead of slowing down Apache by defining just another default document name (DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp), we get rid of “/default.asp” with this “/index.php” script:
<?php
@require("default.asp");
?>

Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.

Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess:
Redirect 301 /default.asp http://example.com/
Redirect 301 /Default.asp http://example.com/

Converting the ASP code for server name canonicalization

If you find ASP canonicalization routines like
<%@ Language=VBScript %>
<%
if strcomp(Request.ServerVariables("SERVER_NAME"), "www.example.com", vbCompareText) = 0 then
Response.Clear
Response.Status = "301 Moved Permanently"
strNewUrl = Request.ServerVariables("URL")
if instr(1,strNewUrl, "/default.asp", vbCompareText) > 0 then
strNewUrl = replace(strNewUrl, "/Default.asp", "/")
strNewUrl = replace(strNewUrl, "/default.asp", "/")
end if
if Request.QueryString <> "" then
Response.AddHeader "Location","http://example.com" & strNewUrl & "?" & Request.QueryString
else
Response.AddHeader "Location","http://example.com" & strNewUrl
end if
Response.End
end if
%>

(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]

(you need mod_rewrite, that’s usually enabled with the default configuration of Apache Web servers).

Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value

Probably a M$ developer didn’t read more than the scheme and server name chapter of the URL/URI standards, at least I’ve no better explanation for the fact that these clowns made the path and query string segment of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)

M$ may find it funny –contrary to Web standards– to serve the same contents on request of /Home.asp as well as /home.ASP, but such crap doesn’t fly on the World Wide Web. Search engines –and other Web services which store URLs– treat them as different URLs, and consider everything except one version duplicate content.

Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like “/Script.asp”, although the file itself is stored with an all-lowercase name, and the FTP client uploads “/script.asp” to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won’t get 404-Not found responses.

Don’t misunderstand me, you’re free to camel-case file names for improved readability, but then make sure that the file system’s notation matches the URIs in HREF/SRC values. (Of course hyphened file names like “buy-cheap-viagra.asp” top the CamelCased version “BuyCheapViagra.asp” when it comes to search engine rankings, but don’t freak out about keywords in URLs, that’s ranking factor #202 or so.)

Technically speaking, converting all file names, variable names, and values to all-lowercase is the simplest solution. This way it’s quite easy to 301-redirect all invalid requests to the canonical URLs.

However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (MSN Live Search for example doesn’t follow permanent redirects and doesn’t pass the reputation earned by the old URL over to the new URL). So if you’ve good SERP positions for “misspelled” URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site’s server stats to identify all URLs you don’t want to redirect. By the way, the link reports in Google’s Webmaster Console and Yahoo’s Site Explorer reveal invalid URLs with (internal as well as external) inbound links too.

Whatever strategy fits your needs best, you’ve to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive:
ErrorDocument 404 /404handler.php

That’s safe with static URLs without parameters and should work with dynamic URIs too. When you –in some cases– deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /404handler.php [L]
</IfModule>

In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn’t exist.

And here is what the PHP script /404handler.php does; a sketch of it follows after the walk-through below.
(Edit the values in all lines marked with “// change this”.)

This script doesn’t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you’ve more than a dozen or so.

Let’s see what the /404handler.php script does with requests of non-existing files.

First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don’t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn’t rank for anything but illustrates the functionality): both /sample.asp as well as /Sample.asp deliver the same content, although there’s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordingly.

Next we lookup the all lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: /About.asp 301-redirects to /about.asp, which is the file that exists.

Finally, if everything we tried to find a suitable URI for the actual request failed, we send the client a 404 error code and output the error page. Example: /gimme404.asp doesn’t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp, but /error.asp directly requested responds with a 200-OK.
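Based on that walk-through, a minimal sketch of such a handler could look like this (untested; the $rankedUrls map, the base URL, and the error page are placeholders you’d have to adapt):

<?php
/* /404handler.php — stripped-down sketch, not the original script */
$baseUrl = "http://example.com"; // change this
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$requestPath = parse_url($_SERVER["REQUEST_URI"], PHP_URL_PATH);
$queryString = $_SERVER["QUERY_STRING"];

/* 1. "misspelled" URLs with nice rankings: serve the existing script under the requested URL */
$rankedUrls = array( // change this
    "/Sample.asp" => "/sample.asp"
);
if (isset($rankedUrls[$requestPath])) {
    require($rootPath .$rankedUrls[$requestPath]);
    exit;
}

/* 2. if the all-lowercase version of the requested path exists, 301-redirect to it */
$lowerPath = strtolower($requestPath);
if ($lowerPath != $requestPath && file_exists($rootPath .$lowerPath)) {
    header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    header("Location: " .$baseUrl .$lowerPath .($queryString != "" ? "?" .$queryString : ""));
    exit;
}

/* 3. nothing found: send a 404 and display the error page */
header("HTTP/1.1 404 Not Found", TRUE, 404);
require($rootPath ."/error.asp"); // change this
exit;
?>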

You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project’s needs.

Tweaking code for future maintenance

Legacy code comes with repetition, redundancy, and duplication caused by developers who love copy+paste (or rather copy+paste+modify), or by Web design software that generates static files from templates. Even when you’re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, which gives you the opportunity to encapsulate all templated page areas.

Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <?php @include("header.php"); ?>, <?php @include("sidebar.php"); ?>, <?php @include("footer.php"); ?> and so on. Note that when you’ve HTML code in a PHP include file, you must add <?php ?> before the first line of HTML code or contents in included files. Also, leading spaces, empty lines and such which don’t hurt in HTML, can result in errors with PHP statements like header(), because those fail when the server has sent anything to the user agent (even a single space, new line or tab is too much).

It’s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<?php @include("top.php"); ?> […] <?php @include("bottom.php"); ?>) to all scripts. Tomorrow you’ll write a generic routine that you must have in all scripts, and you’ll happily do that in top.php. The day after tomorrow you’ll paste the GoogleAnalytics tracking code into bottom.php. With complex sites you need more hooks.

Using absolute URLs on different systems

Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn’t work with absolute URLs. Crap. Of course it works. The first statement in top.php is
@require($_SERVER["SERVER_NAME"] .".php");

This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you’ve a “dev.example.com.php” include file; on the production system example.com the file name resolves to “example.com.php”:
<?php
$baseUrl = "http://example.com";
?>

Then the menu in sidebar.php looks like:
<?php
$classVMenu = "vmenu";
print "
<img src=\"$baseUrl/vmenuheader.png\" width=\"128\" height=\"16\" alt=\"MENU\" />
<ul>
<li><a class=\"$classVMenu\" href=\"$baseUrl/\">Home</a></li>
<li><a class=\"$classVMenu\" href=\"$baseUrl/contact.asp\">Contact</a></li>
<li><a class=\"$classVMenu\" href=\"$baseUrl/sitemap.asp\">Sitemap</a></li>

</ul>
";
?>

Mixing X/HTML with server-side scripting languages is fault-prone and makes maintenance a nightmare. Don’t make the same mistake as WordPress. Avoid crap like that:
<li><a class="<?php print $classVMenu; ?>" href="<?php print $baseUrl; ?>/contact.asp"></a></li>

Error handling

I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root’s .htaccess file:
ErrorDocument 401 /get-the-fuck-outta-here.asp
ErrorDocument 403 /get-the-fudge-outta-here.asp
ErrorDocument 404 /404handler.php
ErrorDocument 410 /410-gone-forever.asp
ErrorDocument 503 /503-down-for-maintenance.asp
# …
Options -Indexes

Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script:
ErrorDocument 401 /error.php?errno=401
ErrorDocument 403 /error.php?errno=403
ErrorDocument 404 /404handler.php
ErrorDocument 410 /error.php?errno=410
ErrorDocument 503 /error.php?errno=503
# …
Options -Indexes

Note that relative URLs in pages or scripts called by ErrorDocument directives don’t work. Don’t use absolute URLs in the ErrorDocument directives themselves, because this way you get 302 response codes for 404 errors and crap like that. If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang, but that’s bad enough.) For more information please read my pamphlet Why error handling is important.

Last but not least create a robots.txt file in the root. If you’ve nothing to hide from search engine crawlers, this one will suffice:
User-agent: *
Disallow:
Allow: /

I’m aware that this tiny guide can’t cover everything. It should give you an idea of the pitfalls and possible solutions. If you’re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don’t view the source code of link-condom.com pages where I didn’t implement all tips from this tutorial. ;)




Shit happens, your redirects hit the fan!

Although robust search engine crawlers are rather fault-tolerant creatures, there is an often overlooked but quite safe procedure to piss off the spiders. Playing redirect ping pong mostly results in unindexed contents. Google reports chained redirects under the initially requested URL as “URLs not followed” due to redirect errors, and recommends:

Minimize the number of redirects needed to follow a link from one page to another.

The same goes for other search engines, they can’t handle longish chains of redirecting URLs. In other words: all search engines consider URLs involved in longish redirect chains unreliable, not trustworthy, low quality …

What’s that to you? Well, you might play redirect ping pong with search engine crawlers unknowingly. If you’ve ever redesigned a site, chances are you’ve built chained redirects. In most cases those chains aren’t too complex, but it’s worth checking. Bear in mind that Apache, .htaccess, scripts or CMS software and whatnot can perform redirects, often without notice and undetectable with a browser.

I made up this example, but I’ve seen worse redirect chains. Here is the transcript of Ms. Googlebot’s chat with your Web server:

Googlebot: Now that’s a nice link I’ve discovered on this old and trusted page. I can’t wait to fetch it. Hey port 80 at yourstuff.com, would you please be so kind to serve me /some-page?

.htaccess: Oh silly Googlebot, don’t you read Matt’s blog? He told me that a 301 redirect is the canonical answer when someone requests my stuff without the www-prefix. I didn’t bother to lookup the resource you’ve asked for, and why should I since your request is wrong, wrong, wrong! Here is the canonical URL: 301-Moved permanently, Location: http://www.yourstuff.com/some-page.

Googlebot: Dear Web server, of course that’s the best thing to do. I apologize for my silly request, but please note that I’m following a link where the Webmaster of the authority site forgot the dup-dup-dup stuff in the HREF attribute’s value. Can you please serve me /some-page now?

Apache: Doh! Why does this silly bot bother me with an incomplete URI? Since /some-page is not an existing file, probably it’s a directory and a brain dead Webmaster forgot the trailing slash. Let’s see … yep, that’s it, and it even has a default document index.php. Dear Ms. Googlebot, here is the correct URL: 302-Found elsewhere, Location: http://www.yourstuff.com/some-page/index.php.

[Most servers will respond with 301-Moved permanently, Location: http://www.yourstuff.com/some-page/, but however the server is configured, it will be a redirect.]

Googlebot: Oh I see, I’ve got a really malformed URI. Sorry for bothering you again, could you please deliver the content of http://www.yourstuff.com/some-page/index.php? Thanks in advance!

/some-page/index.php: Uuaahhh … who’s knocking? I slept sooo long … not fully awake … please be patient while Apache hands me over to the PHP parser. Actually, I’m totally outdated because my Webmaster discovered some blogging software and moved my contents to a WordPress database. Please wait. Where is the darn post-it with the new address? Ahh, just found it under my pillow. Please don’t disturb me again, bye bye, and here you go: 301-Moved permanently, Location: http://www.yourstuff.com/blog/some-post/.

Googlebot: You’re kidding! Why didn’t you give me this URI in the first place? Never mind, I want http://www.yourstuff.com/blog/some-post/ now.

Hack in /blog/index.php: Oh my god, another bot following outdated links. My Webmaster is clever, he read somewhere that category names in URLs boost search engine rankings and changed the permalink structure to category/postname. Wait a moment, I’m looking up the category for this post, be right back. Hmmm, this category name is not exactly a money term … the heck with it! Here is the keyword stuffed URL I just made for you: 301-Moved permanently, Location: http://www.yourstuff.com/blog/cat/some-post/.

Googlebot: Four crappy redirects in a row, what a bunch of bull! This is my ultimately last try: gimme http://www.yourstuff.com/blog/cat/some-post/!

WordPress: See, that didn’t hurt, did it? Here you are: 200-OK, the content

Googlebot: Here’s your shabby page. Don’t send me to these assclowns again! I’ve better things to do than running five times to this misconfigured server only to fetch a keyword laden sales pitch with 34 grammar errors, 67 typos, and a ton of affiliate links. Grrrr!

Crawl scheduler: Sorry dear. I can’t blacklist them for stuff like that, but I’ve set the crawl frequency for yourstuff.com to once a year, and I’ve red-flagged the document ID so that the indexer can downrank it accordingly.

Do you really want to treat Ms. Googlebot so badly? Not to speak of the minus points you gain for playing redirect ping pong with a search engine. Maybe most search engines index a page served after four redirects, but I won’t rely on such a redirect chain. It’s quite easy to shorten it. Just delete outdated stuff so that all requests run into a 404-Not found, then write up a list in a format like

Old URI 1 Delimiter New URI 1 \n
Old URI 2 Delimiter New URI 2 \n
  … Delimiter   … \n

and write a simple redirect script which reads this file and performs a 301 redirect to New URI when REQUEST_URI == Old URI. If REQUEST_URI doesn’t match any entry, then send a 404 header and include your actual error page. If you need to change the final URLs later on, you can easily do that in the text file’s right column with search and replace.

Next, point the ErrorDocument 404 directive in your root’s .htaccess file to this script. Done. Leaving aside possible www/non-www canonicalization redirects, you’ve reduced the number of redirects to one, regardless of how often you’ve moved your pages. Don’t forget to add all outdated URLs to the list when you redesign your stuff again, and cover common 3rd party sins like truncated trailing slashes too. The flat file from the example above would look like:

/some-page Delimiter /blog/cat/some-post/ \n
/some-page/ Delimiter /blog/cat/some-post/ \n
/some-page/index.php Delimiter /blog/cat/some-post/ \n
/blog/some-post Delimiter /blog/cat/some-post/ \n
/blog/some-post/ Delimiter /blog/cat/some-post/ \n
  … Delimiter   … \n

With a large site consider a database table, processing huge flat files with every 404 error can come with disadvantages. Also, if you’ve patterns like /blog/post-name/ ==> /blog/cat/post-name/ then don’t generate and process longish mapping tables but cover these redirects algorithmically.
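For a small site, the flat file approach is easy enough; here’s a minimal sketch of the redirect script described above (the redirects.txt file name, the tab delimiter, and the error page are my choices):

<?php
/* untested sketch: 404 handler that 301-redirects outdated URIs listed in a flat file */
$mapFile = $_SERVER["DOCUMENT_ROOT"] ."/redirects.txt"; /* one "old URI<tab>new URI" pair per line */
$requestUri = $_SERVER["REQUEST_URI"];
$lines = @file($mapFile);
if ($lines) {
    foreach ($lines as $line) {
        $pair = explode("\t", rtrim($line, "\r\n"));
        if (count($pair) == 2 && $pair[0] == $requestUri) {
            header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
            header("Location: http://www.yourstuff.com" .$pair[1]); /* your canonical server name */
            exit;
        }
    }
}
/* no match: send a 404 header and include the actual error page */
header("HTTP/1.1 404 Not Found", TRUE, 404);
include($_SERVER["DOCUMENT_ROOT"] ."/error-page.php"); /* change this to your error page */
?>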

To gather URLs worth a 301 redirect use these sources:

  • Your server logs.
  • 404/301/302/… reports from your server stats.
  • Google’s Web crawl error reports.
  • Tools like XENU’s Link Sleuth which crawl your site and output broken links as well as all sorts of redirects, and can even check your complete Web space for orphans.
  • Sitemaps of outdated structures/site areas.
  • Server header checkers which follow all redirects to the final destination.

Disclaimer: If you suffer from IIS/ASP, free hosts, restrictive hosts like Yahoo or other serious maladies, this post is not for you.

I’m curious, did your site play redirect ping pong with search engine crawlers?




Getting the most out of Google’s 404 stats

The 404 reports in Google’s Webmaster Central panel are great for debugging your site, but they also contain URLs generated by invalid (or truncated) URL drops or typos of other Webmasters. Are you sick of wasting the link love from invalid inbound links, just because you lack a suitable procedure to 301-redirect all these 404 errors to canonical URLs?

Your pain ends here. At least when you’re on a *ix server running Apache with PHP 4+ or 5+ and .htaccess enabled. (If you suffer from IIS go search another hobby.)

I’ve developed a tool which grabs all 404 requests, letting you map a canonical URL to each 404 error. The tool captures and records 404s, and you can add invalid URLs from Google’s 404-reports, if these aren’t recorded (yet) from requests by Ms. Googlebot.

It’s kind of a layer between your standard 404 handling and your error page. If a request results in a 404 error, your .htaccess calls the tool instead of the error page. If you’ve assigned a canonical URL to an invalid URL, the tool 301-redirects the request to the canonical URL. Otherwise it sends a 404 header and outputs your standard 404 error page. Google’s 404-probe requests during the Webmaster Tools verification procedure are unredirectable (is this a word?).

Besides 1:1 mappings of invalid URLs to canonical URLs you can assign keywords to canonical URLs. For example you can define that all invalid requests go to /fruit when the requested URI or the HTTP referrer (usually a SERP) contain the strings “apple”, “orange”, “banana” or “strawberry”. If there’s no persistent mapping, these requests get 302-redirected to the guessed canonical URL, thus you should view the redirect log frequently to find invalid URLs which deserve a persistent 301-redirect.
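Stripped of all the tool’s bells and whistles, the keyword mapping boils down to something like this (an illustration of the idea only, not the actual script; the map reuses the fruit example from above):

/* illustration: guess a canonical URL from keywords in the requested URI or the referrer */
$keywordMap = array(
    "/fruit" => array("apple", "orange", "banana", "strawberry")
);
$haystack = strtolower($_SERVER["REQUEST_URI"] ." " .@$_SERVER["HTTP_REFERER"]);
foreach ($keywordMap as $canonicalUrl => $keywords) {
    foreach ($keywords as $keyword) {
        if (strpos($haystack, $keyword) !== FALSE) {
            /* a guess, not a persistent mapping: 302 so the redirect log can be reviewed
               and the mapping promoted to a 301 later */
            header("HTTP/1.1 302 Found", TRUE, 302);
            header("Location: http://example.com" .$canonicalUrl);
            exit;
        }
    }
}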

Next there are tons of bogus requests from spambots searching for exploits or whatever, or hotlinkers, resulting in 404 errors, where it makes no sense to maintain URL mappings. Just update an ignore list to make sure those get 301-redirected to example.com/goFuckYourself or a cruel and scary image hosted on your domain or a free host of your choice.

Everything not matching a persistent redirect rule or an expression ends up in a 404 response, as before, but logged so that you can define a mapping to a canonical URL. Also, you can use this tool when you plan to change (a lot of) URLs, it can 301-redirect the old URL to the new one without adding those to your .htaccess file.

I’ve tested this tool for a while on a couple of smaller sites and I think it can get trained to run smoothly without too many edits once the ignore lists etcetera are up to date, that is matching the site’s requisites. A couple of friends got the script and they will provide useful input. Thanks! If you’d like to join the BETA test drop me a message.

Disclaimer: All data get stored in flat files. With large sites we’d need to change that to a database. The UI sucks, I mean it’s usable but it comes with the browser’s default fonts and all that. IOW the current version is still in the stage of “proof of concept”. But it works just fine ;)


