Archived posts from the 'Usability' Category

About time: EU crumbles monster cookies from hell

Some extremely bright bravehearts at the European Union headquarters in Bruxelles finally took the initiative and launched a law to fight the Interweb’s gremlins, that play down their sneaky and destructive behavior behind an innocent as well as inapplicable term: COOKIE.

Back in the good old days when every dog and its fleas used an InternetExplorer to consume free porn, and to read unbalanced left leaning news published by dubious online tabloids based in communist strongholds, only the Vatican spread free cookies to its occasional visitor. Today, every site you visit makes use of toxic cookies, thanks to Google Web Analytics, Facebook, Amazon, eBay and countless smut peddlers.

Not that stone age browsers like IE6 could handle neat 3rd party cookies (that today’s advertising networks use to shove targeted product news down your throat) without a little help from an 1×1 pixel iFrame at all, but that’s a completely other story. The point I want to bring home is: cookies never were harmless at all!

Quite the opposite is true. As a matter of fact, Internet cookies pose as digestible candies, but once swallowed they turn into greedy germs that produce torturous flatulence, charge your credit card with overprized Rolex® replicas and other stuff you really don’t need, and spam all your email accounts for the time being until you actually need your daily dose of Viagra® to handle all the big boobs and stiff enlarged dicks delivered to your inbox.

Now that you’re well informed on the increasing cookie pest, a.k.a. cookie pandemic, I’m dead sure you’ll applaud the EU anti cookie law that’s going to get enforced by the end of May 2012, world-wide. Common sense and experience of life tells us, that indeed local laws can tame the Wild Wild West (WWW).

Well, at least in the UK, so far. That’s quite astonishing by the way, because usually the UK vetoes or boycotts everything EU, until their lowbrow thinking and underpaid lawyers discover that previous governments already have signed some long-forgotten contracts defining EU regulations as binding for tiny North Sea islands, even if they’re located somewhere in the Atlantic Ocean and consider themselves huge.

Anyway, although literally nobody except a few Web savvy UK webmasters (but not even its creators who can’t find their asshole with both hands fumbling in the dark) know what the fuck this outlandish law is all about, we need to comply. For the sake of our unborn children, civic duty, or whatever.

Of course you can’t be bothered with researching all this complex stuff. Unfortunately, I can’t link to authorative sources, because not even the almighty Google told me how alien webmasters can implement a diffuse EU policy that didn’t make it to the code of law of any EU member state yet (except of the above mentioned remote islands, though even those have no fucking clue with regard to reasonable procedures and such). That makes this red crab the authorative source on everything ‘EU cookie law’. Sigh.

So here comes the ultimative guide for this planet’s webmasters who’d like to do business with EU countries (or suffer from an EU citizenship).

Step 1: Obfuscate your cookies

In order to make your most sneaky cookies undetectable, flood your vistor’s computer with a shitload of randomly generated and totally meaningless cookies. Make sure that everything important for advertising, shopping cart, user experience and so on gets set first, because the 1024th and all following cookies face the risk of getting ignored by the user agent.

Do not use meaningful variable names for cookies and decode all values. That is, instead of setting added_not_exactly_willingly_purchased_items_to_shopping_cart[13] = golden_humvee_with_diamond_break_pads just create an unobtrusive cookie like additional_discount_upc_666[13] = round(99.99, 0) + '%'.

Step 2: Ask your visitors for permission to accept your cookies

When a new visitor hits your site, create a hidden popunder window with a Web form like this one:

Of course

Why not

Yes, and don’t ask me again

Yup, get me to the free porn asap

I’ve read the TOS and I absolutely agree


Don’t forget to test the auto-submit functionality with all user agents (browsers) out there. Also, log the visitor’s IP addy, browser version and such stuff. Just in case you need to present it in a lawsuit later on.

Step 3: Be totally honest and explain every cookie to your visitors

Somewhere on a deeply buried TOS page linked from your privacy policy page that’s no-followed across your site with an anchor text formatted in 0.001pt, create an ugly table like this one:

My awesome Web site’s wonderful cookies:
_preg=true This cookie makes you pregnant. Also, it creates an order for 100 diapers, XXS, assorted pink and blue, to be delivered in 9 months. Your PayPal account (taken from a befriended Yahoo cookie) gets charged today.
_vote_rig=conditional If you’ve participated in a poll and your vote doesn’t match my current mood, I’ll email your mother in law that you’re cheating on your spouse. Also, regardless what awkward vote you’ve submitted, I’ll change it in a way that’s compatible with my opinion on the topic in question.
_auto_start=daily Adds my product of the day page to your auto start group. Since I’ve collected your credit card details already, I’m nice enough to automate the purchase process in an invisible browser window that closes after I’ve charged your credit card. If you dare to not reboot your pathetic computer at least once a day, I’ll force an hourly reboot in order to teach you how the cookie crumbles.
_joke=send If you see this cookie, I found a .pst file on your computer. All your contacts will enjoy links to questionable (that is NotSafeAtWork) jokes delivered by email from your account, often.
_boobs=show If you’re a male adult, you’ve just subscribed to my ‘weird boob job’ paysite.
_dicks=show That’s the female version of the _boobs cookie. Also delivered to my gay readers, just the landing page differs a little bit.
_google=provided You were thoughtless enough to surf my blog while logged into your Google account. You know, Google just stole my HTTP_REFERER data, so in revenge I overtook your account in order to gather the personal and very private information the privacy nazis at Google don’t deliver for free any more.
_twitter=approved Just in case you check out your Twitter settings by accident, do not go to the ‘Apps’ page and do not revoke my permissions. The few DMs I’ve sent to all your followers only feed my little very hungry monsters, so please leave my tiny spam operation alone.
_fb=new Heh. You zucker (pronounced sucker) lack a Facebook account. I’ve stepped in and assigned it to my various interests. Don’t you dare to join Facebook manually, I do own your name!
_443=nope Removes the obsolete ’s’ (SSL) from URIs in your browser’s address bar. That’s a prerequisite for my free services, like maintaining a backup of your Web mail as user generated content (UGC) in my x-rated movie blog’s comment area. Don’t whine, it’s only visible to search engine crawlers, so your dirty little secrets are totally safe. Also, I don’t publish emails containing Web site credentials, bank account details and such, because sharing those with my fellow blackhat webmasters would be plain silly.
eol=granted Your right to exist has expired, coz your bank account’s balance doesn’t allow any further abuse. This status is also known als ‘end of life’. Say thanks to the cookie community and get yourself a tombstone as long as you (respectively your clan, coz you went belly up by now) can afford it.

Because I’m somewhat lazy, the list above isn’t made up but an excerpt of my blog’s actual cookies.

As a side note, don’t forget to collect local VAT (different percentages per EU country, depending on the type of goods you don’t plan to deliver across the pond) from your EU customers, and do pay the taxman. If you’ve troubles finding the taxman in charge, ask your offshore bank for assistance.

Have fun maintaining a Web site that totally complies to international laws. And thanks for your time (which you would better have invested in developing a Web site that doesn’t rely on cookies for a great user experience).

Summary: The stupid EU cookie law in 2.5 minutes:

If you still don’t grasp how an Internet cookie really tastes, here is the explanation for the geeky preschooler: RFC 2109.

By the way, this comprehensive tutorial might make you believe that only the UK has implemented the EU cookie law yet. Of course the Brits wouldn’t have the balls to perform such a risky solo stunt, without being accompanied by two tiny countries bordering the Baltic Sea: Denmark and Estonia (don’t even try to find european ministates and islands on your globe without a precision magnifier). As soon as the Internet comes to these piddly shore lines, I’ll report on their progress (frankly, don’t really expect an update anytime soon).

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Geo targeting without IP delivery is like throwing a perfectly grilled steak at a vegan

So Gareth James asked me to blather about the role of IP delivery in geo targeting. I answered “That’s a complex topic with gazillions of ‘depends’ lacking the potential of getting handled with a panacea”, and thought he’d just bugger off before I’ve to write a book published on his pathetic UK SEO blog. Unfortunately, it didn’t work according to plan A. This @seo_doctor dude is as persistent as a blowfly attacking a huge horse dump. He dared to reply “lol thats why I asked you!”. OMFG! Usually I throw insults at folks starting a sentence with “lol”, and I don’t communicate with native speakers who niggardly shorten “that’s” to “thats” and don’t capitalize any letter except of “I” for egomaniac purposes.

However, I didn’t annoy the Interwebz with a pamphlet for (perceived) ages, and the topic doesn’t exactly lacks controversial discussion, so read on. By the way, Gareth James is a decent guy. I’m just not fair making fun out of his interesting question for the sake of a somewhat funny opening. (That’s why you’ve read this pamphlet on his SEO blog earlier.)

How to increase your bounce rate and get your site tanked on search engine result pages with IP delivery in geo targeting

A sure fire way to make me use my browser’s back button is any sort of redirect based on my current latitude and longitude. If you try it, you can measure my blood pressure in comparision to an altitude some light-years above mother earth’s ground. You’ve seriously fucked up my surfing experience, therefore you’re blacklisted back to the stone age, and even a few stones farther just to make sure your shitty Internet outlet can’t make it to my browser’s rendering engine any more. Also, I’ll report your crappy attempt to make me sick of you to all major search engines for deceptive cloaking. Don’t screw red crabs.

Related protip: Treat your visitors with due respect.

Geo targeted ads are annoying enough. When I’m in a Swiss airport’s transit area reading an article on any US news site about the congress’ latest fuck-up in foreign policy, most probably it’s not your best idea to plaster my cell phone’s limited screen real estate with ads recommending Zurich’s hottest brothel that offers a flat rate as low as 500 ‘fränkli’ (SFR) per night. It makes no sense to make me horny minutes before I enter a plane where I can’t smoke for fucking eight+ hours!

Then if you’re the popular search engine that in its almighty wisdom decides that I’ve to seek a reservation Web form of Boston’s best whorehouse for 10am local time (that’s ETA Logan + 2 hours) via in french language, you’re totally screwed. In other words, because it’s not Google, I go search for it at Bing. (The “goto” thingy is not exactly reliable, and a totally obsolete detour when I come by with a cookie.)

The same goes for a popular shopping site that redirects me to its Swiss outlet based on my location, although I want to order a book to be delivered to the United States. I’ll place my order elsewhere.

Got it? It’s perfectly fine with me to ask “Do you want to visit our Swiss site? Click here for its version in French, German, Italian or English language”. Just do not force me to view crap I can’t read and didn’t expect to see when I clicked a link!

Regardless whether you redirect me server sided using a questionable ip2location lookup, or client sided evaluating the location I carelessly opened up to your HTML5 based code, you’re doomed coz I’m pissed. (Regardless whether you do that under one URI, respectively the same URI with different hashbang crap, or a chain of actual redirects.) I’ve just increased your bounce rate in lightning speed, and trust me that’s not just yours truly alone who tells click tracking search engines that your site is scum.

How to fuck up your geo targeting with IP delivery, SEO-wise

Of course there’s no bullet proof way to obtain a visitor’s actual location based on the HTTP request’s IP address. Also, if the visitor is a search engine crawler, it requests your stuff from Mountain View, Redmond, or an undisclosed location in China, Russia, or some dubious banana republic. I bet that as a US based Internet marketer offering local services accross all states you can’t serve a meaningful ad targeting Berlin, Paris, Moscow or Canton. Not that Ms Googlebot appreciates cloaked content tailored for folks residing at 1600 Amphitheatre Parkway, by the way.

There’s nothing wrong with delivering a cialis™ or viagra® peddler’s sales pitch to search engine users from a throwaway domain that appeared on a [how to enhance my sexual performance] SERP for undisclosable reasons, but you really shouldn’t do that (or something similar) from your bread and butter site.

When you’ve content in different languages and/or you’re targeting different countries, regions, or whatever, you shall link that content together by language and geographical targets, providing prominent but not obfuscating links to other areas of your site (or local domains) for visitors who –indicated by browser language settings, search terms taken from the query string of the referring page, detected (well, guessed) location, or other available signals– might be interested in these versions. Create kinda regional sites within your site which are easy to navigate for the targeted customers. You can and should group those site areas by sitemaps as well as reasonable internal linkage, and use other techniques that distribute link love to each localized version.

Thou shalt not serve more than one version of localized content under one URI! If you can’t resist, you’ll piss off your visitors and you’ll ask for troubles with search engines. Most of your stuff will never see the daylight of a SERP by design.

This golden rule applies to IP delivery as well as to any other method that redirects users without explicit agreement. Don’t rely on cookies and such to determine the user’s preferred region or language, always provide visible alternatives when you serve localized content based on previously collected user decisions.

But …

Of course there are exceptions to this rule. For example it’s not exactly recommended to provide content featuring freedom of assembly and expression in fascist countries like Iran, Russia or China, and bare boobs as well as Web analytics or Facebook ‘like’ buttons can get you into deep shit in countries like Germany, where last century nazis make the Internat laws. So sometimes, IP delivery is the way to go.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Cloaking is good for you. Just ignore Bing’s/Google’s guidelines.

Summary first: If you feel the need to cloak, just do it within reason. Don’t cloak because you can, but because it’s technically the most elegant procedure to accomplish a Web development task. Bing and Google can’t detect your (in no way deceptive) intend algorithmically. Don’t spam away, though, because you might leave trails besides cloaking alone, if you aren’t good enough at spamming search engines. Keep your users interests in mind. Don’t comply to search engine guidelines as set in stone, but to a reasonable level, for example when those force you to comply to Web standards that make more sense than the fancy idea you’ve developed on internationalization, based on detecting browser language settings or so.

search engine guidelines are bullshit WRT cloakingThis pamphlet is an opinion piece. The above said should be considered best practice, even by search engines. Of course it’s not, because search engines can and do fail, just like a webmaster who takes my statement “go cloak away if it makes sense” as technical advice and gets his search engine visibility tanked the hard way.

WTF is cloaking?

Cloaking, also known as IP delivery, means delivering content tailored for specific users who are identified primarily by their IP addresses, but also by user agent (browser, crawler, screen reader…) names, and whatnot. Here’s a simple demonstration of this technique. The content of the next paragraph differs depending on the user requesting this page. Googlebot, Googlers, as well as Matt Cutts at work, will read a personalized message:

Dear visitor, thanks for your visit from (

You surely can imagine that cloaking opens a can of worms lots of opportunities to enhance a user’s surfing experience, besides “stalking” particular users like Google’s head of WebSpam.

Why do search engines dislike cloaking?

Apparently they don’t. They use IP delivery themselves. When you’re traveling in europe, you’ll get hints like “go to” or “go to” all the time. That’s checking where you are, trying to lure you into their regional services.

More seriously, there’s a so-called “dark side of cloaking”. Say you’re a seasoned Internet marketer, then you could show Googlebot an educational page with compelling content under an URI like “/games/poker” with an X-Robots-Tag HTTP header telling “noarchive”, whilst surfers (search engine users) supplying an HTTP_REFERER and not coming from get redirected to poker dot com (simplified example).

That’s hard to detect for Google’s WebSpam team. Because they don’t do evil themselves, they can’t officially operate sneaky bots that use for example AOL as their ISP to compare your spider fodder to pages/redirects served to actual users.

Bing sends out spam bots that request your pages “as a surfer” in order to discover deceptive cloaking. Of course those bots can be identified, so professional spammers serve them their spider fodder. Besides burning the bandwidth of non-cloaking sites, Bing doesn’t accomplish anything useful in terms of search quality.

Because search engines can’t detect cloaking properly, not to speak of a cloaking webmaster’s intentions, they’ve launched webmaster guidelines (FUD) that forbid cloaking at all. All Google/Bing reps tell you that cloaking is an evil black hat tactic that will get your site penalized or even banned. By the way, the same goes for perfectly legit “hidden content” that’s invisible on page load, but viewable after a mouse click on a “learn more” widget/link or so.


If your competitor makes creative use of IP delivery to enhance their visitors’ surfing experience, you can file a spam report for cloaking and Google/Bing will ban the site eventually. Just because cloaking can be used with deceptive intent. And yes, it works this way. See below.

Actually, those spam reports trigger a review by a human, so maybe your competitor gets away with it. But search engines also use spam reports to develop spam filters that penalize crawled pages totally automatted. Such filters can fail, and –trust me– they do fail often. Once you must optimize your content delivery for particular users or user groups yourself, such a filter could tank your very own stuff by accident. So don’t snitch on your competitors, because tomorrow they’ll return the favor.

Enforcing a “do not cloak” policy is evil

At least Google’s WebSpam team comes with cojones. They’ve even banned their very own help pages for “cloaking“, although those didn’t serve porn to minors searching for SpongeBob images with safe-search=on.

That’s overdrawn, because the help files of any Google product aren’t usable without a search facility. When I click “help” in any Google service like AdWords, I get either blank pages, and/or links within the help system are broken because the destination pages were deindexed for cloaking. Plain evil, and counter productive.

Just because Google’s help software doesn’t show ads and related links to Googlebot, those pages aren’t guilty of deceptive cloaking. Ms Googlebot won’t pull the plastic, so it makes no sense to serve her advertisements. Related links are context sensitive just like ads, so it makes no sense to persist them in Google’s crawling cache, or even in Google’s search index. Also, as a user I really don’t care whether Google has crawled the same heading I see on a help page or not, as long as I get directed to relevant content, that is a paragraph or more that answers my question.

When a search engine doesn’t deliver the very best search results intentionally, just because those pages violate an outdated and utterly useless policy that rules fraudulent tactics in a shape lastly used in the last century and doesn’t take into account how the Internet works today, I’m pissed.

Maybe that’s not bad at all when applied to Google products? Bullshit, again. The same happens to any other website that doesn’t fit Google’s weird idea of “serving the same content to users and crawlers”. I mean, as long as Google’s crawlers come from US IPs only, how can a US based webmaster serve the same content in German language to a user coming from Austria and Googlebot, both requesting a URI like “/shipping-costs?lang=de” that has to be different for each user because shipping a parcel to Germany costs $30.00 and a parcel of the same weight shipped to Vienna costs $40.00? Don’t tell me bothering a user with shipping fees for all regions in CH/AT/DE all on one page is a good idea, when I can reduce the information overflow to a tailored info of just one shipping fee that my user expects to see, followed by a link to a page that lists shipping costs for all european countries, or all countries where at least some folks might speak/understand German.

Back to Google’s ban of its very own help pages that hid AdSense code from Googlebot. Of course Google wants to see what surfers see in order to deliver relevant search results, and that might include advertisements. However, surrounding ads don’t necessarily obfuscate the page’s content. Ads served instead of content do. So when Google wants to detect ad laden thin pages, they need to become smarter. Penalizing pages that don’t show ads to search engine crawlers is a bad idea for a search engine, because not showing ads to crawlers is a good idea, not only bandwidth-wise, for a webmaster.

Managing this dichotomy is the search engine’s job. They shouldn’t expect webmasters to help them solving their very own problems (maintaining search quality). In fact, bothering webmasters with policies solely put because search engine algos are fallible and incapable is plain evil. The same applies to instruments like rel-nofollow (launched to help Google devaluing spammy links but backfiring enormously) or Google’s war on paid links (as if not each and every link on the whole Internet is paid/bartered for, somehow).

What do you think, should search engines ditch their way too restrictive “don’t cloak” policies? Click to vote: Stop search engines that tyrannize webmasters!


Update 2010-07-06: Don’t miss out on Danny Sullivan’s “Google be fair!” appeal, posted today: Why Google Should Ban Its Own Help Pages — But Also Shouldn’t

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily understand it as a part of my ongoing jihad holy war on URI shorteners.

Clever implementation of an URL shortenerAssuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At the first sight, searching for a free URI shortener script to implement it on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.

So if you’re not running and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the abilitiy to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example when you operate a WordPress blog, you could code nearly everthing below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

To provide enough context to get the advantages of a fully integrated URI shortener, vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:


There’s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If
exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path add a
slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* Customize that, maybe you've to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = $_SERVER["HTTP_REFERER"];
$requestUserAgent = $_SERVER["HTTP_USER_AGENT"];
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $PHP_SELF;
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas how to block rogue (crappy, not behaving, …) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as search engine crawler: */
if ($isCrawler !== TRUE) {
$crawlerName = NULL;
$crawlerHost = NULL;
$crawlerServer = NULL;
if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = "";}
if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ""; }
if ($crawlerName != NULL) {
$reverseIp = @gethostbyaddr($requestIp);
if (!stristr($reverseIp,$crawlerServer)) {
$isCrawler = FALSE;
if ("$reverseIp" == "$requestIp") {
$isCrawler = FALSE;
if ($isCrawler !== FALSE;) {
$chkIpAddyRev = @gethostbyname($reverseIp);
if ("$chkIpAddyRev" == "$requestIp") {
$isCrawler = TRUE;
$crawlerHost = $reverseIp;
// store the newly discovered crawler IP

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler &&
"$crawlerName" == "Baiduspider") {
$isRogueBot = TRUE;

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve such scams, block incoming! from the IP address range to ( to …) when the referrer is a Bing SERP. With this method you occasionally might block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from rougly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) ||

stristr($requestUserAgent,”ThingFetcher”)) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {

header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if (”$serverName” != “$canonicalServerName”) {
header(”HTTP/1.1 301 Please use the canonical URI”, TRUE, 301);
header(”Location: $absoluteUri”);
header(”X-Canonical-URI: $absoluteUri”); //
header("Link: <$absoluteUri>; rel=canonical"); // experimental

Subdomains are so 1999, also 2010 is the year of non-’.www’ URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = ""; statement. I’ll appreciate the traffic. ;)

When the server name is Ok, you should add some basic URI canonicalization routines here. For example add trailing slashes –if necessary–, and remove clutter from query strings.

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$testForUriClutter = $absoluteUri;
if (isset($_GET)) {
foreach ($_GET as $var => $crap) {
if ( stristr($var,”utm_”) ) {
$testForUriClutter = str_replace($testForUriClutter, “&$var=$crap”, “”);
$testForUriClutter = str_replace($testForUriClutter, “&amp;$var=$crap”, “”);

unset ($_GET[$var]);
$uriPartsSanitized = parse_url($testForUriClutter);
$qs = $uriPartsSanitized["query"];
$qs = str_replace($qs, "?", "");
if ("$qs" != $uriParts["query"]) {
$canonicalUri = $scheme .$canonicalServerName .$requestScript;
if (!empty($qs)) {
$canonicalUri .= "?" .$qs;
if (!empty($uriParts["fragment"])) {
$canonicalUri .= "#" .$uriParts["fragment"];
header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
header("Location: $canonicalUri");

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.

Before you’re going to serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts[”query”])) && empty($uriParts[”fragment”])) && file_exists(”$rootPath$requestUri”)) {
header(”Content-type: ” .getContentType(”$rootPath$requestUri”), TRUE);
/* getContentType($filename) returns a
MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */

If your dynamic stuff mimicks static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
if (substring($requestUri, 0, 6) == “/blog/”) {

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = “404″;
$errSrc = “WordPress”;
$errMsg = “The blog couldn’t make sense out of this request.”;

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should lookup the shortUri table for a matching $requestUri before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the balls and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update.

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned build the short URI on-the-fly $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites).

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
$httpResponseCode = “404″;
$errSrc = “sUri”;
$errMsg = “Invalid short URI. Shortcut resolves to more than one result.”;
if (!empty($suriUri))
if (!stristr($suriUri, $canonicalServerName)) {
header(”HTTP/1.1 301 Here you go”, TRUE, 301);
header(”Location: $suriUri”);

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resouce from a lookup in this mapping table, then do a 301 redirect to the canonical URI.

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occured, but your error page’s prominent message should consist of hints how the visitor can reach the estimated contents.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has outputted something to the user. With a previous output of just one single byte you can’t send an HTTP response header. Hence prefix the header() statement with a ‘@’ to supress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
$httpResponseCode = “404″;
if (empty($httpResponseCode)) {
$httpResponseCode = “501″; // log for debugging
@header(”HTTP/1.1 $httpResponseCode Shit happens”, TRUE, intval($httpResponseCode));

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).

All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

The most sexy browsers screw your analytics

Chrome and Safari fuck with the HTTP_REFERERNow that IE is quite unusable due to the lack of websites that support its non-standard rendering, and the current FireFox version suffers from various maladies, more and more users switch to browsers that are supposed to comply to Web standards, such as Chrome, Safari, or Opera.

Those sexy user agents execute client sided scripts in lightning speed, making surfers addicted to nifty rounded corners very very happy. Of course they come with massive memory leaks, but surfers who shut down their browser every once in a while won’t notice such geeky details.

Why is that bad news for Internet marketers? Because Chrome and Safari screw your analytics. Your stats are useless with regard to bookmarkers and type-in traffic. Your referrer stats lack all hits from Chrome/Safari users who have opened your landing page in a new tab or window.

Google’s Chrome and Apple’s Safari do not provide an HTTP_REFERER. (The typo is standardized, too.)

This bug was reported in September 2008. It’s not yet fixed. Not even in beta versions.

Guess from which (optional) HTTP header line your preferred stats tool compiles the search terms to create all the cool keyword statistics? Yup, that’s the HTTP_REFERER’s query string when the visitor came from a search result page (SERP). Especially on SERPs many users open links in new tabs. That means with every searcher switching to a sexy browser your keyword analysis becomes more useless.

That’s not only an analytics issue. Many sites provide sensible functionality based on the referrer (the Web page a user came from), for example default search terms for site-search facilities gathered from SERP-referrers. Many sites evaluate the HTTP_REFERER to prevent themselves from hotlinking, so their users can’t view the content they’ve paid for when they open a link in a new tab or window.

Passing a blank HTTP_REFERER when this information is available to the user agent is plain evil. Of course lots of so-called Internet security apps do this by default, but just because others do evil that doesn’t mean a top-notch Web browser like Safari or Chrome can get away with crap like this for months and years to come.

Please nudge the developers!

Here you go. Post in this thread why you want them to fix this bug asap. Tell the developers that you can’t live with screwed analytics, and that your site’s users rely on reliable HTTP_REFERERs. Even if you don’t run a website yourself, tell them that your favorite porn site bothers you with countless error messages instead of delivering smut, just because WebKit browsers are buggy.

You can test whether your browser passes the HTTP_REFERER or not: Go to this Google SERP. On the link to this post chose “Open link in new tab” (or window) in the context menu (right click over the link). Scroll down.

Your browser passed this HTTP_REFERER: None

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Less is more. Google Chrome is my preferred browser. Here’s why:

Recently I’ve bitched a lot, especially tearing utterly useles microformats that the InterWeb really doesn’t need (rel-nofollow, common tag …). Naturally, those pamplets get noticed as Google search engine bashing. Wait. Of course not everything a search engine company launches is crap. Actually, I do love –and use daily– lots of awesome services provided by Google, Yahoo & Co.

Google Chrome BrowserWhilst some SE engineers –probably due to my endless rants– have unsubscribed from my various you-porn social media streams, others have noticed that there’s also laudatory stuff in a grumpy old fart’s Twitter output, and asked for input. Thank you! Dear Johannes Müller, bypassing your WebForm, here is my greedy Google Chrome wish list (you do know the goodies yourself, hence I skip the cute stuff I should praise).

I’ll focus on functionality that I like or miss as a plain user, but I can’t resist to mention a few geeky thingies upfront. As a developer I do love that Chrome doesn’t die on faulty scripts (or on .htpasswd protected pages during startup with session restore like the current FF … on evaling perfectly valid JavaScript code from a server’s response to an AJAX request that exceeds 50 or 65k like IE …). Also, the debugging facilities are awesome (although I still can’t throw away Firebug and a few more FireFox plug-ins). I very much appreciate Chrome’s partial HTML-5 support, but besides neat video controls I’d love to see it render plain HTML-5 stuff like CITE attributes in Q elements correctly (4.7.3. User agents should allow users to follow such citation links), even when DOCTYPE says HTML 4.x or XHTML. ;)

WebKit is great, but it comes with disadvantages. Try to put radio buttons in a SPAN or DIV element with CSS controlling horizontal/vertical appearance as well as special label formats –instead of a RADIO-GROUP– and you’re toast. FF can handle that. Or set the MULTIPLE attribute of a SELECT element to FALSE (instead of ommitting it for combo-boxes) and you’ll suffer from select lists that you just can’t handle as a user, because WebKit (as well as other layout engines!) doesn’t render the element as a drop down list any more. Of course that’s non-standard coding, but stuff like that isn’t really uncommon on the Web. Just because other layout engines handle crap like this equally wrong, that doesn’t mean that the WebKit version used by Google Chrome must come with the same maladies, right?

What totally annoys me is that on the WordPress /wp-admin/post.php page the plus icons of “Post Slug” or “Post Status” just don’t work with Chrome. That means I’ve to fire up FF only to type in a value in a form field that Google Chrome sneakily hides from me. Nasty. Really nasty. Don’t tell me that I’m using an outdated WordPress version. I do know that, but I won’t upgrade because WP 0.87 (beta) perfectly fits my needs.

Ok, what do I like as a user? Google Chrome is lean and very easy to use, it eats less memory than any other browser I allow on my machines, and it executes JavaScript as well as nifty rounded corners amazingly fast. Because –at least with the naked version– I can’t install a gazillion of add-ons, I usually see complete landing pages rendered — instead of just the H1, an advertisement, and the very first P element along with an 1/6 clip of an image or video, because all the FF toolbars occupy nearly 3/4 of the browser window’s height. Try FireFox with a few plug-ins vs. Chrome on a machine running 1024*768 (not that unusual when traveling) and you’re convinced in a fraction of a nanosecond.

Now that I’ve completely switched to Chrome, at least at home (at work I have to test my stuff with everything except IE because that’s a not supportable user agent), I preferably sooner than later do want the FireFox nuggets. Dear Google Chrome developers, please find a way to extract the most wanted stuff from FF plug-ins. You can implement those as right-click popup menus, as well as an one-line toolbar (not stealing too much screen real estate), or both, or otherwise. It’s not too hard to detect that a user has a delicious or stumble-upon account (you read the cookies anyway …). You easily could show icons for the core functionality of such services, along with context sensitive menus enabling the whole functionality of a particular service as provided with overcrowded toolbars in other browsers. Examples:

Delicious  An icon “Remember this” to submit a page to delicious is enough, when “my delicious” and so on is available via context menu.
StumbleUpon  The same goes for StumbleUpon. Two icons, thumbs-up and thumbs-down, would provide 99% of the functionality I need quite often. Ok, my thumbs-down votes are rare, so you can even dump the second one.
TinyUrl  How cool would it be to create a tiny URI for the current tab with just one click?
PrefBar checkboxes  Next up, please feel somewhat challenged by PrefBar, an instrument I really can’t miss on the long haul.PrefBar combo-boxes 
Switching user agent strings, faking referrers, checking out Web pages without cookies, JavaScript and so on is a must have. Ok, I admit that’s geek stuff, so take it as an example transferable to some girlish stuff I refuse to recognize in my monster’s Web browsers.
Twitter  Also, let’s not forget Twitter, blogstuff and whatnot.
Imagine your preferred services, iconized in a one-line toolbar configurable compiled from single items of various 3rd party toolbars available on the InterWeb (of course you should enable Google Toolbar icons too). How cool would that be, in comparisation to the bookmarklets I must live with now?

Google Chrome bookmarklets


Context-menu stuff like “image properties” et cetera –as well known from other browsers– would be very helpful too. “Inspect element” is really neat and informative (for geeks), but way too complicated for the average user.

Another issue is Chrome’s lack of “Babylon functionality”. I want to configure my native language as well as a preferred language (read that as “at least one“). Say I’ve set native language to de-DE and preferred language to en-US, then when hovering a word or phrase on any Web object, I want to see a tooltip displaying the english translation from whatever gibberish the Web page is written in (of course for english text I’d expect the german translation); and when I select a piece of text I want to read the german (english) translation on right-click:translate in a popup dialog that allows copying to the clipboard as well as changing languages. I know you’ve the technology at your hands.

Oh, and please disable the defaulted DNS caching, that’s a royal PITA when you mostly consume dynamic contents because lots of previously visited URIs get displayed as error messages. Also, “reload” should pull images again, replacing their cached copies; right-click:reload should reposition to the current viewpoint.

I’d like to have “project windows”, that is on-demand Chrome windows loaded with particular tabs with URIs I’ve previouisly saved from a window under a project name. Those shouldn’t come up when I’ve set “load previous session at start-up”, but only when I want to restore such a window.

After a quite longish test phase I’d say that Google Chrome’s advantages beat the lack of functionality with ease. Pretty often the snipping of a particular commonly supplied feature (like search boxes in toolbars) dramatically enhances Chrome’s usability. Chrome’s KISS approach kicks ass. And I see it evolve.

Now that you’ve read my appraisal and suggestions, please consider picking a few items from my t-shirt wish list. You know, I’ve promised to link out to everybody sending me a (geeky|pornographic|funny|) XX(X)L t-shirt that I really like. ;) Just in case you’re not the type of reader who buys the author of a pamphlet a t-shirt, please subscribe to my RSS feed. Thanks.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Why storing URLs with truncated trailing slashes is an utterly idiocy

Yahoo steals my trailing slashesWith some Web services URL canonicalization has a downside. What works great for major search engines like Google can fire back when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog’s reputation at Technorati.

In fact the problem is not your URL canonicalization, e.g. 301 redirects from to respectively to, but crappy software that removes trailing forward slashes from your URLs.

Dear Web developers, if you really think that home page locations respectively directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “” as anchor text makes any sense when the URL in HREF points to “”). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes. Many usages of invalid URLs are capable to damage the search engine rankings of the link destinations. You can’t control that, hence don’t screw our URLs. Never. Period.

Folks who don’t agree with the above said read on.


  • What is a trailing slash? About URLs, directory URIs, default documents, directory indexes, …
  • How to rescue stolen trailing slashes About Apache’s handling of directory requests, and rewriting respectively redirecting invalid directory URIs in .htaccess as well as in PHP scripts.
  • Why stealing trailing slashes is not cool Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.
  • How URL canonicalization irritates Technorati 301 redirects that “add” a trailing slash to directory URLs, respectively virtual URIs that mimic directories, seem to irritate Technorati so much that it can’t compute reputation, recent post lists, and so on.

What is a trailing slash?

The Web’s standards say (links and full quotes): The trailing path segment delimiter “/” represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite “should” as “must”.)

To understand that, lets look at the most common URL components:
scheme:// server-name.tld /path ?query-string #fragment
The (red) path part begins with a forward slash “/” and must consist of at least one byte (the trailing slash itself in case of the home page URL

If an URL ends with a slash, it points to a directory’s default document, or, if there’s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because “/” after the TLD (.com|net|org|…) stands for the root directory.

Automated directory indexes (a list of links to all files) should be forbidden, use Options -Indexes in .htaccess to send such requests to your 403-Forbidden page.

In order to set default file names and their search sequence for your directories use DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php. In this example: on request of Apache will first look for /directory/index.html, then if that doesn’t exist for /directory/index.htm, then /directory/index.php, and if all that fails, it will serve an error page (that should log such requests so that the Webmaster can upload the missing default document to /directory/).

The URL (without the trailing slash) is invalid, and there’s no specification telling a reason why a Web server should respond to it with meaningful contents. Actually, the location points to Null  (nil, zilch, nada, zip, nothing), hence the correct response is “404 - we haven’t got ‘nothing to serve’ yet”.

The same goes for sub-directories. If there’s no file named “/dir”, the URL points to Null too. If you’ve a directory named “/dir”, the canonical URL either points to a directory index page (an autogenerated list of all files) or the directory’s default document “index.(html|htm|shtml|php|…)”. A request of –without the trailing slash that tells the Web server that the request is for a directory’s index– resolves to “not found”.

You must not reference a default document by its name! If you’ve links like you can’t change the underlying technology without serious hassles. Say you’ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you’ll realize that static stuff sucks, hence you’ll develop a dynamic site with PHP. You’ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you’ve coded your internal links as etc. they’ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from “… index.html … index.php …” to “… index.php … index.html …”. (Of course you can configure Apache to parse .html files for PHP code, but that’s another story.)

It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of “index” files (intentionally or not). I’d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes.

How to rescue stolen trailing slashes

Since Web servers know that users are faulty by design, they jump through a couple of resource burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values …) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar.

With Apache, DirectorySlash On enables this behavior (check whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules:
RewriteEngine on
RewriteBase /content/
RewriteRule ^dir1$ [R=301,L]
RewriteRule ^dir2$ [R=301,L]

With content management systems (CMS) that generate virtual URLs on the fly, often there’s no other chance than hacking the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, you’ll perform permanent redirects (301).

Here is a WordPress (header.php) example:
$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$doRedirect = FALSE;
$fileExtensions = array(".html", ".htm", ".php");
$serverName = $_SERVER["SERVER_NAME"];
$canonicalServerName = $serverName;
// if you prefer* URLs remove the "www.":
$srvArr = explode(".", $serverName);
$canonicalServerName = $srvArr[count($srvArr) - 2] ."." .$srvArr[count($srvArr) - 1];
$url = parse_url ("http://" .$canonicalServerName .$requestUri);
$requestUriPath = $url["path"];
if (substr($requestUriPath, -1, 1) != "/") {
$isFile = FALSE;
foreach($fileExtensions as $fileExtension) {
if ( strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension) ) {
$isFile = TRUE;
if (!$isFile) {
$requestUriPath .= "/";
$doRedirect = TRUE;
$canonicalUrl = "http://" .$canonicalServerName .$requestUriPath;
if ($queryString) {
$canonicalUrl .= "?" . $queryString;
if ($url["fragment"]) {
$canonicalUrl .= "#" . $url["fragment"];
if ($doRedirect) {
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: $canonicalUrl");

Check your permalink settings and edit the values of $fileExtensions and $canonicalServerName accordingly. For other CMSs adapt the code, perhaps you need to change the handling of query strings and fragments. The code above will not run under IIS, because it has no REQUEST_URI variable.

Why stealing trailing slashes is not cool

This section expressed in one sentence: Cool URLs don’t change, hence changing other people’s URLs is not cool.

Folks should understand the “U” in URL as unique. Each URL addresses one and only one particular resource. Technically spoken, if you change one single character of an URL, the altered URL points to a different resource, or nowhere.

Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.

Yahoo'ish fools steal our trailing slashesOnly a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren’t much different. If somebody hands out an URL (with trailing slash), then use it as is, or don’t use it at all. Don’t “prettify” it, because any change destroys its serviceability.

If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (brower, screen reader, bot) with a redirect header telling that one must use a trailing slash, then the user agent has to re-issue the request in the formally correct way. From a Webmaster’s perspective, burning resources that thoughtlessly is plain theft. From a user’s perspective, things will often work without the slash, but they’ll be quicker with it. “Often” doesn’t equal “always”:

  • Some Web servers will serve the 404 page.
  • Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.
  • Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.
  • Some Web servers will serve identical contents for /dir and /dir/, that leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to all known URL variants, instead of accumulated rankings to both URLs (which would be the right thing to do, but is technically, well, challenging).
  • Some user agents can’t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect’s “error message”, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang up respectively process garbage.

Does it really make sense to maliciously manipulate URLs just because some clueless developers say “dude, without the slash it looks way cooler”? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach.

KISS (keep it simple, stupid) is a great principle. “Cosmetic corrections” like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don’t change their inputs. As long as the input is processible, they accept it, otherwise they skip it.


URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - “” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than a visitor who doesn’t type in the canonical URL including scheme, www-prefix, and trailing slash.

How URL canonicalization seems to irritate Technorati

Due to the not exactly responsively (respectively swamped) Technorati user support parts of this section should be interpreted as educated speculation. Also, I didn’t research enough cases to come to a working theory. So here is just the story “how Technorati fails to deal with my blog”.

When I moved my blog from blogspot to this domain, I’ve enhanced the faulty WordPress URL canonicalization. If any user agent requests it gets redirected to Invalid post/page URLs like redirect to All redirects are permanent, returning the HTTP response code “301″.

I’ve claimed my blog as, but Technorati shows its URL without the trailing slash.
…<div class="url"><a href=""></a> </div> <a class="image-link" href="/blogs/"><img …

By the way, they forgot dozens of fans (folks who “fave’d” either my old blogspot outlet or this site) too.
Blogs claimed at Technorati

I’ve added a description and tons of tags, that both don’t show up on public pages. It seems my tags were deleted, at least they aren’t visible in edit mode any more.
Edit blog settings at Technorati

Shortly after the submission, Technorati stopped to adjust the reputation score from newly discovered inbound links. Furthermore, the list of my recent posts became stale, although I’ve pinged Technorati with every update, and technorati received my update notifications via ping services too. And yes, I’ve tried manual pings to no avail.

I’ve gained lots of fresh inbound links, but the authority score didn’t change. So I’ve asked Technorati’s support for help. A few weeks later, in December/2007, I’ve got an answer:

I’ve taken a look at the issue regarding picking up your pings for “”. After making a small adjustment, I’ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.

Please let us know if you experience any problems in the future. Do not hesitate to contact us if you have any other questions.

Indeed, Technorati updated the reputation score from “56″ to “191″, and refreshed the list of posts including the most recent one.

Of course the “small adjustment” didn’t persist (I assume that a batch process stole the trailing slash that the friendly support person has added). I’ve sent a follow-up email asking whether that’s a slash issue or not, but didn’t receive a reply yet. I’m quite sure that Technorati doesn’t follow 301-redirects, so that’s a plausible cause for this bug at least.

Since December 2007 Technorati didn’t update my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the reactions page - by the way these numbers are often unreal and change in the range of hundreds from day to day).
Blog reactions and authority scoring at Technorati

It seems Technorati didn’t index my posts since then (December/18/2007), so probably my outgoing links don’t count for their destinations.
Stale list of recent posts at Technorati

(All screenshots were taken on February/05/2008. When you click the Technorati links today, it could hopefully will look differently.)

I’m not amused. I’m curious what would happen when I add
if (!preg_match("/Technorati/i", "$userAgent")) {/* redirect code */}

to my canonicalization routine, but I can resist to handle particular Web robots. My URL canonicalization should be identical both for visitors and crawlers. Technorati should be able to fix this bug without code changes at my end or weeky support requests. Wishful thinking? Maybe.

Update 2008-03-06: Technorati crawls my blog again. The 301 redirects weren’t the issue. I’ll explain that in a follow-up post soon.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Comment rating and filtering with SezWho

I’ve added SezWho to the comment area. SezWho enables rating and filtering of comments, and shows you even comments an author has left on other blogs. Neat.

Currently there are no ratings, so the existing comments are all rated 2.5 (quite good). Once you’ve rated a few comments, you can suppress all lower quality comments (rated below 3), or show high quality comments only (rated 4 or better).

Don’t freak out when you use CSS that highlights nofollow’ed links. SezWho manipulates the original (mostly dofollow’ed) author link with JavaScript, hence search engines still recognize that a link shall pass PageRank and anchor text. (I condomize some link drops, for example when I don’t know a site and can’t afford the time to check it out, see my comments policy.)

I’ll ask SezWho to change that when I’m more familiar with their system (I hate change requests based on a first peek). SezWho should look at the attributes of the original link in order to add rel=”nofollow” to the JS created version only when the blogger actually has condomized a particular link. Their software changes the comment author URL to a JS script that redirects visitors to the URL the commenter has submitted. It would be nice to show the original URL in the status bar on mouse over.

Also, it seems that when you sign up with SezWho, they remove the trailing slash from your blog’s URL. That’s not acceptable. I mean not every startup should do what clueless Yahoo developers still do although they know that it violates several Web standards. Removing trailing slashes from links is not cool, that’s a crappy manipulation that can harm search engine rankings, will lead to bandwidth theft when bots follow castrated links only to get redirected, … ok, ok, ok … that’s stuff for another post rant. Judging from their Web site, SezWho looks like a decent operation, so I’m sure they can change that too.


SezWho sidebar widget

I’ve not yet added the widgets, above is how they would appear in the sidebar.


I consider SezWho useful. All functionality lives in the blog and can access the blog’s database, so in theory it doesn’t slow down the page load time by pulling loads of data from 3rd party sources. Please let me know whether you like it or not. Thanks!

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

One out of many sure-fire ways to avoid blog comments

ranting on idiotic comment form designsIf your name is John Doe and you don’t blog this rant is not for you, because you don’t suffer from truncated form field values. Otherwise check here whether you annoy comment authors on your blog or not. “Annoy” is the polite version by the way, I’m pissed on 99% of the blogs I read. It took me years to write about this issue eventually. Today I had enough.

Look at this form designed especially for John Doe ( at, then duplicated onto all blogs out there, and imagine you’re me going to comment on a great post:

I can’t view what I’ve typed in, and even my browser’s suggested values are truncated because the input field is way too narrow. Sometimes I leave post-URLs with a comment, so when I type in the first characters of my URL, I get a long list of shortened entries from which I can’t select anything. When I’m in a bad mood I swear and surf on without commenting.

I’ve looked at a fair amount of WordPress templates recently, and I admit that crappy comment forms are a minor issue with regard to the amount of duplicated hogwash most theme designers steal from each other. However, I’m sick of crappy form usability, so I’ve changed my comment form today:

Now the input fields should display the complete input values in most cases. My content column is 500 pixels wide, so size="42" leaves enough space when a visitor surfs with bigger fonts enlarging the labels. If with very long email addresses or URLs that’s not enough, I’ve added title attributes and onchange triggers which display the new value as tooltip when the visitors navigates to the next input field. Also I’ve maxed out the width of the text area. I hope this 60 seconds hack improves the usability of my comment form.

When do you fire up your editor and FTP client to make your comment form convenient? Even tiny enhancements can make your visitors happier.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Free WordPress Add-on: Categorized Sitemaps

In How to feed all posts on a WordPress blog with link love I’ve outlined a method to create short and topically related paths to each and every post even on a large blog. Since not every blogger is PHP savvy enough to implement the concept, some readers asked me to share the sitemaps script.

Ok, here it is. It wasn’t developed as a plugin, and I’m not sure that’s possible (actually, I didn’t think about it), but I’ll do my best to explain the template hacks necessary to get it running smoothly. Needless to say it’s a quick hack and not exactly elegant, however it works here with WordPress 2.2.2. Use it as is at your own risk, yada yada yada, the usual stuff.

I’m a link whore, so please note: If you implement my sitemap script, please link out to any page on my blog. The script inserts a tiny link at the bottom of the sitemap. If you link to my blog under credits, powered by, in the blogroll or whereever, you can remove it. If you don’t link, the search engines shall ban you. ;)


You should be able to do guided template hacks.

You need a WordPress plugin that enables execution of PHP code within the content of posts and pages. Install one from the list below and test it with a private post or so. Don’t use the visual editor and deactivate the “WordPress should correct invalidly nested XHTML automatically” thingy in Options::Writing. In the post editor write something like
Q: Does my PHP plugin work?
print "A: Yep, It works.";
and check “enable PHP on this page” (labels differ from plug-in to plug-in), save and click preview. If you see the answer it works. Otherwise try another plug-in:

(Maybe you need to quote your PHP code with special tags like <phpcode></phpcode>, RTFM.)

Consider implementing my WordPress-SEO tweaks to avoid unnecessary code changes. If your permalink structure is not set to custom /%postname%/ giving post/page URLs like you need to tweak my code a little. Not that there’s such a thing as a valid reason to use another permalink structure …


Don’t copy and paste PHP code from this page, it might not work because WordPress prettifies quotes etcetera. Everything you need is on the download page.


Copy list_categories.php to your template directory /wp-content/themes/yourtemplatename/ on your local disk and upload it to your server.

Create a new page draft, title it “Category Index” or so, and in page content put
<?php @include(TEMPLATEPATH . "/list_categories.php"); ?>
then save and preview it. You should see a category links list like this one. Click the links, check whether the RSS icons show or not, etcetera.

If anything went wrong, load list_categories.php with your preferred editor (not word processor!). Scroll down to edit these variables:
// Customize if necessary:
//$blogLocaction = “”;
// “”, “” …
// without “http://” and no trailing slash!
//$rssIconPath = “/img/feed-icon-16×16.gif”;
// get a 16*16px rss icon somewhere and upload it you your server,
// then change this path which is relative to the domain’s root.
$rssIconWidth = 16;
$rssIconHeight = 16;
If you edit a variable, remove its “//“. If you use the RSS icon delivered with WordPress, change width and height to 14 pixels. Save the file, upload it to your server, and test again.

If you use Feedburner then click the links to the category feeds, Feedburner shouldn’t redirect them to your blog’s entries feed. I’ve used feed URLs which the Feedburner plug-in doesn’t redirect, but if the shit hits the fan search for the variable $catFeedUrl and experiment with the category-feed URLs.

Your sitemap’s URL is (respectively or so when the sitemap has a parent page).

In theory you’re done. You could put a link to the sitemap in your sidebar and move on. In reality you want to prettify it, and you want to max out the SEO effects. Here comes the step by step guide to optimized WordPress sitemaps / topical hubs.

Category descriptions

On your categorized sitemap click any “[category-name] overview” link. You land on a page listing all posts of [category-name] under the generic title “Category Index”, “Sitemap”, or whatever you’ve put in the page’s title. Donate at least a description. Your visitors will love that and when you install a meta tag plugin the search engines will send a little more targeted traffic because your SERP listings look better (sane meta tags don’t boost your rankings but should improve your SERP CTR).

On your dashboard click Manage::Categories and write a nice but keyword rich description for each category. When you reference other categories by name my script will interlink the categories automatically, so don’t put internal links. Now the category links lists (overview pages) look better and carry (lots of) keywords.

The sitemap URL above will not show the descriptions (respectively only as tooltip), but the topical mini-hubs linked as “overview” (category links lists) have it. Your sitemap’s URL with descriptions is ( or so when the sitemap has a parent page).

If you want to put a different introduction or footer depending on the appearance of descriptions you can replace the code in your page by:
// introduction:
if (strtoupper($_GET["definitions"]) == "TRUE") {
print "<p><strong>All categories with descriptions.</strong> (Example)</p>”;
else {
if (!isset($_GET[”cat”])) {
print “<p><strong>All categories without descriptions.</strong> (Example)</p>”;
@include(TEMPLATEPATH . “/list_categories.php”);
// footer as above
(If you use quotes in the print statements then prefix them with a slash, for example: print "<em>yada \"yada\" <a href=\"url\" title=\"string\">yada</a></em>."; will output yada “yada” yada.)

Title tags

The title of the page listing all categories with links to the category pages and feeds is by design used for the category links pages too. WordPress ignores input parameters in URLs like

To give each category links list its own title tag, replace the PHP code in the title tag. Edit header.php:
// 1. Everything:
$pageTitle = wp_title(“”,false);
if (empty($pageTitle)) {
$pageTitle = get_bloginfo(”name”);
$pageTitle = trim($pageTitle);
// 2. Dynamic category pages:
$input_catName = trim($_GET[”cat”]);
if ($input_catName) {
$input_catName = ucfirst($input_catName);
$pageTitle = $input_catName .” at ” .get_bloginfo(”name”);
// 3. If you need a title depending on the appearance of descriptions
$input_catDefs = trim($_GET[”definitions”]);
if ($input_catDefs) {
$pageTitle = “All tags explained by ” .get_bloginfo(”name”);
print $pageTitle;

The first statements just fix the obscene prefix crap most template designers are obsessed about. The second block generates page titles with the category name in it for the topical hubs (if your category slugs and names are identical). You need 1. and 2.; 3. is optional.

Page headings

Now that you’ve neat title tags, what do you think about accurate headings on the category hub pages? To accomplish that you need to edit page.php. Search for a heading (h3 or so) displaying the_title(); and replace this function by:
<h3 class=”entrytitle” id=”post-<?php the_ID(); ?>”> <a href=”<?php the_permalink() ?>” rel=”bookmark”>
// 1. Dynamic category pages
$input_catName = trim($_GET[”cat”]);
if ($input_catName) {
$input_catName = ucfirst($input_catName);
$dynTitle = “All Posts Tagged ‘” .$input_catName .”‘”;
// 2. If you need a heading depending on the appearance of descriptions
$input_catDefs = trim($_GET[”definitions”]);
if ($input_catDefs) {
$dynTitle = “All tags explained”;
// 3. Output the heading
if ($dynTitle) print $dynTitle; else the_title();

(The surrounding XHTML code may look different in your template! Replace the PHP code leaving the HTML code as is.)

The first block generates headings with the category name in it for the topical hubs (if your category slugs and names are identical). The last statement outputs either the hub’s heading or the standard title if the actual page doesn’t belong to the script. You need 1. and 3.; 2. is optional.

Feeding the category hubs

With most templates each post links to the categories its tagged with. Besides the links to the category archive pages you want to feed your hubs linking to all posts of each category with a little traffic and topical link juice. One method to accomplish that is linking to the category hubs below the comments. If you don’t read this post on the main page or an archive page, click here for an example. Edit single.php, a line below the comments_template(); call insert something like that:
<br />
<p class="post-info" id="related-links-lists">
<em class="cat">Find related posts in
$catString = "";
foreach((get_the_category()) as $catItem) {
if (!empty($catString)) $catString .= ", ";
$catName = $catItem->cat_name;
$catSlug = $catItem->category_nicename;
$catUrl = ""
$catString .= "<a href=\"$catUrl\">$catName</a>";
} // foreach
print $catString;
(Study your template’s “post-info” paragraph and ensure that you use the same class names!)

Also, if your descriptions are of glossary quality, then link to your category hubs in your posts. Since most of my posts are dull as dirt, I decided to make the category descriptions an even duller canonical SEO glossary. It’s up to you to become creative and throw together something better, funnier, more useful … you get the idea. If you blog in english and you honestly believe your WordPress sitemap is outstanding, why not post it in the comments? Links are dofollowed in most cases. ;)


Test everything before you publish the page and link to the sitemaps.

If you have category descriptions and on the sitemap pages links to other categories within the description are broken: Make sure that the sitemap page’s URL does not contain the name or slug of any of your categories. Say the page slug is “sitemaps” and “links” is the parent page of “sitemaps” (URL: /links/sitemaps/), then you must not have a category named “links” nor “sitemaps”. Since a “sitemap” category is somewhat unusual, I’d say serving the sitemaps on a first level page named “sitemap” is safe.


I hope this post isn’t clear as mud and everybody can install my stuff without hassles. However, every change of code comes with pitfalls, and I can’t address each and every possibility, so please backup your code before you change it, or play with my script in a development system. I can’t provide support but I’ll try to reply to comments. Have fun at your own risk! ;)

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

  1 | 2  Next Page »