Archived posts from the 'Site-Search' Category

How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily read it as part of my ongoing holy war on URI shorteners.

[Image: Clever implementation of an URL shortener]

Assuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At first sight, searching for a free URI shortener script and implementing it on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and the search engine traffic generated by social media as well.

So if you’re not running cheap-student-loans-with-debt-consolidation-on-each-payday-is-a-must-have-for-sexual-heroes-desperate-for-a-viagra-overdose-and-extreme-penis-length-enhancement.info and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the ability to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example, when you operate a WordPress blog, you could code nearly everything below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

To provide enough context to get the advantages of a fully integrated URI shortener vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:

Introduction

There’s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If /requestHandler.php exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "sebastians-pamphlets.com";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path, add a slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* Customize that, maybe you have to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = $_SERVER["HTTP_REFERER"];
$requestUserAgent = $_SERVER["HTTP_USER_AGENT"];
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $_SERVER["PHP_SELF"];
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas on how to block rogue (crappy, misbehaving, …) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as a search engine crawler: */
if ($isCrawler !== TRUE) {
    $crawlerName = NULL;
    $crawlerHost = NULL;
    $crawlerServer = NULL;
    if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = ".crawl.baidu.com";}
    ...
    if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ".googlebot.com";}
    if ($crawlerName != NULL) {
        $reverseIp = @gethostbyaddr($requestIp);
        if (!stristr($reverseIp,$crawlerServer)) {
            $isCrawler = FALSE;
        }
        if ("$reverseIp" == "$requestIp") {
            $isCrawler = FALSE;
        }
        if ($isCrawler !== FALSE) {
            $chkIpAddyRev = @gethostbyname($reverseIp);
            if ("$chkIpAddyRev" == "$requestIp") {
                $isCrawler = TRUE;
                $crawlerHost = $reverseIp;
                // store the newly discovered crawler IP
            }
        }
    }
}

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler && "$crawlerName" == "Baiduspider") {
    $isRogueBot = TRUE;
}

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve of such scams, block incoming requests from the IP address range 65.52.0.0 to 65.55.255.255 (respectively 131.107.0.0 to 131.107.255.255 …) when the referrer is a Bing SERP. With this method you occasionally might block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.
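
A rough sketch of such a check could look like this. Mind that ipInRange() and $isBingSerpReferrer are illustrative names of mine, not something defined elsewhere in this pamphlet:
/* Illustrative sketch only: ipInRange() is a made-up helper */
function ipInRange($ip, $rangeStart, $rangeEnd) {
    $ipLong = ip2long($ip);
    return ($ipLong !== FALSE
        && $ipLong >= ip2long($rangeStart)
        && $ipLong <= ip2long($rangeEnd));
}
$isBingSerpReferrer = stristr($requestReferrer, "bing.com/search") ? TRUE : FALSE;
if ($isBingSerpReferrer
    && (ipInRange($requestIp, "65.52.0.0", "65.55.255.255")
        || ipInRange($requestIp, "131.107.0.0", "131.107.255.255"))) {
    $isRogueBot = TRUE; /* or serve the friendly GFY page right here */
}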

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.
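
notSafeAtWork() isn’t defined anywhere above; it’s nothing fancier than a keyword check. A sketch, with an example keyword list you’d maintain yourself (ideally in a database table):
/* Illustrative sketch of the notSafeAtWork() helper used above; the keyword
   list is just an example, maintain your own */
function notSafeAtWork($referrerUri) {
    if (empty($referrerUri)) return FALSE;
    $nsawKeywords = array("porn", "viagra", "casino", "debt-consolidation", "payday-loan");
    foreach ($nsawKeywords as $keyword) {
        if (stristr($referrerUri, $keyword)) {
            return TRUE;
        }
    }
    return FALSE;
}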

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}
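
A sketch of what isBlacklistedIp() could look like — the spamIp table layout (column spamIpAddress) and the mysqli handle in $db are assumptions of mine, adapt them to your own data layer:
/* Illustrative sketch: assumes a spamIp table (column spamIpAddress) and a
   mysqli handle in $db */
function isBlacklistedIp($ip) {
    global $db;
    $stmt = $db->prepare("SELECT 1 FROM spamIp WHERE spamIpAddress = ? LIMIT 1");
    $stmt->bind_param("s", $ip);
    $stmt->execute();
    $stmt->store_result();
    $isListed = ($stmt->num_rows > 0);
    $stmt->close();
    return $isListed;
}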

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from roughly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) ||
    stristr($requestUserAgent,"ThingFetcher")) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {
    header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);
    exit;
}

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to http://example.com/go-fuck-yourself, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if ("$serverName" != "$canonicalServerName") {
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
    header("Location: $absoluteUri");
    header("X-Canonical-URI: $absoluteUri"); // experimental
    header("Link: <$absoluteUri>; rel=canonical"); // experimental
    exit;
}

Subdomains are so 1999; also, 2010 is the year of non-’www’ URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = "sebastians-pamphlets.com"; statement. I’ll appreciate the traffic. ;)

When the server name is OK, you should add some basic URI canonicalization routines here. For example, add trailing slashes (if necessary), and remove clutter from query strings.
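
For example, a simple trailing-slash routine could look like the sketch below. The conditions (extensionless path, no query string) are my assumptions; adapt them to your own URI design:
/* Illustrative sketch: add a trailing slash to extensionless paths */
$path = isset($uriParts["path"]) ? $uriParts["path"] : "/";
if (substr($path, -1) != "/"
    && strpos(basename($path), ".") === FALSE
    && empty($uriParts["query"])) {
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
    header("Location: " .$rootUri .$path ."/");
    exit;
}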

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$testForUriClutter = $absoluteUri;
if (!empty($_GET)) {
    foreach ($_GET as $var => $crap) {
        if (stristr($var,"utm_")) {
            $testForUriClutter = str_replace("&$var=$crap", "", $testForUriClutter);
            $testForUriClutter = str_replace("&amp;$var=$crap", "", $testForUriClutter);
            unset($_GET[$var]);
        }
    }
    $uriPartsSanitized = parse_url($testForUriClutter);
    $qs = isset($uriPartsSanitized["query"]) ? $uriPartsSanitized["query"] : "";
    $qs = str_replace("?", "", $qs);
    if ("$qs" != $uriParts["query"]) {
        $canonicalUri = $scheme .$canonicalServerName .$requestScript;
        if (!empty($qs)) {
            $canonicalUri .= "?" .$qs;
        }
        if (!empty($uriParts["fragment"])) {
            $canonicalUri .= "#" .$uriParts["fragment"];
        }
        header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
        header("Location: $canonicalUri");
        exit;
    }
}

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.
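
If you want to sketch that log in code: the table, its columns, and the $db handle below are all assumptions of mine, not something defined earlier in this pamphlet.
/* Illustrative sketch: assumes a uriCanonicalizationLog table with a unique
   key covering (badUri, goodUri) and a mysqli handle in $db */
function logCanonicalization($badUri, $goodUri) {
    global $db, $requestReferrer, $requestScript;
    $stmt = $db->prepare(
        "INSERT INTO uriCanonicalizationLog
            (badUri, goodUri, referrer, scriptName, hitCount, lastUpdate)
         VALUES (?, ?, ?, ?, 1, NOW())
         ON DUPLICATE KEY UPDATE hitCount = hitCount + 1, lastUpdate = NOW()");
    $stmt->bind_param("ssss", $badUri, $goodUri, $requestReferrer, $requestScript);
    $stmt->execute();
    $stmt->close();
}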


Before you serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts["query"]) && empty($uriParts["fragment"]) && file_exists("$rootPath$requestUri")) {
    header("Content-type: " .getContentType("$rootPath$requestUri"), TRUE);
    readfile("$rootPath$requestUri");
    exit;
}
/* getContentType($filename) returns a MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */
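
getContentType() isn’t shown above; a minimal sketch based on the file extension (just one way to do it, extend the mapping as you need) could be:
/* Illustrative sketch of getContentType(): extension based lookup with a
   text/plain fallback so it never returns an empty string */
function getContentType($filename) {
    $mimeTypes = array(
        "jpg"  => "image/jpeg",
        "jpeg" => "image/jpeg",
        "gif"  => "image/gif",
        "png"  => "image/png",
        "pdf"  => "application/pdf",
        "css"  => "text/css",
        "js"   => "application/javascript",
        "txt"  => "text/plain"
    );
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if (isset($mimeTypes[$extension])) {
        return $mimeTypes[$extension];
    }
    return "text/plain";
}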

If your dynamic stuff mimics static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.
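
Here’s a sketch of such a smart robots.txt handler. I’m assuming you place it before the static file delivery above; the actual rules and the crawler-IP bookkeeping are placeholders:
/* Illustrative sketch: serve robots.txt dynamically, reusing the crawler
   detection from above */
if ($uriParts["path"] == "/robots.txt") {
    header("Content-type: text/plain", TRUE);
    if ($isCrawler === TRUE) {
        /* store/refresh $requestIp in your crawler-IP database here */
        echo "User-agent: $crawlerName\nDisallow:\n";
    }
    else {
        echo "User-agent: *\nDisallow: /src/\n";
    }
    exit;
}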

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
if (substr($requestUri, 0, 6) == "/blog/") {
    require("$rootPath/blog/index.php");
    exit;
}

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = "404";
$errSrc = "WordPress";
$errMsg = "The blog couldn't make sense out of this request.";
require("$includePath/err.php");
exit;

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should lookup the shortUri table for a matching $requestUri before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the bells and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update. (You’ll find a rough create/resolve sketch right after this excursus.)

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned build the short URI on-the-fly $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites).

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus
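
To make the excursus a bit more tangible, here’s a rough sketch of the create and resolve methods. It assumes the shortUri table from above and a mysqli handle in $db; everything else is illustrative, not a drop-in implementation:
/* Illustrative sketch only; assumes the shortUri table (suriId, suriUri,
   suriShortcut) described above and a mysqli handle in $db */
function createShortUri($longUri) {
    global $db;
    /* one short URI per original URI: reuse an existing shortcut */
    $stmt = $db->prepare("SELECT suriShortcut FROM shortUri WHERE suriUri = ? LIMIT 1");
    $stmt->bind_param("s", $longUri);
    $stmt->execute();
    $stmt->bind_result($suriShortcut);
    if ($stmt->fetch()) { $stmt->close(); return $suriShortcut; }
    $stmt->close();
    /* insert, then derive the shortcut from the primary key
       (an insert trigger would be cleaner, see above) */
    $stmt = $db->prepare("INSERT INTO shortUri (suriUri, suriShortcut) VALUES (?, NULL)");
    $stmt->bind_param("s", $longUri);
    $stmt->execute();
    $suriId = $db->insert_id;
    $stmt->close();
    $suriShortcut = base_convert(intval($suriId), 10, 36);
    $stmt = $db->prepare("UPDATE shortUri SET suriShortcut = ? WHERE suriId = ?");
    $stmt->bind_param("si", $suriShortcut, $suriId);
    $stmt->execute();
    $stmt->close();
    return $suriShortcut;
}

function resolveShortUri($requestUri) {
    global $db;
    /* strip the leading slash and ballast like query strings/fragments */
    $suriShortcut = trim(parse_url($requestUri, PHP_URL_PATH), "/");
    $stmt = $db->prepare("SELECT suriUri FROM shortUri WHERE suriShortcut = ? LIMIT 1");
    $stmt->bind_param("s", $suriShortcut);
    $stmt->execute();
    $stmt->bind_result($suriUri);
    if ($stmt->fetch()) { $stmt->close(); return $suriUri; }
    $stmt->close();
    /* no exact match: this is where the broader query mentioned above would go;
       return FALSE if it matches more than one record */
    return NULL;
}

With createShortUri() returning the shortcut, the on-the-fly construction $shortUri = getBaseUri() ."/" .$suriShortcut; from above gives you the full short URI.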

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
    $httpResponseCode = "404";
    $errSrc = "sUri";
    $errMsg = "Invalid short URI. Shortcut resolves to more than one result.";
    require("$includePath/err.php");
    exit;
}
if (!empty($suriUri)) {
    if (!stristr($suriUri, $canonicalServerName)) {
        header("HTTP/1.1 301 Here you go", TRUE, 301);
        header("Location: $suriUri");
        exit;
    }
}

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resource from a lookup in this mapping table, then do a 301 redirect to the canonical URI.
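
That lookup could be sketched like this. The screwedUriMap table (storing the messed-up path and the canonical path) and the $db handle are assumptions of mine:
/* Illustrative sketch: assumes a screwedUriMap table (screwedUri, canonicalUri
   as path components) and a mysqli handle in $db */
$stmt = $db->prepare("SELECT canonicalUri FROM screwedUriMap WHERE screwedUri = ? LIMIT 1");
$stmt->bind_param("s", $requestUri);
$stmt->execute();
$stmt->bind_result($mappedUri);
if ($stmt->fetch()) {
    $stmt->close();
    header("HTTP/1.1 301 Moved to the canonical URI", TRUE, 301);
    header("Location: " .$rootUri .$mappedUri);
    exit;
}
$stmt->close();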

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.
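
One way to sketch that hand-off; siteSearch() is a made-up stand-in for whatever search facility you run, and the list of query string variables is far from complete:
/* Illustrative sketch: extract the search query from a search engine referrer
   and pass it to your site search */
$searchQuery = "";
$refParts = @parse_url($requestReferrer);
if (!empty($refParts["query"])) {
    parse_str($refParts["query"], $refVars);
    /* most engines use q=, some use p= or query= */
    foreach (array("q", "p", "query") as $queryVar) {
        if (!empty($refVars[$queryVar])) { $searchQuery = $refVars[$queryVar]; break; }
    }
}
/* strip operators like site: and inurl: */
$searchQuery = trim(preg_replace('/\b(site|inurl|intitle):\S+/i', '', $searchQuery));
if (!empty($searchQuery)) {
    /* log REQUEST_URI as invalid, then hand over to the site search */
    siteSearch($searchQuery);
    exit;
}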

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example, don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occurred, but your error page’s prominent message should consist of hints on how the visitor can reach the content they were probably looking for.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has outputted something to the user. With a previous output of just one single byte you can’t send an HTTP response header. Hence prefix the header() statement with a ‘@’ to suppress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
    $httpResponseCode = "404";
}
if (empty($httpResponseCode)) {
    $httpResponseCode = "501"; // log for debugging
}
@header("HTTP/1.1 $httpResponseCode Shit happens", TRUE, intval($httpResponseCode));
logHeaderErr(error_get_last());

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).


All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.




Another way to implement a site search facility

Providing kick ass navigation and product search is the key to success for e-commerce sites. Conversion rates highly depend on user-friendly UIs which enable the shopper to find the desired product with a context sensitive search in combination with a few drill-down clicks on navigational links. Unfortunately, the built-in search as well as the navigation and site structure of most shopping carts simply suck. Every online store is different, hence findability must be customizable and very flexible.

I’ve seen online shops crawling their product pages with a 3rd party search engine script because the shopping cart’s search functionality was totally and utterly useless. Others put fantastic effort into self-made search facilities which perfectly implement real life relations beyond the limitations of the e-commerce software’s data model, but need code tweaks for each and every featured product, special, virtual shop assembling a particular niche from several product lines, or whatever. Bugger.

Today I stumbled upon a very interesting approach which could become the holy grail for store owners suffering from crappy software. Progress invited me to discuss a product they’ve bought recently –EasyAsk– from a search geek’s perspective. Long story short, I was impressed. Without digging deep into the technology or reviewing implementations for weaknesses I think the idea behind that tool is promising.

Unfortunately, the EasyAsk Web site doesn’t provide solid technical and architectural information (I admit that I may have missed the tidbits within the promotional chatter), hence I try to explain it from what I’ve gathered today. Progress EasyAsk is a natural language interface connecting users to data sources. Users are shoppers, and staff. Data sources are (relational) databases, or data access layers (that is a logical tier providing a standardized interface to different data pools like all sorts of databases, (Web) services, an enterprise service bus, flat files, XML documents and whatever).

The shopper can submit natural language queries like “yellow XS tops under 30 bucks”. The SRP is a page listing tops and similar garments under 30.00$, size XS, illustrated with thumbnails of pics of yellow tops and bustiers, linked to the product pages. If yellow tops in XS are sold out, EasyAsk recommends beige tops instead of delivering a sorry-page. Now when a search query is submitted from a page listing suits, a search for “black leather belts” lists black leather belts for men. If the result set is too large and exceeds the limitations of one page, EasyAsk delivers drill-down lists of tags, categories and synonyms until the result set is viewable on one page. The context (category/tag tree) changes with each click and can be visualized for example as bread crumb nav link.

Technically speaking, EasyAsk does not deal with the content presentation layer itself. It returns XML which can be used to create a completely new page with a POST/GET request, or it gets invoked as an AJAX request whose response just alters DOM objects to visualize the search results (way faster, but not exactly search engine friendly - that’s not a big deal because SERPs shouldn’t be crawlable at all). Performance is not an issue from what I’ve seen. EasyAsk caches everything so that the server doesn’t need to bother the hard disk. All points of failure (WRT performance issues) belong to the implementation, thus developing a well thought out software architecture is a must-have.

Well, that’s neat, but where’s the USP? EasyAsk comes with a natural language (search) driven admin interface too. That means that product managers can define and retrieve everything (attributes, synonyms, relations, specials, price ranges, groupings …) using natural language. “Gimme gross sales of leather belts for men II/2007 compared to 2006” delivers a statistic and “top is a synonym for bustier and the other way round” creates a relation. The admin interface runs in the Web browser, definitions can be submitted via forms and all admin functions come with previews. Really neat. That reduces the workload of the IT dept. WRT ad-hoc queries as well as for lots of structural change requests, and saves maintenance costs (Web design / Web development).

I’ve spotted a few weak points, though. For example, in the current version the user has to type in SKUs because there’s no selection box. Also, meta data is stored in flat files, but that’s going to change too. There’s no real word stemming: EasyAsk handles singular/plural correctly and interprets “bigger” as “big” or, politically correctly, “xx-large” as “plus”, but typos must be collected from the “searches without results” report and defined as synonyms. The visualization of concurrent or sequentially applied business rules is just rudimentary on preview pages in the admin interface, so currently it’s hard to track down why particular products get downranked or highlighted when more than one rule applies. Progress told me that they’ll make use of 3rd party tools as well as in-house solutions to solve these issues in the near future - the integration of EasyAsk into the Progress landscape has just begun.

The definitions of business language / expected terms used by consumers as well as business rules are painless. EasyAsk has built-in mappings like color codes to common color names and vice versa, understands terms like “best selling” and “overstock”, and these definitions are easy to extend to match actual data structures and niche specific everyday language.

Setting up the product needs consultancy (as a consultant I love that!). To get EasyAsk running it must understand the structure of the customer’s data sources, or rather the methods provided to fetch data from various structured as well as unstructured sources. Once that’s configured, EasyAsk pulls (database) updates on schedule (daily, hourly, minutely or whatever). It caches all information needed to fulfill search requests, but goes back to the data source to fetch real time data when the search query requires knowledge of not (yet) cached details. In the beginning such events must be dealt with, but after a (short) while EasyAsk should run smoothly without requiring much technical intervention (as a consultant I hate that, but the client’s IT department will love it).

Full disclosure: Progress didn’t pay me for that post. For attending the workshop I got two books (“Enterprise Service Bus” by David A. Chappell and “Getting Started with the SID” by John P. Reilly) and a free meal; travel expenses were not refunded. I did not test the software discussed myself (yet), so perhaps my statements (conclusions) are not accurate.


