Monthly archive: December, 2009

How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily understand it as a part of my ongoing holy war on URI shorteners.

Assuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At first sight, searching for a free URI shortener script to implement it on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.

So if you’re not running cheap-student-loans-with-debt-consolidation-on-each-payday-is-a-must-have-for-sexual-heroes-desperate-for-a-viagra-overdose-and-extreme-penis-length-enhancement.info and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the ability to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example when you operate a WordPress blog, you could code nearly everything below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

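To provide enough context to get the advantages of a fully integrated URI shortener, vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:

  • Introduction
  • Block rogue bots
  • Server name canonicalization
  • Deliver static stuff (images …)
  • Execute script (dynamic URI)
  • Resolve shortened URI
  • Redirect to destination (invalid request)
  • Guess destination (invalid request)
  • Serve a useful error page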

Introduction

There’s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If /requestHandler.php exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "sebastians-pamphlets.com";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path add a slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* Customize that, maybe you've to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : ""; /* not every client sends a referrer */
$requestUserAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : ""; /* ... or a user agent */
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $_SERVER["PHP_SELF"]; /* the bare $PHP_SELF works only with register_globals enabled */
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas how to block rogue (crappy, not behaving, …) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as search engine crawler: */
if ($isCrawler !== TRUE) {
    $crawlerName = NULL;
    $crawlerHost = NULL;
    $crawlerServer = NULL;
    if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = ".crawl.baidu.com";}
    ...
    if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ".googlebot.com"; }
    if ($crawlerName != NULL) {
        $reverseIp = @gethostbyaddr($requestIp); /* reverse DNS lookup */
        if (!stristr($reverseIp,$crawlerServer)) {
            $isCrawler = FALSE;
        }
        if ("$reverseIp" == "$requestIp") {
            $isCrawler = FALSE; /* the IP didn't resolve to a host name */
        }
        if ($isCrawler !== FALSE) {
            $chkIpAddyRev = @gethostbyname($reverseIp); /* the forward lookup must return the requesting IP */
            if ("$chkIpAddyRev" == "$requestIp") {
                $isCrawler = TRUE;
                $crawlerHost = $reverseIp;
                // store the newly discovered crawler IP
            }
        }
    }
}

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler &&
"$crawlerName" == "Baiduspider") {
$isRogueBot = TRUE;
}

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve of such scams, block incoming requests from the IP address range 65.52.0.0 to 65.55.255.255 (131.107.0.0 to 131.107.255.255 …) when the referrer is a Bing SERP. With this method you occasionally might block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.
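A minimal sketch of such a check, assuming IPv4 addresses only. Both ipInRange() and isBingSerp() are hypothetical helpers of mine, and what counts as a Bing SERP (host bing.com, path starting with /search) is an assumption you should verify against your own logs:

```php
<?php
/* Hypothetical helper: TRUE if $ip lies within the inclusive IPv4 range.
   Assumes 64-bit PHP, where ip2long() never returns negative numbers. */
function ipInRange($ip, $rangeStart, $rangeEnd) {
    $ipNum = ip2long($ip);
    $start = ip2long($rangeStart);
    $end   = ip2long($rangeEnd);
    if ($ipNum === FALSE || $start === FALSE || $end === FALSE) {
        return FALSE; /* not a valid IPv4 address */
    }
    return ($ipNum >= $start && $ipNum <= $end);
}

/* Hypothetical helper: does the referrer look like a Bing result page? */
function isBingSerp($referrer) {
    $parts = parse_url($referrer);
    if (empty($parts["host"]) || empty($parts["path"])) {
        return FALSE;
    }
    $host = strtolower($parts["host"]);
    $isBingHost = ($host === "bing.com" || substr($host, -9) === ".bing.com");
    return ($isBingHost && strpos($parts["path"], "/search") === 0);
}

/* Usage in the request handler:
   if (isBingSerp($requestReferrer) && ipInRange($requestIp, "65.52.0.0", "65.55.255.255")) {
       $isRogueBot = TRUE;
   } */
```

Comparing integers is cheap, so this check costs next to nothing per request.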

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.
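For illustration, notSafeAtWork() could look like the sketch below. This is just one possible implementation; the default keyword list is a made-up example, and in real life you’d pull the keywords from a database table:

```php
<?php
/* Hypothetical implementation of notSafeAtWork(); the default keyword
   list is only an example, not a recommendation. */
function notSafeAtWork($referrer, $keywords = array("porn", "viagra", "debt-consolidation")) {
    if (empty($referrer)) {
        return FALSE; /* no referrer, nothing to check */
    }
    /* Decode %-escapes so that encoded keywords get caught, too. */
    $haystack = strtolower(urldecode($referrer));
    foreach ($keywords as $keyword) {
        if (strpos($haystack, strtolower($keyword)) !== FALSE) {
            return TRUE;
        }
    }
    return FALSE;
}
```

A plain substring match produces false positives sooner or later, so review what it blocks before you trust it.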

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from roughly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) || stristr($requestUserAgent, "ThingFetcher")) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {

header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);
exit;
}

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to http://example.com/go-fuck-yourself, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if ("$serverName" != "$canonicalServerName") {
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
    header("Location: $absoluteUri");
    header("X-Canonical-URI: $absoluteUri"); // experimental
    header("Link: <$absoluteUri>; rel=canonical"); // experimental
    exit;
}

Subdomains are so 1999, also 2010 is the year of non-’.www’ URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = "sebastians-pamphlets.com"; statement. I’ll appreciate the traffic. ;)

When the server name is Ok, you should add some basic URI canonicalization routines here. For example add trailing slashes -if necessary-, and remove clutter from query strings.

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$cleanGet = array();
$hasUriClutter = FALSE;
foreach ($_GET as $var => $crap) {
    if (stripos($var, "utm_") === 0) {
        $hasUriClutter = TRUE; /* Google Analytics tracking variable */
        unset($_GET[$var]);
    }
    else {
        $cleanGet[$var] = $crap;
    }
}
if ($hasUriClutter) {
    /* rebuild the URI from the requested path and the remaining variables;
       fragments never reach the server, so there's nothing to restore */
    $qs = http_build_query($cleanGet);
    $canonicalUri = $scheme .$canonicalServerName .$uriParts["path"];
    if (!empty($qs)) {
        $canonicalUri .= "?" .$qs;
    }
    header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
    header("Location: $canonicalUri");
    exit;
}

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.


Before you’re going to serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts["query"]) && empty($uriParts["fragment"]) && is_file("$rootPath$requestUri")) { /* is_file() skips directories */
    header("Content-type: " .getContentType("$rootPath$requestUri"), TRUE);
    readfile("$rootPath$requestUri");
    exit;
}
/* getContentType($filename) returns a MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */
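A possible getContentType() is sketched below. It simply maps file extensions to MIME media types; the extension table and the application/octet-stream fallback are my assumptions, so extend the list with whatever your site actually serves:

```php
<?php
/* Sketch of getContentType(): maps a file extension to a MIME media type.
   The table is an assumption; add the types your site serves. */
function getContentType($filename) {
    $types = array(
        "jpg"  => "image/jpeg",
        "jpeg" => "image/jpeg",
        "gif"  => "image/gif",
        "png"  => "image/png",
        "pdf"  => "application/pdf",
        "txt"  => "text/plain",
        "css"  => "text/css",
        "js"   => "application/javascript",
    );
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if (isset($types[$extension])) {
        return $types[$extension];
    }
    return "application/octet-stream"; /* generic fallback, never an empty string */
}
```

A white list like this has a nice side effect: file types you didn’t list never get served with a misleading content type.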

If your dynamic stuff mimics static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
if (substr($requestUri, 0, 6) == "/blog/") {
    require("$rootPath/blog/index.php");
    exit;
}

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = "404";
$errSrc = "WordPress";
$errMsg = "The blog couldn’t make sense out of this request.";
require("$includePath/err.php");
exit;

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should look up a matching $requestUri in the shortUri table before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the bells and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update.

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned build the short URI on-the-fly $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites).

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus
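The create and retrieve methods outlined in the excursus can be sketched as follows. To keep the example self-contained, a plain PHP array (shortcut => original URI) stands in for the shortUri table; createShortcut() and resolveShortcut() are hypothetical names, and the return convention (URI, NULL for no match, FALSE for an ambiguous shortcut) is an assumption:

```php
<?php
/* Create a shortcut from the numeric primary key (suriId). */
function createShortcut($suriId) {
    return base_convert((string) intval($suriId), 10, 36);
}

/* Resolve a requested URI to the original URI.
   $shortUris is a stand-in for the shortUri table: suriShortcut => suriUri.
   Returns the original URI, NULL if nothing matches,
   or FALSE if a truncated shortcut matches more than one record. */
function resolveShortcut($requestUri, array $shortUris) {
    /* Drop query string and fragment, the leading slash, and trailing dots. */
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    $shortcut = rtrim(ltrim($path, "/"), ".");
    if ($shortcut === "") {
        return NULL;
    }
    if (isset($shortUris[$shortcut])) {
        return $shortUris[$shortcut]; /* exact match */
    }
    /* Broader query: treat the input as a truncated shortcut. */
    $matches = array();
    foreach ($shortUris as $candidate => $uri) {
        if (strpos((string) $candidate, $shortcut) === 0) {
            $matches[] = $uri;
        }
    }
    if (count($matches) === 1) {
        return $matches[0];
    }
    return (count($matches) > 1) ? FALSE : NULL;
}
```

With a real database the broader query would be a LIKE 'xxyyz%' on suriShortcut, which the unique index can serve.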

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
    $httpResponseCode = "404";
    $errSrc = "sUri";
    $errMsg = "Invalid short URI. Shortcut resolves to more than one result.";
    require("$includePath/err.php");
    exit;
}
if (!empty($suriUri)) {
    if (!stristr($suriUri, $canonicalServerName)) {
        header("HTTP/1.1 301 Here you go", TRUE, 301);
        header("Location: $suriUri");
        exit;
    }
}

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resource from a lookup in this mapping table, then do a 301 redirect to the canonical URI.

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.
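Both kinds of redirects boil down to one lookup. Here’s a rough sketch, assuming the mapping table and the list of transient URIs have been loaded into arrays; lookupRedirect() and both array names are hypothetical:

```php
<?php
/* Hypothetical lookup. Returns array(statusCode, destinationUri) or NULL.
   $screwedUris maps known malformed URIs to canonical URIs (permanent, 301).
   $transientUris maps stable "featured" URIs to the current page (temporary, 307). */
function lookupRedirect($requestUri, array $screwedUris, array $transientUris) {
    if (isset($screwedUris[$requestUri])) {
        return array(301, $screwedUris[$requestUri]);
    }
    if (isset($transientUris[$requestUri])) {
        return array(307, $transientUris[$requestUri]);
    }
    return NULL; /* nothing known about this URI */
}

/* Usage in the request handler:
   $redirect = lookupRedirect($requestUri, $screwedUris, $transientUris);
   if ($redirect !== NULL) {
       header("Location: " .$redirect[1], TRUE, $redirect[0]);
       exit;
   } */
```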

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.
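Extracting the search query from the referrer could work like the sketch below. It assumes the engine passes the phrase in a “q” or “p” query string variable, which you should check against the referrers in your logs; extractSearchQuery() is a hypothetical helper:

```php
<?php
/* Hypothetical helper: pull the search phrase out of a search engine referrer.
   Assumes the phrase travels in a "q" or "p" query string variable. */
function extractSearchQuery($referrer) {
    $queryString = parse_url($referrer, PHP_URL_QUERY);
    if (empty($queryString)) {
        return "";
    }
    parse_str($queryString, $vars); /* decodes %-escapes and plus signs */
    $phrase = "";
    foreach (array("q", "p") as $name) {
        if (!empty($vars[$name]) && is_string($vars[$name])) {
            $phrase = $vars[$name];
            break;
        }
    }
    /* Strip operators like site:example.com and inurl:foo. */
    $phrase = preg_replace('/\b(site|inurl):\S+/i', ' ', $phrase);
    /* Collapse the white space left behind. */
    return trim(preg_replace('/\s+/', ' ', $phrase));
}
```

An empty result means there’s nothing to search for, so fall through to the error page.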

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occurred, but your error page’s prominent message should consist of hints how the visitor can reach the expected content.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has sent any output to the user. With a previous output of just one single byte you can’t send an HTTP response header. Hence prefix the header() statement with a ‘@’ to suppress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
    $httpResponseCode = "404";
}
if (empty($httpResponseCode)) {
    $httpResponseCode = "501"; // log for debugging
}
@header("HTTP/1.1 $httpResponseCode Shit happens", TRUE, intval($httpResponseCode));
logHeaderErr(error_get_last());

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).


All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.




URI canonicalization with an X-Canonical-URI HTTP header

Dear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I’m desperately in need of a good reason to praise your wisdom and whatnot. From this year’s x-mas wish list:

All search engines obey the X-Canonical-URI HTTP header

The rel=canonical link element is a great tool, at least if applied properly, but sometimes it’s a royal pain in the ass.

Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there’s no way to get a hand on the source code to change them.

Also, lots of resources can’t be stuffed with HTML’s link elements, for example dynamically created PDFs, plain text files, or images.

It’s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren’t a doable option, for example when the destination URI is #5 in a redirect chain that can’t get shortened because the redirects are performed by a 3rd party that doesn’t cooperate.

This one, on the other hand, is elegant and scalable:

if (messedUp($_SERVER["REQUEST_URI"])) {
    header("X-Canonical-URI: $canonicalUri");
}

Or:
header("Link: <http://example.com/canonical-uri/>; rel=canonical");

Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.

By the way, having all URI canonicalization routines in one piece of code is way more transparent, and way better manageable, than a bazillion of isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.

Dear search engines, if you make that happen, I promise that I won’t tweet your products with a “#crap” hashtag for the whole rest of this year. Deal?

And yes, I know I’m somewhat late, two days before x-mas, but you’ve got smart developers, haven’t you? So please, go get your ‘code monkeys’ to work and surprise me. Thanks.




The anatomy of a deceptive Tweet spamming Google Real-Time Search

Minutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I’d be the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

  • 1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
  • 1 landing page that makes the punter pull out the plastic in no time.
  • 1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
  • 1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.

Preparation

Create a new text file and name it hot-topic.php or so. Then code:
<?php
$landingPageUri = "http://affiliate-program.com/?your-aff-id";
$trustedPageUri = "http://google.com/something.py";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
header("HTTP/1.1 307 Here you go today", TRUE, 307);
header("Location: $trustedPageUri");
}
else {
header("HTTP/1.1 301 Happy shopping", TRUE, 301);
header("Location: $landingPageUri");
}
exit;
?>

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

  • Login to Twitter and submit lots of nicely crafted, not too much keyword stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girl friend’s romantic moments’.
  • On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
  • Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
  • Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!




The perfect robots.txt for News Corp

I appreciate Google’s brand new News User Agent. It is, however, not a perfect solution, because it doesn’t distinguish indexing and crawling.

Disallow is a crawler directive, that simply tells web robots “do not fetch my content”. It doesn’t prevent contents from indexing. That means, search engines can index content they’re not allowed to fetch from the source, and send free traffic to disallow’ed URIs. In case of news, there are enough 3rd party signals (links, anchor text, quotes, …) out there to create a neat title and snippet on the SERPs.

Fortunately, Google’s REP implementation allows news sites to refine the suggested robots.txt syntax below. Google supports noindex in robots.txt.

Below I’ve edited the robots.txt syntax suggested by Google (source).

Include pages in Google web search, but not in News:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-News
Disallow: /
Noindex: /

This robots.txt file says that no files are disallowed from Google’s general web crawler, called Googlebot, but the user agent “Googlebot-News” is blocked from all files on the website. The “Noindex” directive makes sure that Google News cannot use forbidden stuff indexed from 3rd party signals.

Include pages in Google News, but not in web search:

User-agent: Googlebot
Disallow: /
Noindex: /

User-agent: Googlebot-News
Disallow:

When parsing a robots.txt file, Google obeys the most specific directive. The first two lines tell us that Googlebot (the user agent for Google’s web index) is blocked from crawling any pages from the site. The next directive, which applies to the more specific user agent for Google News, overrides the blocking of Googlebot and gives permission for Google News to crawl pages from the website. The “Noindex” directive makes sure that Google Web Search cannot use forbidden stuff indexed from 3rd party signals.

Of course other search engines might handle this differently. So it is obviously a good idea to add indexer directives on page level, too. The most elegant way to do that is a noindex,noarchive,nosnippet X-Robots-Tag in the HTTP header, because images, videos, PDFs etc. can’t be stuffed with HTML’s META elements.
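In PHP that’s a one-liner before any output is sent. A tiny sketch; robotsTagHeader() is a hypothetical helper, and the directive list is the one mentioned above:

```php
<?php
/* Hypothetical helper: builds the X-Robots-Tag header line.
   The default directives are the ones named in the text above. */
function robotsTagHeader(array $directives = array("noindex", "noarchive", "nosnippet")) {
    return "X-Robots-Tag: " .implode(", ", $directives);
}

/* Usage, before the script outputs the PDF, image, or whatever:
   header(robotsTagHeader());
   readfile($pathToPdf); */
```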

See how this works neatly with Web standards? There’s no need for ACrAP!




Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

The canonical URI ends right before the question mark; everything after the question mark is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source utm_medium utm_campaign utm_content utm_term) the amount of cluttered URIs pointing to the same piece of content can sum up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo haven’t indexed GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurish extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.
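To see what a landing page script written in PHP gets out of the example URI above, run it through PHP’s own parsers. This is just a demonstration with parse_url() and parse_str(); other platforms may split the string differently:

```php
<?php
$uri = "http://www.example.com/search.aspx?Query=musical+mobile"
     . "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies";

/* Everything after the first "?" counts as query string. */
$queryString = (string) parse_url($uri, PHP_URL_QUERY);
parse_str($queryString, $vars);

/* $vars["Query"] now holds "musical mobile?utm_source=Referral":
   the search phrase is polluted with Google's clutter, and utm_source
   never shows up as a variable of its own. */
```

So the site search looks for a phrase nobody typed, and the tracking data is incomplete as well.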

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless of whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I dug a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.
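A rough Python sketch of how such a trackback check can go wrong. It assumes the check is an exact string match between the post's URI and the href values found on the linking page; that is an assumption for illustration, not a description of how any particular plug-in works, and the names `LinkCollector` and `links_back` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_back(source_html, post_uri):
    """Assumed check: does the pinging page link to exactly post_uri?"""
    collector = LinkCollector()
    collector.feed(source_html)
    return post_uri in collector.hrefs


POST_URI = "http://www.example.com/my-post/"
CLEAN_PAGE = '<p><a href="http://www.example.com/my-post/">nice post</a></p>'
# Same link, but copied from a feed reader with tracking clutter attached.
TAGGED_PAGE = ('<p><a href="http://www.example.com/my-post/'
               '?utm_source=feedburner&amp;utm_medium=feed">nice post</a></p>')

print(links_back(CLEAN_PAGE, POST_URI))   # the clean link is recognized
print(links_back(TAGGED_PAGE, POST_URI))  # the tagged link is not
```

With a strict comparison like this, the linking page looks as if it doesn't link to the post at all, so the trackback is treated as spam.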

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, or rather now counterproductive with regard to GA reporting. Google’s UTM clutter has impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds -I really mean that, literally!-, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.
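The redirect idea can be sketched in a few lines of Python. Everything here is hypothetical: the token `abc123`, the in-memory `CLICK_TARGETS` and `CLICK_LOG` stores, and the `handle_click` function stand in for whatever Feedburner would actually use. It only shows the principle: log the click data server-side, then answer with a 301 to the untouched canonical URI.

```python
from datetime import datetime, timezone

# Hypothetical lookup table: one token per feed item link, holding the
# canonical URI plus the tracking values that are known when the feed is served.
CLICK_TARGETS = {
    "abc123": {
        "uri": "http://www.example.com/my-post/",
        "source": "feedburner",
        "medium": "feed",
        "campaign": "Feed: example",
    },
}

# Hypothetical click store; a real service would use a database.
CLICK_LOG = []


def handle_click(token, ip_address, now=None):
    """Record the click, then redirect to the clean canonical URI."""
    target = CLICK_TARGETS.get(token)
    if target is None:
        return 404, {}
    CLICK_LOG.append({
        "token": token,
        "ip": ip_address,
        "timestamp": now or datetime.now(timezone.utc),
        "source": target["source"],
        "medium": target["medium"],
        "campaign": target["campaign"],
    })
    # The Location header carries no tracking variables at all.
    return 301, {"Location": target["uri"]}


status, headers = handle_click("abc123", "192.0.2.1")
print(status, headers["Location"])
```

The landing page URI stays canonical, and the tracking data sits where it belongs: in the tracker's own storage, keyed by data the analytics script can match against.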

Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.
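For those who prefer to clean URIs in their own scripts, here is a minimal Python sketch of the same idea. It is not the code behind the tool; the function name `kill_utm` is made up, and it assumes that every extra “?” in a query string is a misplaced “&”, which can be wrong for URIs that legitimately carry a question mark inside a value. Re-encoding the remaining variables may also normalize their escaping.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def kill_utm(uri):
    """Return uri without 'utm_' variables and without stray '?' delimiters."""
    parts = urlsplit(uri)
    # Assumption: a '?' inside the query string is a botched '&'.
    query = parts.query.replace("?", "&")
    kept = [
        (name, value)
        for name, value in parse_qsl(query, keep_blank_values=True)
        if not name.lower().startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(kill_utm(
    "http://www.example.com/search.aspx?Query=musical+mobile"
    "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies"
))
```

For the example URI from above this yields the search URI with its Query variable intact and nothing else attached.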

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!


