How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily understand it as part of my ongoing holy war on URI shorteners.

Assuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At first sight, searching for a free URI shortener script to implement on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.

So if you’re not running cheap-student-loans-with-debt-consolidation-on-each-payday-is-a-must-have-for-sexual-heroes-desperate-for-a-viagra-overdose-and-extreme-penis-length-enhancement.info and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the ability to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example when you operate a WordPress blog, you could code nearly everything below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

To provide enough context to get the advantages of a fully integrated URI shortener, vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:

  • Introduction
  • Block rogue bots
  • Server name canonicalization
  • Deliver static stuff (images …)
  • Execute script (dynamic URI)
  • Resolve shortened URI
  • Excursus: URI shortener components
  • Redirect to destination (invalid request)
  • Guess destination (invalid request)
  • Serve a useful error page

Introduction

There’s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If /requestHandler.php exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "sebastians-pamphlets.com";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path add a slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* customize that; maybe you have to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = $_SERVER["HTTP_REFERER"];
$requestUserAgent = $_SERVER["HTTP_USER_AGENT"];
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $_SERVER["PHP_SELF"];
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas how to block rogue (crappy, not behaving, …) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as search engine crawler: */
if ($isCrawler !== TRUE) {
$crawlerName = NULL;
$crawlerHost = NULL;
$crawlerServer = NULL;
if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = ".crawl.baidu.com";}
...
if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ".googlebot.com"; }
if ($crawlerName != NULL) {
$reverseIp = @gethostbyaddr($requestIp);
if (!stristr($reverseIp,$crawlerServer)) {
$isCrawler = FALSE;
}
if ("$reverseIp" == "$requestIp") {
$isCrawler = FALSE;
}
if ($isCrawler !== FALSE) {
$chkIpAddyRev = @gethostbyname($reverseIp);
if ("$chkIpAddyRev" == "$requestIp") {
$isCrawler = TRUE;
$crawlerHost = $reverseIp;
// store the newly discovered crawler IP
}
}
}
}

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler && "$crawlerName" == "Baiduspider") {
$isRogueBot = TRUE;
}

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve such scams, block incoming requests from the IP address range 65.52.0.0 to 65.55.255.255 (also 131.107.0.0 to 131.107.255.255 …) when the referrer is a Bing SERP. With this method you occasionally might block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.
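A minimal sketch of such a check (untested; serveGfyPage() and the exact referrer pattern are my assumptions, not part of the original setup):

/* flag requests from Microsoft's IP ranges that carry a Bing SERP referrer */
$ipLong = ip2long($requestIp);
$isMsftIp = ($ipLong !== FALSE) &&
 (($ipLong >= ip2long("65.52.0.0") && $ipLong <= ip2long("65.55.255.255")) ||
  ($ipLong >= ip2long("131.107.0.0") && $ipLong <= ip2long("131.107.255.255")));
if ($isMsftIp && stristr($requestReferrer, "bing.com/search")) {
 serveGfyPage(); /* hypothetical helper that outputs the friendly explanation page */
 exit;
}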

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.
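notSafeAtWork() isn't spelled out above; here's a minimal, untested sketch of the keyword filter idea (the keyword list is just an example, extend it to fit your needs):

function notSafeAtWork($referrer) {
 /* crude keyword filter run against the referrer string */
 $nsawKeywords = array("porn", "viagra", "casino", "debt-consolidation", "payday-loan");
 foreach ($nsawKeywords as $keyword) {
  if (stristr($referrer, $keyword)) {
   return TRUE;
  }
 }
 return FALSE;
}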

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from roughly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) ||
stristr($requestUserAgent,"ThingFetcher")) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {
header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);
exit;
}

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to http://example.com/go-fuck-yourself, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if ("$serverName" != "$canonicalServerName") {
header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
header("Location: $absoluteUri");
header("X-Canonical-URI: $absoluteUri"); // experimental
header("Link: <$absoluteUri>; rel=canonical"); // experimental
exit;
}

Subdomains are so 1999, and 2010 is the year of non-www URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = "sebastians-pamphlets.com"; statement. I’ll appreciate the traffic. ;)

When the server name is OK, you should add some basic URI canonicalization routines here. For example add trailing slashes -if necessary-, and remove clutter from query strings.
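For example, a minimal (untested) routine that appends a missing trailing slash to directory-like paths could look like this; adapt the heuristics to your own URI design:

/* add a trailing slash to extensionless paths without a query string */
$path = $uriParts["path"];
if (empty($uriParts["query"]) && substr($path, -1) != "/" && !strpos(basename($path), ".")) {
 header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
 header("Location: " .$rootUri .$path ."/");
 exit;
}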

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$testForUriClutter = $absoluteUri;
if (!empty($_GET)) {
foreach ($_GET as $var => $crap) {
if ( stristr($var,"utm_") ) {
$testForUriClutter = str_replace("?$var=$crap", "?", $testForUriClutter);
$testForUriClutter = str_replace("&$var=$crap", "", $testForUriClutter);
unset ($_GET[$var]);
}
}
$uriPartsSanitized = parse_url($testForUriClutter);
$qs = $uriPartsSanitized["query"];
$qs = str_replace("?", "", $qs);
if ("$qs" != $uriParts["query"]) {
$canonicalUri = $scheme .$canonicalServerName .$requestScript;
if (!empty($qs)) {
$canonicalUri .= "?" .$qs;
}
if (!empty($uriParts["fragment"])) {
$canonicalUri .= "#" .$uriParts["fragment"];
}
header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
header("Location: $canonicalUri");
exit;
}
}

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.


Before you’re going to serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts["query"]) && empty($uriParts["fragment"]) && file_exists("$rootPath$requestUri")) {
header("Content-type: " .getContentType("$rootPath$requestUri"), TRUE);
readfile("$rootPath$requestUri");
exit;
}
/* getContentType($filename) returns a
MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */
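getContentType() is left to the reader above; here's a minimal, untested sketch that maps file extensions to MIME types (the map is just an example, extend it for whatever static stuff you serve):

function getContentType($filename) {
 $mimeTypes = array(
  "html" => "text/html", "htm" => "text/html", "css" => "text/css", "js" => "application/javascript",
  "jpg" => "image/jpeg", "jpeg" => "image/jpeg", "gif" => "image/gif", "png" => "image/png",
  "pdf" => "application/pdf", "txt" => "text/plain", "xml" => "application/xml"
 );
 $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
 if (isset($mimeTypes[$ext])) {
  return $mimeTypes[$ext];
 }
 return "application/octet-stream"; /* never an empty string */
}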

If your dynamic stuff mimics static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.
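A minimal sketch of such a "smart" robots.txt handler (untested; logCrawlerIp() is a hypothetical logging helper, and the rules below are just placeholders):

if ($requestUri == "/robots.txt") {
 header("Content-type: text/plain", TRUE);
 if ($isCrawler === TRUE) {
  logCrawlerIp($requestIp, $requestUserAgent); /* maintain your crawler-IP table */
 }
 echo "User-agent: *\n";
 echo "Disallow: /src/\n"; /* example rule */
 exit;
}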

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
if (substr($requestUri, 0, 6) == "/blog/") {
require("$rootPath/blog/index.php");
exit;
}

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = "404";
$errSrc = "WordPress";
$errMsg = "The blog couldn't make sense out of this request.";
require("$includePath/err.php");
exit;

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should lookup the shortUri table for a matching $requestUri before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the bells and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update. (A minimal sketch of both the create and the resolve method follows right after this excursus.)

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned build the short URI on-the-fly $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites).

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus
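Here is a minimal, untested sketch of the create and resolve methods (the db* helpers are placeholders for whatever database layer you use):

function createShortUri($longUri, $shortcut = NULL) {
 /* URI shortening is bad enough, so reuse an existing shortcut for this URI */
 $existing = dbGetShortcutByUri($longUri); /* hypothetical DB helper */
 if ($existing) return $existing;
 $suriId = dbInsertShortUri($longUri); /* hypothetical: returns the new primary key (LAST_INSERT_ID) */
 if (empty($shortcut)) {
  $shortcut = base_convert(intval($suriId), 10, 36);
 }
 dbUpdateShortcut($suriId, $shortcut); /* hypothetical */
 return $shortcut;
}

function resolveShortUri($requestUri) {
 /* strip the leading slash and ballast like query strings or fragments */
 $shortcut = preg_replace('/[?#].*$/', '', ltrim($requestUri, "/"));
 $rows = dbGetUrisByShortcutPrefix($shortcut); /* hypothetical: exact lookup, then the broader query */
 if (count($rows) == 1) return $rows[0]; /* unambiguous match */
 if (count($rows) > 1) return FALSE;     /* ambiguous: more than one candidate */
 return "";                              /* no match at all: not a shortcut */
}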

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
$httpResponseCode = "404";
$errSrc = "sUri";
$errMsg = "Invalid short URI. Shortcut resolves to more than one result.";
require("$includePath/err.php");
exit;
}
if (!empty($suriUri)) {
if (!stristr($suriUri, $canonicalServerName)) {
header("HTTP/1.1 301 Here you go", TRUE, 301);
header("Location: $suriUri");
exit;
}
}

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resource from a lookup in this mapping table, then do a 301 redirect to the canonical URI.
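A minimal (untested) sketch of that lookup; lookupScrewedUri() is a hypothetical helper that queries the mapping table:

$mappedUri = lookupScrewedUri($requestUri); /* hypothetical mapping-table lookup */
if (!empty($mappedUri)) {
 header("HTTP/1.1 301 Please update your link", TRUE, 301);
 header("Location: $mappedUri");
 exit;
}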

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.
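A sketch of extracting a searcher's query from a Google referrer (untested; it assumes Google's q parameter and a /search?q= facility on your own site, both of which you'd adapt):

$refParts = parse_url($requestReferrer);
if (!empty($refParts["host"]) && stristr($refParts["host"], "google.") && !empty($refParts["query"])) {
 parse_str($refParts["query"], $refVars);
 if (!empty($refVars["q"])) {
  /* strip search operators like site: and inurl: */
  $searchTerm = trim(preg_replace('/\b(site|inurl|intitle):\S+/i', '', $refVars["q"]));
  header("HTTP/1.1 307 Try the site search", TRUE, 307);
  header("Location: " .$rootUri ."/search?q=" .urlencode($searchTerm));
  exit;
 }
}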

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occurred, but your error page’s prominent message should consist of hints on how the visitor can reach the content they were looking for.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has outputted something to the user. With a previous output of just one single byte you can’t send an HTTP response header. Hence prefix the header() statement with a ‘@’ to suppress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
$httpResponseCode = "404";
}
if (empty($httpResponseCode)) {
$httpResponseCode = "501"; // log for debugging
}
@header("HTTP/1.1 $httpResponseCode Shit happens", TRUE, intval($httpResponseCode));
logHeaderErr(error_get_last());

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).


All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.




URI canonicalization with an X-Canonical-URI HTTP header

Dear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I’m desperately in need of a good reason to praise your wisdom and whatnot. From this year’s x-mas wish list:

All search engines obey the X-Canonical-URI HTTP header

The rel=canonical link element is a great tool, at least if applied properly, but sometimes it’s a royal pain in the ass.

Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there’s no way to get a hand on the source code to change them.

Also, lots of resources can’t be stuffed with HTML’s link elements, for example dynamically created PDFs, plain text files, or images.

It’s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren’t a doable option, for example when the destination URI is #5 in a redirect chain that can’t get shortened because the redirects are performed by a 3rd party that doesn’t cooperate.

This one, on the other hand, is elegant and scalable:

if (messedUp($_SERVER["REQUEST_URI"])) {
header("X-Canonical-URI: $canonicalUri");
}

Or:
header("Link: <http://example.com/canonical-uri/>; rel=canonical");

Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.

By the way, having all URI canonicalization routines in one piece of code is way more transparent, and way better manageable, than a bazillion of isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.

Dear search engines, if you make that happen, I promise that I don’t tweet your products with a “#crap” hashtag for the whole rest of this year. Deal?

And yes, I know I’m somewhat late, two days before x-mas, but you’ve got smart developers, haven’t you? So please, go get your ‘code monkeys’ to work and surprise me. Thanks.




The anatomy of a deceptive Tweet spamming Google Real-Time Search

Minutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I’d be the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

  • 1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
  • 1 landing page that makes the punter pull out the plastic in no time.
  • 1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
  • 1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and basic coding skills.

Preparation

Create a new text file and name it hot-topic.php or so. Then code:
<?php
$landingPageUri = "http://affiliate-program.com/?your-aff-id";
$trustedPageUri = "http://google.com/something.py";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
header("HTTP/1.1 307 Here you go today", TRUE, 307);
header("Location: $trustedPageUri");
}
else {
header("HTTP/1.1 301 Happy shopping", TRUE, 301);
header("Location: $landingPageUri");
}
exit;
?>

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

  • Login to Twitter and submit lots of nicely crafted, not too much keyword stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girl friend’s romantic moments’.
  • On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
  • Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
  • Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!




The perfect robots.txt for News Corp

I appreciate Google’s brand new News User Agent. It is, however, not a perfect solution, because it doesn’t distinguish indexing and crawling.

Disallow is a crawler directive that simply tells web robots “do not fetch my content”. It doesn’t prevent contents from indexing. That means search engines can index content they’re not allowed to fetch from the source, and send free traffic to disallow’ed URIs. In case of news, there are enough 3rd party signals (links, anchor text, quotes, …) out there to create a neat title and snippet on the SERPs.

Fortunately, Google’s REP implementation allows news sites to refine the suggested robots.txt syntax below. Google supports noindex in robots.txt.

Below I’ve edited the robots.txt syntax suggested by Google (source).

Include pages in Google web search, but not in News:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-News
Disallow: /
Noindex: /

This robots.txt file says that no files are disallowed from Google’s general web crawler, called Googlebot, but the user agent “Googlebot-News” is blocked from all files on the website. The “Noindex” directive makes sure that Google News cannot use forbidden stuff indexed from 3rd party signals.

User-agent: Googlebot
Disallow: /
Noindex: /

User-agent: Googlebot-News
Disallow:

When parsing a robots.txt file, Google obeys the most specific directive. The first two lines tell us that Googlebot (the user agent for Google’s web index) is blocked from crawling any pages from the site. The next directive, which applies to the more specific user agent for Google News, overrides the blocking of Googlebot and gives permission for Google News to crawl pages from the website. The “Noindex” directive makes sure that Google Web Search cannot use forbidden stuff indexed from 3rd party signals.

Of course other search engines might handle this differently. So it is obviously a good idea to add indexer directives on page level, too. The most elegant way to do that is a noindex,noarchive,nosnippet X-Robots-Tag in the HTTP header, because images, videos, PDFs etc. can’t be stuffed with HTML’s META elements.
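For example, a script that streams a PDF could send the indexer directives before any output (a sketch; the file path is made up):

header("X-Robots-Tag: noindex,noarchive,nosnippet", TRUE);
header("Content-type: application/pdf", TRUE);
readfile("/var/www/docs/whitepaper.pdf"); /* hypothetical example file */
exit;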

See how this works neatly with Web standards? There’s no need for ACrAP!




Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed
&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

Everything after the question mark is clutter added by Google; the canonical URI is the part before it.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source utm_medium utm_campaign utm_content utm_term) the amount of cluttered URIs pointing to the same piece of content can sum up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo didn’t index GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurish extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I’d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, respectively now acting counterproductive with regard to GA reporting. Google’s UTM clutter has impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds -I really mean that, literally!-, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion of webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.

Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.
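Such a redirect script could look roughly like this (my sketch of the idea only, not anything Google offers; logFeedClick() and the parameter names are made up, and in real life you'd validate the destination before redirecting):

/* /fbtrack.php?u=<canonical URI>&src=...&med=...&camp=... */
$destination = $_GET["u"]; /* validate this against your own URI space before redirecting */
logFeedClick($destination, $_GET["src"], $_GET["med"], $_GET["camp"],
 $_SERVER["REMOTE_ADDR"], time()); /* hypothetical logging helper */
header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
header("Location: $destination");
exit;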

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string.

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.
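A minimal, untested sketch of such a cleanup routine (my own take on it, not the actual tool's code):

function utmKiller($uri) {
 /* normalize multiple '?' delimiters, then drop all utm_ parameters */
 $pos = strpos($uri, "?");
 if ($pos === FALSE) return $uri;
 $base = substr($uri, 0, $pos);
 $query = str_replace("?", "&", substr($uri, $pos + 1));
 parse_str($query, $vars);
 $clean = array();
 foreach ($vars as $var => $value) {
  if (stripos($var, "utm_") !== 0) {
   $clean[$var] = $value;
  }
 }
 return count($clean) ? $base ."?" .http_build_query($clean) : $base;
}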

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!




How to borrow relevance from authority pages with 307 redirects

Every once in a while I switch to Dr Evil mode. That’s a “do more evil” type of pamphlet. Don’t bother reading the disclaimer, just spam away …

Why the heck should you invest valuable time into crafting out compelling content, when there’s a shortcut?

There are so many awesome Web pages out there, just pick some and steal their content. You say “duplicate content issues”, I say “don’t worry”. You say “copyright violation”, I say “be happy”. Below I explain the setup.

This somewhat shady IM technique is for you when you’re shy of automated content generation.

Register a new (short!) domain and create a tiny site with a few pages of totally unique and somewhat interesting content. Write opinion pieces, academic papers or whatnot, just don’t use content generators or anything that cannot pass a human bullshit detector. No advertising. No questionable links. Instead, link out to authority pages. No SEO stuff like nofollow’ed links to imprints or so.

Launch with a few links from clean pages. Every now and then drop a deep link in relevant discussions on forums or social media sites. Let the search engines become familiar with your site. That’ll attract even a few natural inbound links, at least if your content is linkworthy.

Use Google’s Webmaster Console (GWC) to monitor your progress. Once all URIs from your sitemap are indexed and show in [site:yourwebspam.com] searches, begin to expand your site’s menu and change outgoing links to authority pages embedded in your content.

Create short URIs (LE 20 characters!) that point to authority pages. Serve search engine crawlers a 307, and human surfers a 301 redirect. Build deep links to those URIs, for example in tweets. Once you’ve gathered 1,000+ inbounds, you’ll receive SERP traffic. By the way, don’t buy the sandbox myths.

Watch the keywords page in your GWC account. It gets populated with keywords that appear only in content of pages you’ve hijacked with redirects. Watch your [site:yourwebspam.com] SERPs. Usually the top 10 keywords listed in the GWC report will originate from pages listed on the first [site:yourwebspam.com] SERPs, provided you’ve hijacked awesome content.

Add (new) keywords from pages that appear both in redirect destinations listed within the first 20 [site:yourwebspam.com] search results, as well as in the first 20 listed keywords, to articles you actually serve on your domain.

Detect SERP referrers (human surfers who’ve clicked your URIs on search result pages) and redirect those to sales pitches. That goes for content pages as well as for redirecting URIs (mimicking shortened URIs). Laugh all the way to the bank.

Search engines rarely will discover your scam. Of course shit happens, though. Once the domain is burned, just block crawlers, redirect everything else to your sponsors, and let the domain expire.

Disclaimer: Google has put an end to most 307 spam tactics. That’s why I’m publishing all this crap. Because watching decreasing traffic to spammy sites is frustrating. Deceptive 307’ing URIs won’t rank any more. Slowly, actually very slowly, GWC reports follow suit.

What can we learn? Do not believe in the truth of search engine reports. Just because Google’s webmaster console tells you that Google thinks a keyword is highly relevant to your site, that doesn’t mean you’ll rank for it on their SERPs. Most probably GWC is not the average search engine spammer’s tool of the trade.




Handle your (UGC) feeds with care!

When you run a website that deals with user generated content (UGC), this pamphlet is for you. #bad_news Otherwise you might enjoy it. #malicious_joy


Not that the recent -and ongoing- paradigm shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I’ll blather about pitfalls, leaving chances unmentioned.

Background

For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.

More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling http://twitter.com/your-user-name or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.

Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.

Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That’s way cheaper, and faster, than polling a gazillion of sources for updates 24/7/365.

Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.

Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).

Traditional discovery crawling, indexing, and ranking doesn’t exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion of computers to this task. Also, submission based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML-sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.

Nothing is like it used to be. Change happens.

Why does this paradigm shift put your site at risk?

As a matter of fact, when you publish user generated content, you will get spammed. Of course, that’s bad news of yesterday. Probably you’re confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot it’s spam that slipped through your filters.

Bad news is, your medieval palisade won’t protect you from a 21st century tank attack with air support. Why not? Because you’ve secured your HTML presentation layer, but not your feeds. There’s no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.

Feed items come with a high risk. Once they’re released, they’re immortal and multiply themselves like rabbits. That’s bad enough in case a pissed employee ‘accidentally’ publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, and be it only for a few milliseconds.

If your content management system (CMS) creates a feed item on submission, search engines -in good company with legions of Web services- will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.

Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed reader flooded with PPC spam), and search engines as well (you send them weird spam signals that rise all sorts of red flags). Also, it’s not desirable to make social media services -that you rely on for marketing purposes- too suspicious (trigger happy anti-spam algos might lock away your site’s base URI in an escape-proof dungeon).

So what can you do to prevent your feeds from unwanted content?

Before I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might not have considered yet:

  • No nay never use integers as IDs before you’re dead sure that a piece of submitted content is floral white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren’t predictable and therefore not that vulnerable. Once a content submission is finally approved, you can donate it a nice -maybe even meaningful- URI.
  • No nay never use titles, subjects or so in URIs, not even converted text from submissions (e.g. ‘My PPC spam’ ==> ‘my_ppc_spam’). Why not? See above. And you don’t really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can’t fetch, or which they can’t refetch, at least for a while.
  • Before the final approval, serve submitted content with a “noindex,nofollow,noarchive,nosnippet” X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section (a minimal sketch follows after this list). Don’t rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide that they should crawl the link’s destination anyway.
  • Consider serving social media bots requesting a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don’t obey REP directives, that means they fetch and process your stuff regardless whether you yell “noindex” at them or not.
  • For all burned (disapproved) URIs that were in use ensure that your server returns a 410-Gone HTTP status code, respectively perform a 301 redirect to a policy page or so to rescue link love that would get wasted otherwise.
  • Your Web forms for content submissions should be totally AJAX’ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk throughs, for example by asking for something unusual when a step gets completed in a too short period of time. With AJAX calls that’s painless for the legit user. Do not accept content submissions via standard GET or POST requests.
  • Serve link builders coming from SERPs for [URL|story|link submit|submission your site’s topic] etc. your policy page, not the actual Web form.
  • There’s more. With the above said, I’ve just begun to scrape the surface of a savvy spammer’s technical portfolio. There’s next to nothing a clever programmed bot can’t mimick. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.
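Here's a minimal, untested sketch of the robots directives for a not-yet-approved submission, plus one way to generate a non-guessable identifier (the uuid() helper below is just an illustration, not a standards-compliant UUID generator):

/* while a submission sits in the probation queue */
header("X-Robots-Tag: noindex,nofollow,noarchive,nosnippet", TRUE);
/* and in the HEAD section of the page rendering the submission: */
echo '<meta name="robots" content="noindex,nofollow,noarchive,nosnippet" />';

/* a UUID-ish identifier instead of a guessable integer sequence */
function uuid() {
 return sprintf('%04x%04x-%04x-%04x-%04x-%04x%04x%04x',
  mt_rand(0, 0xffff), mt_rand(0, 0xffff), mt_rand(0, 0xffff), mt_rand(0, 0xffff),
  mt_rand(0, 0xffff), mt_rand(0, 0xffff), mt_rand(0, 0xffff), mt_rand(0, 0xffff));
}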

Having said that, let’s proceed to feed protection tactics. Actually, there’s just one principle set in stone:

Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!

The interesting question is: what the heck is a “final approval”? Well, that depends on your needs. Ideally, that’s you releasing each and every piece of submitted content. Since this approach doesn’t scale, think of ways to semi-automate the process. Don’t fully automate it, there’s no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.

Each and every content submission must survive a probation period, during which it will not be included in your site’s feeds. Regardless who contributed it. Stick with the four-eye principle. Here are a few generic procedures you could adapt, respectively ideas which could inspire you:

  • Queue submissions. Possible queues are Blocked, Quarantaine, Suspect, Probation, and finally Released. Define simple rules and procedures that anyone involved can follow. SOPs lack workarounds and loopholes by design.
  • Stick content submissions from new users who didn’t participate in other ways in quarantaine. Moderate this queue and only manually release into the probation queue what passes the moderator’s heuristics. Signup-submit-and-forget is a typical spammer behavior.
  • Maintain black lists of domain names, IPs, countries, user agent names, unwanted buzzwords and so on. Use filters to arrest submissions that contain keywords you wouldn’t expect to match your site’s theme in the Blocked or Quarantaine queue.
  • On submission fetch the link’s content and analyze it, don’t stick with heuristic checks of URIs, titles and descriptions. Don’t use methods like PHP’s file_get_contents that don’t return HTTP response codes (see the cURL sketch after this list). You need to know whether a requested URI is the first one of a redirect chain, for example. Double check with a second request from another IP, preferably owned by a widely used ISP, with a standard browser’s user agent string, that provides an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission’s suggested anchor text. If the returned content differs too much, set a red flag.
  • Maintain white lists, too. That’s a great way to reduce the amount of unavoidable false positives.
  • If you have editorial staff or moderators, they should get a Release to Feed button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like “release to feed if mod-release = true and num-trusted-votes > 10”.
  • Categorize your user’s reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.
  • Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.
  • With all automated releases, for example based on votings, oops, especially based on votings, implement at least one additional sanity check. For example discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votings, and so on.
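For the on-submission fetch mentioned above, here's a minimal cURL-based sketch (untested; fetchUri() is my own helper name) that returns the HTTP response code and the final URI of a redirect chain, which file_get_contents won't give you:

function fetchUri($uri, $userAgent, $referrer = "") {
 $ch = curl_init($uri);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
 curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
 curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
 if (!empty($referrer)) {
  curl_setopt($ch, CURLOPT_REFERER, $referrer);
 }
 $body = curl_exec($ch);
 $result = array(
  "httpCode" => curl_getinfo($ch, CURLINFO_HTTP_CODE),
  "finalUri" => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), /* compare to $uri to spot redirect chains */
  "body"     => $body
 );
 curl_close($ch);
 return $result;
}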

Disclaimer: That’s just some food for thoughts. I want to make absolutely clear that I can’t provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.




The most sexy browsers screw your analytics

Now that IE is quite unusable due to the lack of websites that support its non-standard rendering, and the current Firefox version suffers from various maladies, more and more users switch to browsers that are supposed to comply with Web standards, such as Chrome, Safari, or Opera.

Those sexy user agents execute client sided scripts in lightning speed, making surfers addicted to nifty rounded corners very very happy. Of course they come with massive memory leaks, but surfers who shut down their browser every once in a while won’t notice such geeky details.

Why is that bad news for Internet marketers? Because Chrome and Safari screw your analytics. Your stats are useless with regard to bookmarkers and type-in traffic. Your referrer stats lack all hits from Chrome/Safari users who have opened your landing page in a new tab or window.

Google’s Chrome and Apple’s Safari do not provide an HTTP_REFERER. (The typo is standardized, too.)

This bug was reported in September 2008. It’s not yet fixed. Not even in beta versions.

Guess from which (optional) HTTP header line your preferred stats tool compiles the search terms to create all the cool keyword statistics? Yup, that’s the HTTP_REFERER’s query string when the visitor came from a search result page (SERP). Especially on SERPs many users open links in new tabs. That means with every searcher switching to a sexy browser your keyword analysis becomes more useless.

That’s not only an analytics issue. Many sites provide sensible functionality based on the referrer (the Web page a user came from), for example default search terms for site-search facilities gathered from SERP-referrers. Many sites evaluate the HTTP_REFERER to prevent themselves from hotlinking, so their users can’t view the content they’ve paid for when they open a link in a new tab or window.

Passing a blank HTTP_REFERER when this information is available to the user agent is plain evil. Of course lots of so-called Internet security apps do this by default, but just because others do evil that doesn’t mean a top-notch Web browser like Safari or Chrome can get away with crap like this for months and years to come.

Please nudge the developers!

Here you go. Post in this thread why you want them to fix this bug asap. Tell the developers that you can’t live with screwed analytics, and that your site’s users rely on reliable HTTP_REFERERs. Even if you don’t run a website yourself, tell them that your favorite porn site bothers you with countless error messages instead of delivering smut, just because WebKit browsers are buggy.


You can test whether your browser passes the HTTP_REFERER or not: Go to this Google SERP. On the link to this post choose “Open link in new tab” (or window) in the context menu (right click over the link). Scroll down.





As if sloppy social media users ain’t bad enough … search engines support traffic theft

Prepare for a dose of techy tin foil hattery. [Skip rant] Again, I’m going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it’s URI, not URL).

Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential visitors and customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of tinyurl.com’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There’re still way too many playgrounds for deceptive tactics involving shortened URIs.

How should a search engine handle a shortened URI?

Handling an URI as a shortened URL requires a bullet-proof method to detect shortened URIs. That’s a breeze.

  • Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary: most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads in the top frame, and I’ve even seen a few of them making use of meta refreshes and client-side redirects. Search engines can detect all those procedures.
  • Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
  • Seed: trusted sources like LongURL.org provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like su.pr …).
  • Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
  • URI length: the length of a URI string, if it’s 20 characters or less, is an indicator at most, because some URI shortening services offer keyword-rich short URIs, and many sites serve natural URIs that are just as short.

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)
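Such a lookup is trivial at the HTTP level. Here’s a minimal sketch, assuming PHP with cURL; resolveShortUri() is a made-up helper, and it catches only HTTP redirects, whereas frames and meta refreshes would require fetching and parsing the markup:

/* Minimal sketch: follow a short URI through its complete chain of HTTP
   redirects (301/302/307) and report the final destination. */
function resolveShortUri($uri, $maxHops = 10) {
    $ch = curl_init($uri);
    curl_setopt($ch, CURLOPT_NOBODY, TRUE);          /* a HEAD request is enough */
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  /* follow every hop */
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxHops);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_exec($ch);
    $result = array(
        "destination" => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        "hops"        => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT)
    );
    curl_close($ch);
    return $result;
}
/* Example: resolveShortUri("http://tinyurl.com/example") might report
   one or more hops plus the real destination page (hypothetical URI). */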

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless of the method of redirection. All the link juice belongs to the destination page.

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Oh what a mess. The candidate from Redmond fails totally at understanding the HTTP protocol. Their search index is flooded with a bazillion URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. You’ll also find URIs in their index that do a permanent redirect and have nothing to do with URI shortening.

I can’t be bothered with checking what Bing does in response to other redirects, since they fail the 301 test so badly. Clicking on their first results for [site:tinyurl.com], I noticed that many lead to mailto://working-email-addy type destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI SERPs, please read on under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo has done a better job. They index short URIs and show the destination page, at least via their Site Explorer. When I search for a tinyURL, the SERP link points to the URI shortener; that could be improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short URIs totally right (I will not elaborate on this issue, so please don’t ask for detailed information if you’re not a search engine engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. That can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvement

As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), it’s sad. At least Google didn’t fail the 301 test.

As for the somewhat evil tactics I’ve applied in my tests so far, Google fell in love with some abusive short URIs. Under particular circumstances Google indexes shortened URIs that game Googlebot, and has sent SERP traffic to sneakily shortened URIs (which confront the searcher with huge ads) instead of the destination page. Since I began to deploy sneaky sURLs, Google has greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

A long time ago Tim Berners-Lee told you that fucking with URIs is a very bad habit (in other words: URI shorteners are evil). Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped in a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us get rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive these utterly useless scumbags of their livelihood.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters, tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. (and that addresses all services that make use of or transport shortened URIs): put an end to shortened URIs. Now!




Derek Powazek outed himself big-mouthed and ignorant, and why that’s a pity

With childish attacks on his colleagues, Derek Powazek didn’t do professional Web development, as an industry, a favor. As a matter of fact, Derek Powazek insulted savvy Web developers, Web designers, even search engine staff, as well as usability experts and search engine specialists, who team up in countless projects helping small and large Web sites succeed.

I seriously can’t understand how Derek Powazek “has survived 13 years in the web biz” (source) without detailed knowledge of how things get done in Web projects. I mean, if a developer really has worked 13 years in the Web biz, he should know that the task of optimizing a Web site’s findability, crawlability, and accessibility for all user agents out there (SEO) is usually not performed by “spammers, evildoers and opportunists”, but by highly professional experts who just master Web development better than the average designer, copy-writer, publisher, developer or marketing guy.

Boy, what an ego. Derek Powazek truly believes that if “[all SEO/SEM techniques are] not obvious to you, and you make websites, you need to get informed” (source). That translates to “if you aren’t 100% perfect in all aspects of Web development and Internet marketing, don’t bother making Web sites — go get a life”.

Well, I consider very few folks capable of mastering everything in Web development and Internet marketing. Clearly, Derek Powazek is not a member of this elite. With one clueless, uninformed and way too offensive rant he has ruined his reputation in a single day. Shortly after his first thoughtless blog post libelling fellow developers and consultants, Google’s search result page for [Derek Powazek] got flooded with reasonable reactions revealing that Derek Powazek’s pathetic calls for ego food are factually wrong.

Of course calm and knowledgeable experts in the field setting the record straight, like Danny Sullivan (search results #1 and #4 for [Derek Powazek] today) and Peter da Vanzo (SERP position #9), can outrank a widely unknown guy like Derek Powazek at all major search engines. Now, for the rest of his presence on this planet, Derek Powazek has to live with search results that tell the world what kind of an “expert” he really is (example).

He should have read Susan Moskwa’s very informative article about reputation management on Google’s Blog a day earlier. Not that reputation management doesn’t count as an SEO skill … actually, that’s SEO basics (as well as URI canonicalization).

Dear Derek Powazek, guess what all the bright folks you’ve bashed so cleverly will do when you ask them to take down their responses to your uncalled-for dirty talk?

So what can we learn from this gratuitous debacle? Do not piss in someone’s roses when

  • you suffer from an oversized ego,
  • you’ve not the slightest clue what you’re talking about,
  • you can’t make a point with proven facts, so you have to use false pretences and clueless assumptions,
  • you tend to insult people when you’re out of valid arguments,
  • willy whacking is not for you, because your dick is, well, somewhat undersized.

Ok, it’s Friday evening, so I’m supposed to enjoy TGIF’s. Why the fuck am I wasting my valuable spare time writing this pamphlet? Here’s why:

Having worked in, led, and coached WebDev teams on crawlability and best practices with regard to search engine crawling and indexing for ages now, I have been faced with brain-amputated wannabe geniuses more than once. Such assclowns are able to shipwreck great projects. From my experience, the one and only way to keep teams sane and productive is sacking troublemakers the moment you realize they’re unconvinceable. This Powazek dude has perfectly proven that his ignorance is persistent, and that his anti-social attitude is irreversible. He’s the prime example of a guy I’d never hire (unless I were working for my worst enemy). Go figure.


Update 2009-10-19: I consider this a lame excuse. Actually, it’s even more pathetic than the malicious slamming of many good folks in his previous posts. If Derek Powazek really didn’t know what “SEO” means in the first place, his brain farts attacking something he didn’t understand at the time of publishing his rants are indefensible, provided he was anything south of sane then. Danny Sullivan doesn’t agree, and he’s right when he says that every industry has some black sheep, but as much as I dislike comment spammers, I dislike bullshit and baseness.



