Hard facts about URI spam
I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:
http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster
I’ve bolded the canonical URI (the part before the question mark); everything after the question mark is clutter added by Google.
When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).
Why is it bad?
FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source, utm_medium, utm_campaign, utm_content, utm_term), the number of cluttered URIs pointing to the same piece of content can add up to dozens or more.
FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs, mostly originating from copy+paste links. Bing and Yahoo haven’t indexed GA tracking parameters yet.
That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and pointed to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).
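To put a rough number on the first fact above, here is a minimal Python sketch. The canonical URI and all parameter values are made up for illustration; only the utm_ parameter names come from Google’s tracking scheme. Three sources, two media and four campaigns already give 24 different URIs for one and the same page.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical canonical URI and hypothetical tracking values.
canonical = "http://www.example.com/post.html"
values = {
    "utm_source": ["feedburner", "twitter", "newsletter"],
    "utm_medium": ["feed", "email"],
    "utm_campaign": ["spring", "summer", "autumn", "winter"],
}

# One cluttered URI per combination of values; all point to the same content.
variants = [
    canonical + "?" + urlencode(dict(zip(values, combo)))
    for combo in product(*values.values())
]

print(len(variants))  # 24 (3 * 2 * 4)
```

Adding utm_content and utm_term multiplies the count again by the number of values each of them can take.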
Maybe Google can handle this growing duplicate content chaos in their very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.
So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.
Spamming URIs with utm tracking variables breaks lots of things
Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies
Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of trouble at the landing page.
Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurishly extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.
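Here is a small Python sketch of what happens to the example URI above. The first parse uses the standard library, which starts the query at the first “?”. The function query_from_last_question_mark is a hypothetical stand-in for the kind of script described above, not code from any real product.

```python
from urllib.parse import urlsplit, parse_qs

uri = ("http://www.example.com/search.aspx?Query=musical+mobile"
       "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies")

# Standard parsing: the query starts at the FIRST "?" and pairs split on "&".
standard = parse_qs(urlsplit(uri).query)

def query_from_last_question_mark(uri):
    """Hypothetical naive script: read GET input from the LAST "?" onwards."""
    without_fragment = uri.split("#", 1)[0]
    if "?" not in without_fragment:
        return {}
    return parse_qs(without_fragment.rpartition("?")[2])

naive = query_from_last_question_mark(uri)

print(standard["Query"])         # ['musical mobile?utm_source=Referral']
print("utm_source" in standard)  # False
print("Query" in naive)          # False
```

Either way something gets lost: the standard parse pollutes the search term and drops utm_source, while the naive script never sees the search term at all.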
In some cases the page the user lands on will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.
Regardless of whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.
Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I dug a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.
Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, or rather now counterproductive with regard to GA reporting. Google’s UTM clutter has an impact on lots of tools that make sense in addition to Google Analytics. All broken.
What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds (I really mean that, literally!), and they came up with half-assed crap like that? Un-fucking-believable.
What can I do to avoid URI spam on my site?
Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.
If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!
As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion webmasters and developers should do Google’s job, again and again.
What can Google do?
Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.
Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.
Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.
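The redirect idea can be sketched as a tiny WSGI application in Python. This is an illustration under loud assumptions, not a description of how Feedburner works: the item parameter, the FEED_ITEMS lookup table and the in-memory CLICK_LOG are all hypothetical, and handing the stored data over to the analytics script on the landing page is left out.

```python
import time
from urllib.parse import parse_qs

# Hypothetical lookup table: feed item id -> canonical URI plus the tracking
# values that are already known when the feed is served.
FEED_ITEMS = {
    "42": {
        "uri": "http://www.example.com/post.html",
        "utm_source": "feedburner",
        "utm_medium": "feed",
    },
}

CLICK_LOG = []  # stand-in for a real data store


def redirect_app(environ, start_response):
    """Record the click server-side, then 301 to the untouched canonical URI."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    item_id = params.get("item", [""])[0]
    item = FEED_ITEMS.get(item_id)
    if item is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Unknown feed item\n"]

    # Store timestamp, IP and tracking values instead of appending them to the URI.
    CLICK_LOG.append({
        "timestamp": time.time(),
        "ip": environ.get("REMOTE_ADDR", ""),
        "item": item_id,
        "utm_source": item["utm_source"],
        "utm_medium": item["utm_medium"],
    })
    start_response("301 Moved Permanently", [("Location", item["uri"])])
    return [b""]
```

The point of the design is that the tracking values live in the redirect script’s store, so the URI that ends up in the browser’s address bar, and in everybody’s copy+paste buffer, is the canonical one.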
Speak out!
So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:
I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?
Update:
I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.
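For readers who prefer to clean URIs themselves, here is a minimal Python sketch of the same idea. It is not the tool’s actual code. The function name clean_uri is made up, and the sketch assumes that every extra “?” in the query string is a misplaced delimiter, which would be wrong for a site that legitimately uses a literal “?” inside a parameter value.

```python
def clean_uri(uri):
    """Drop all utm_ parameters and treat any extra "?" as a "&" delimiter."""
    uri, hash_mark, fragment = uri.partition("#")
    base, _, query = uri.partition("?")
    pairs = [
        pair
        for pair in query.replace("?", "&").split("&")
        if pair and not pair.split("=", 1)[0].lower().startswith("utm_")
    ]
    cleaned = base + ("?" + "&".join(pairs) if pairs else "")
    return cleaned + hash_mark + fragment


print(clean_uri(
    "http://www.example.com/search.aspx?Query=musical+mobile"
    "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies"
))
# http://www.example.com/search.aspx?Query=musical+mobile
```

The remaining parameters are passed through untouched, so their order and encoding stay exactly as the site owner designed them.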
By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!
19 comments Sebastian | Search Quality, Duplicate Content, Analytics, Internet Marketing, Webspam, Spam, SEO, Crap, Copy+Paste-Penalties, AdSense, Google
Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential
Oh what a mess. The candidate from Redmond fails totally on understanding the
As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.
Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us get rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.
When you’re familiar with my various rants on the ever morphing
I couldn’t care less about PageRank™ sculpting, because a well thought out link architecture does the job with all search engines, not just Google. That’s where Google is right on the money.
What really matters is
Matt Cutts asks us
A while ago I’ve staged a public
Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example 


After the great
Recently
Folks try all sorts of naughty things when by accident a blog’s feed outranks the HTML version of a post. Usually that happens to less popular blogs, or with very old posts and categorized feeds that contain ancient articles.
It seems MSN/LiveSearch has tweaked their