Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. Here’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

I’ve bolded the canonical URI; everything after the question mark is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method of tracking traffic from feeds creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source, utm_medium, utm_campaign, utm_content, utm_term), the number of cluttered URIs pointing to the same piece of content can add up to dozens or more.
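
To see how fast that multiplies, here’s a quick Python sketch (the parameter values are made up; the arithmetic is the point):

    from itertools import product
    from urllib.parse import quote_plus

    # Made-up tracking values; even a handful multiplies one canonical
    # URI into dozens of distinct, indexable variants.
    canonical = "http://example.com/some-post"
    sources   = ["feedburner", "twitter", "newsletter"]
    mediums   = ["feed", "email", "social"]
    campaigns = ["Official Blog Feed", "launch", "weekly-digest"]

    variants = [
        f"{canonical}?utm_source={quote_plus(s)}"
        f"&utm_medium={quote_plus(m)}&utm_campaign={quote_plus(c)}"
        for s, m, c in product(sources, mediums, campaigns)
    ]
    print(len(variants))  # 3 * 3 * 3 = 27 URIs for one piece of content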

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo haven’t indexed GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and pointed to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in its very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps and rel=canonical, and built countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google appended a query string to a URI that already had one. Two query string delimiters (“?”) in one URI can cause all sorts of trouble at the landing page.

Some scripts will process only the variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or the end of the URI; some scripts expecting input variables in a particular sequence will at least be confused; some scripts might even use the same variable names … the number of possible errors caused by amateurishly extended query strings is infinite, even if there’s only one “?” delimiter in the URI.
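
Here’s a minimal sketch of that failure mode, using a hypothetical naive parser (not any particular product’s code):

    from urllib.parse import parse_qs

    uri = ("http://www.example.com/search.aspx?Query=musical+mobile"
           "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies")

    # A script that takes everything after the *last* "?" as the query
    # string loses the original Query parameter entirely:
    print(parse_qs(uri.rsplit("?", 1)[1]))
    # {'utm_source': ['Referral'], 'utm_medium': ['Internet'],
    #  'utm_campaign': ['celebritybabies']}

    # A script that splits on the *first* "?" sees a polluted value instead:
    print(parse_qs(uri.split("?", 1)[1]))
    # {'Query': ['musical mobile?utm_source=Referral'],
    #  'utm_medium': ['Internet'], 'utm_campaign': ['celebritybabies']}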

In some cases the page the user is faced with will lack the expected content, display a prominent error message like a 404, or consist of nothing but white space because the underlying script failed so badly that the Web server couldn’t even serve a 5xx error.

Regardless of whether a landing page can handle query string parameters added to the original URI (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI carries a link to the blog, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I dug a little deeper, I could most probably compile a huge list of other functionality on the Internet that is broken by Google’s UTM clutter.
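
A tiny sketch of why such a check can fail, assuming (my simplification) that the plug-in compares URIs by strict string equality:

    # The blog knows its canonical permalink; the linking page carries the
    # cluttered variant. A strict comparison treats them as different URIs,
    # so the trackback gets rejected.
    permalink = "http://example.com/blog/my-post"
    link_on_source_page = permalink + "?utm_source=feedburner&utm_medium=feed"

    print(link_on_source_page == permalink)                   # False -> trackback deleted
    print(link_on_source_page.split("?", 1)[0] == permalink)  # True once the clutter is stripped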

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for link-pop analysis that group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, or rather now acting counterproductively with regard to GA reporting. Google’s UTM clutter has an impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds (I really mean that, literally!), and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to linking feed traffic data to Web analytics. Go to Feedburner. For each of your feeds, click “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you have deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.
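
A minimal sketch of what such a redirect endpoint could look like (my guess at the mechanics; the “url” parameter name and the logging stand-in are assumptions, not Feedburner internals):

    import time
    from urllib.parse import parse_qs

    # Minimal WSGI app: log the tracking data server-side, then 301 to the
    # clean canonical URI. The "url" parameter name and print() as a
    # datastore stand-in are assumptions for illustration.
    def redirect_app(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        canonical = params.get("url", ["/"])[0]
        record = {
            "timestamp": time.time(),
            "ip": environ.get("REMOTE_ADDR"),
            "utm_source": params.get("utm_source", [""])[0],
            "utm_medium": params.get("utm_medium", [""])[0],
            "utm_campaign": params.get("utm_campaign", [""])[0],
        }
        print(record)  # a real implementation would store this somewhere
        start_response("301 Moved Permanently", [("Location", canonical)])
        return [b""]

    # For local testing:
    # from wsgiref.simple_server import make_server
    # make_server("", 8000, redirect_app).serve_forever()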

Perhaps this procedure would be even more accurate, because link drops could no longer mimic feed traffic.

Speak out!

So, if you don’t approve of Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang-raping your well-designed URIs, then link out to everything Google with a descriptive query string, like:
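
For example, something along these lines (a made-up URI; invent your own values):

http://www.google.com/search?q=uri+spam&utm_source=sebastians-pamphlets.com&utm_medium=blog&utm_campaign=no-uri-spam-please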

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update: 2010 SEMMY Nominee

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI (all ‘utm_’ crap and multiple ‘?’ delimiters removed) in return. That’ll help when you copy URIs from your feed reader to use them in your blog posts.
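
The gist of such a cleaner, as a Python sketch (assuming it just drops every ‘utm_’ parameter and merges stray ‘?’ delimiters; the actual tool may do more):

    from urllib.parse import parse_qsl, urlencode

    def clean_uri(uri: str) -> str:
        """Strip utm_* parameters and repair URIs with more than one '?'."""
        if "?" not in uri:
            return uri
        base, _, query = uri.partition("?")
        query = query.replace("?", "&")  # merge stray "?" delimiters
        kept = [(k, v) for k, v in parse_qsl(query) if not k.startswith("utm_")]
        return base + ("?" + urlencode(kept) if kept else "")

    print(clean_uri(
        "http://www.example.com/search.aspx?Query=musical+mobile"
        "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies"
    ))
    # -> http://www.example.com/search.aspx?Query=musical+mobile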

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!




19 Comments to "Hard facts about URI spam"

  1. Yura on 1 December, 2009  #link

    Heh, it was actually me who started the thread at the Cre8asite Forums. Didn’t know it got to the SER.

    You can simply use # instead of the first ?, as John said, and get by that way instead of removing URL tracking. And I am going to use it in a situation with possibly limited indexing (whitepaper distribution).

    Practically, though, I don’t intend to use URL tracking for indexable content (though it’d be a stellar thing to do).

  2. Yura on 1 December, 2009  #link

    Actually, here’s the thread at the Cre8asite Forums to the curious:
    http://www.cre8asiteforums.com/forums/index.php?showtopic=73804

  3. Richard Hearne on 1 December, 2009  #link

    You could also use the Parameter Tool in GWT to ignore all the UTM related params?

    You’re right though - better to prevent than to cure!

  4. Sebastian on 2 December, 2009  #link

    Thanks for the link, Yura. For the methods described and linked over there, what Richard said applies: better to prevent than to cure!

    If Google hadn’t created the whole mess, it wouldn’t be necessary to tweak URIs on arrival, because there wouldn’t be any cluttered URIs in the first place.

    It doesn’t make any sense to apply questionable cures to symptoms, when there’s a chance to heal the disease that causes them.

    I really hope there’s a chance that Google takes this change back and develops a reasonable procedure to track feed traffic.

  5. Hobo on 2 December, 2009  #link

    I’m seeing this all the time too - it’s a pain. My second most crawled page is a garbled cut-and-paste from a feed reader etc. to a page that doesn’t exist, but which interferes with my blog and stats.

    Frustrating!

    Might use that sort of code every time I link to Google from now on lol

  6. […] Hard facts about URI spam, sebastians-pamphlets.com […]

  7. […] Hard facts about URI spam – friend of the Fire Horse, Sebastian X returns this week with yet another geeky tin foil touting post. I’ve been noticing these fecked up URIs over the last while and it’s nice to see someone go off on them… Suhweet! […]

  8. mohsin on 14 December, 2009  #link

    I think Google has taken this matter seriously and has started to improve its URLs. Google has recently launched a URL shortening service for Feedburner and other services as well. Check out:
    http://googleblog.blogspot.com/2009/12/making-urls-shorter-for-google-toolbar.html

  9. Sebastian on 15 December, 2009  #link

    Mohsin, I don’t think Google’s URI shortener has anything to do with UTM tracking variables. It’s just a neat tool to update my Twitter stream.

  10. Phil on 16 December, 2009  #link

    540,000 URLs have been affected by the Feedburner UTM tracking change; see:
    http://www.google.com/search?q=allinurl:”utm_source=feedburner” OR “utm_medium=feed”&num=100&filter=0

    There’s a related post about URI spam here; Omniture, Webtrends, and Yahoo are all guilty of this!
    Google Webmaster Support Forum

  11. Sebastian on 17 December, 2009  #link

    Today Google’s SERP says they’ve indexed roughly 35 million sneakily cluttered URIs:
    http://www.google.com/search?hl=en&q=inurl:utm_source

  12. Phil on 17 December, 2009  #link

    Sebastian - I make that 95 million if you include utm_medium or utm_campaign:
    http://tinyurl.com/utm-source-medium-campaign

    The total number of pages in Google’s index is about 25,470,000,000:
    http://tinyurl.com/googles-index-size

    So… 95,600,000 / 25,470,000,000 ≈ 0.004, i.e. about 0.4% URI clutter.

  13. […] does this come from? Sebastian of Sebastian’s Pamphlets states that it happens when you simultaneously use […]

  14. […] smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all crawlers have […]

  15. Everfluxx on 11 January, 2010  #link

    Sebastian, I agree that the duplicate “?” is an error because it produces malformed URLs: I believe that it should be reported as a bug and Google should fix it.

    On nearly all the rest of your post, I’m sorry, I have to disagree with you.

    I believe it’s your responsibility as a webmaster to make sure that I can’t break your site (and/or screw up your rankings) if I link to you with an appended query string such as this one: you (should) have complete control over which pages of your site get indexed and which won’t, and which URLs should be regarded as canonical and which shouldn’t. There are plenty of tools that you can use for URL prophylaxis (that’s a nice neologism, isn’t it?): from good ole .htaccess to the rel=canonical attribute. The average Joe web developer might not feel 100% confident about the latter and how to use it, I agree, but there’s an abundance of information on the subject to help even novice webmasters, and absolutely no excuse for ignorance in senior web developers.

    Ultimately, I believe it is our role and responsibility as experienced SEO professionals to provide developers with guidelines on how to design web apps so that their rankings won’t fall apart when they get linked to with slightly different URLs than they were designed to handle (I think I just wrote a recursive statement, LOL).

  16. Sebastian on 11 January, 2010  #link

    Oh boy, you know you’re totally wrong.

    Basically, what you’re saying is that I have to wear a flak vest to cover my ass from incoming. So far I agree.

    Where I don’t agree is that I have to cover my ass from friendly fire, when there’s a way to avoid it.

    I didn’t call down fire on my own position. If Google were somewhat responsible, they would have invested the little brain power necessary to come out with a better solution.

    Google created the ugly mess, so Google has to fix it. As long as they don’t, I encourage you to unsubscribe from this crappy “service” (Feedburner-GA integration).

    By the way, here is a canonicalization routine that deals with Google’s UTM crap on arrival. That doesn’t mean it’s the right thing to do; you simply have no better choice. Just because Google launches crap, that doesn’t mean that you and a gazillion other webmasters have to fix it.

  17. Everfluxx on 11 January, 2010  #link

    Oh boy, you know you’re totally wrong.

    I knew you would agree. :D

    Friendly fire and flak vest, nice metaphors. Let’s say I would wear one just for safety. :)

  18. […] Hard facts about URI spam Sebastian X, Sebastian’s Pamphlets | 12/1/09 […]

  19. Alan Levine on 8 March, 2010  #link

    Smells like a job for a Greasemonkey script…
