Archived posts from the 'Google' Category

Ditch the spam on SERPs, pretty please?

Say there’s a search engine that tries very hard to serve relevant results for long-tail search queries. Maybe it even accepts that an algo change –supposed to wipe out shitloads of thin pages from its long-tail search result pages (SERPs)– gets referred to as #MayDay. One should think that this search engine isn’t exactly eager to annoy its users with crappy mash-up pages consisting of shabby stuff scraped from all known sources of duplicate content on the whole InterWebs.

Wrong.

Prominent SE spammers like Mahalo still flood the visible part of search indexes with boatloads of crap that never should be able to cheat its way onto any SERP, not even via a [site:spam.com] search. Learn more from Aaron and Michael, who’ve both invested their valuable time to craft detailed spam reports, to no avail.

Frustrating.

Wait. Why does a bunch of spammy Web pages create such a fuss? Because they’re findable in the search index. Of course a search engine must crawl all the WebSpam out there, and its indexer has to judge the value of all the content it gets fed with. But there’s absolutely no need to bother the query engine, which gathers and ranks the stuff presented on the SERPs, with crap like that.

Dear Google, why do you annoy your users with spam created by “a scheme that your automated system handles quite well” at all? Those awesome spam filters should just flag crappy pages as not-SERP-worthy, so that they can never see the daylight at google.com/search. I mean, why should any searcher be at risk of pulling useless search results from your index? Hopefully not because these misled searchers tend to click on lots of Google ads on said pages, right?

I’d rather enjoy an empty SERP for an exotic search query than suffer from a single link to a useless page plastered with huge ads, even if it comes with a tiny portion of stolen content that might be helpful if it pointed to the source.

Do you feel like me? Speak out!

Hey Google, I dislike spam on your SERPs! #spam-report Tweet Your Plea For Clean SERPs!




Google went belly-up: SERPs sneakily redirect to FPAs

I’m pissed. I do know I shouldn’t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.

What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google’s webmaster guidelines explicitly forbid such black hat tactics:

“Don’t use cloaking or sneaky redirects.” Google just did the latter with its very own SERPs. The search interface google.com/ie, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That’s a helpful redirect for some IE6 users who don’t suffer from an IT department stuck with this outdated browser, but it’s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, UA cloaking is the only way to heal this sneaky behavior.

“Don’t create pages with malicious behavior.” Google’s guilty, too. Instead of checking for the user’s browser –redirecting only IE6 requests from Google’s discontinued IE6 support (IE6 toolbar …) to the IE8 advertisement, whilst all other user agents get their desired search box, respectively their SERPs, under a google.com/search?output=ie&… URI– Google performs an unconditional redirect to a page that’s utterly useless and totally unexpected for many searchers. I consider misleading redirects malicious.
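
Just to illustrate what a non-sneaky handover could look like, here’s a rough sketch (simplistic UA sniffing, for illustration only; the URIs are the ones mentioned in this pamphlet, everything else is made up):

<?php
// Rough sketch of the browser check Google could have used instead of the
// unconditional redirect. Simplistic UA sniffing, not production code.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($ua, 'MSIE 6.') !== false) {
    // stubborn IE6 users may actually appreciate the IE8 pitch
    header('Location: http://www.google.com/toolbar/ie8/sidebar.html', TRUE, 302);
}
else {
    // everybody else gets the search results they asked for
    $q = isset($_GET['q']) ? $_GET['q'] : '';
    header('Location: http://www.google.com/search?output=ie&q=' . urlencode($q), TRUE, 302);
}
exit;
?>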

“Avoid links to web spammers or ‘bad neighborhoods’ on the web.” I consider the propaganda for IE that Google displays instead of the search results I’d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous workarounds. (Ok, ok, ok … Google’s lack of geekiness doesn’t exactly count as a violation of their webmaster guidelines, but it sounds good, doesn’t it?)

Hey Matt Cutts, about time to ban google.com/ie! Click to tweet that

Google’s very best search interface is history. Here is what you got under
http://www.google.com/ie?num=100&hl=en&safe=off&q=minimalistic
:

Google's famous minimalistic search UI

And here is where Google sneakily redirects you to when you load the SERP link above (even with Chrome!):
http://www.google.com/toolbar/ie8/sidebar.html
:

Google's sneaky IE8 propaganda

It’s sad that a browser vendor like Google (and yes, Google Chrome is my favorite browser) feels the need to mislead its users with propaganda for a competing browser that’s slower and doesn’t render everything as it should render it. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that’s a scandal. So, if you agree, please submit a spam report to Google:

Hey Matt Cutts, about time to ban google.com/ie! #spam-report Tweet Your Spam Report

2010-05-17 I’ve updated this pamphlet because it didn’t explain the “sneakiness” clearly enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and serve the ads for a somewhat better browser only to stubborn IE6 users.

2010-05-18 Q: You’re pissed solely because your SERP scraping scripts broke. A: Glad you’ve asked. Yes, I’ve scraped Google’s /ie search too. Not because I’m a privacy nazi like Daniel Brandt. I’ve just checked (my) rankings. However, when I spotted the redirects I didn’t even remember the location of the scripts that scraped this service, because I didn’t look at ranking reports for years. I’m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold “no”. I don’t give a fucking dead rat’s ass what ranking reports based on scraped SERPs could tell.




Get yourself a smart robots.txt

greedy and aggressive web robots steal your content

Crawlers and other Web robots are the plague of today’s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don’t. Behaving or not, most bots just steal your content. You don’t appreciate that, so block them.

This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines –that bring you nice traffic– while keeping the nasty (or useless, traffic-wise) bots out of the game.

The basic idea is that blocking all bots –with very few exceptions– makes more sense than maintaining a kinda who’s-who of Web robots in your robots.txt file. You decide whether a bot, respectively the service it crawls for, does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold Disallow: /.

Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.

Even bots that respect the REP can harm your search engine rankings, or reveal sensitive information to your competitors. Every minute a new bot turns up. You can’t manage all of them, and you can’t trust any (behaving) bot. Or, as the master of bot control explains: “That’s the only thing I’m concerned with: what do I get in return. If it’s nothing, it’s blocked“.

Also, large robots.txt files handling tons of bots are fault-prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If, on the other hand, you verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don’t enlighten your competitors.

If you’re a smart webmaster agreeing with this approach, here’s your ToDo-List:
• Grab the code
• Install
• Customize
• Test
• Implement.
On error read further.

The anatomy of a smart robots.txt

Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.

Install

First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:
<FilesMatch "^robots\.txt$">
SetHandler application/x-httpd-php
</FilesMatch>

As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.

Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!

For testing purposes you can use the logRequest() function. Probably it’s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don’t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.
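
In case you wonder what such a logging routine could look like, here’s a bare-bones sketch (the function bundled with the download may differ; the log file path is the one mentioned above):

<?php
// Bare-bones logRequest() sketch, for testing only. Appends requestor data
// to /smart_robots_log.txt (CHMOD 0777, see above). In production, write
// to a database table instead.
function logRequest($httpStatus = 200) {
    $line = sprintf("%s\t%d\t%s\t%s\t%s\n",
        date('Y-m-d H:i:s'),
        $httpStatus,
        isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-',
        isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-',
        isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '-');
    @file_put_contents($_SERVER['DOCUMENT_ROOT'] . '/smart_robots_log.txt', $line, FILE_APPEND);
}
?>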

Customize

Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write an isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, respectively /robots.txt once you’ve launched your smart robots.txt.

Let’s look at /smart_robots.txt. First of all, it sets the canonical server name, change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!) it includes /smart_robots_inc.php.

Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.
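
Reconstructed from the sample responses further down (not copied from the script), these headers boil down to something like:

<?php
// Serve plain text, and keep the crawler-specific robots.txt out of caches,
// snippets, and search results (see the X-Robots-Tag in the sample output below).
header('Content-Type: text/plain; charset=iso-8859-1');
header('X-Robots-Tag: noindex, noarchive, nosnippet');
?>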

As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can login to GWC and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.

When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.

Next up are the search engine specific crawler directives. You put them as
if (isGooglebot()) {
$content .= "
User-agent: Googlebot
Disallow:

\n\n";
}

If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)

Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output
$content .= "User-agent: *
Disallow: /
\n\n";

Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.

Before $content is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte –for example leading whitespaces, a debug message, or a #comment– before the print $content; statement.
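
A minimal sketch of that branch (claimsToBeGooglebot() is a hypothetical helper that merely checks the user agent string, as opposed to the full verification in isGooglebot()):

<?php
// Rogue bot branch, sketched: a request that claims to be Googlebot but fails
// the full verification gets a 403 before a single byte of $content is sent.
if (claimsToBeGooglebot() && !isGooglebot()) {  // hypothetical UA-claim check
    logRequest(403);
    blockIp($_SERVER['REMOTE_ADDR']);
    header('HTTP/1.1 403 Forbidden');
    exit;
}
print $content;
?>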

Blocking rogue bots is important. If you discover a rogue bot –for example a scraper that pretends to be Googlebot– during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!

Bear in mind that each and every piece of content served from your site should implement rogue bot detection; that’s doable even with non-HTML resources like images or PDFs.

Finally we deliver the user agent specific robots.txt and terminate the connection.

Now let’s look at /smart_robots_inc.php. Don’t fuck up the variable definitions and routines that populate them or deal with the requestor’s IP addy.

Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).
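
For what it’s worth, a blockIp() that actually does something could look like this (DSN, credentials, and table layout are placeholders made up for illustration):

<?php
// Sketch of a non-dummy blockIp(): store the offender in a database table
// that you check on every request. A unique key on `ip` makes INSERT IGNORE work.
function blockIp($ip, $userAgent = '') {
    $db = new PDO('mysql:host=localhost;dbname=botcontrol', 'dbuser', 'dbpassword');
    $stmt = $db->prepare('INSERT IGNORE INTO blocked_ips (ip, user_agent, blocked_at) VALUES (?, ?, NOW())');
    $stmt->execute(array($ip, $userAgent));
}
?>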

/smart_robots_inc.php comes with functions that detect Googlebot, MSNbot, and Slurp.

Most search engines tell how you can verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example to add Yandex, test the host name for a leading “spider” string, a trailing “.yandex.ru” string, and an integer in between, like in the isSlurp() function.
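
Sketched out, sticking to the host name pattern just described, that check could look like this (hedged: verify the exact patterns against Yandex’s own documentation before you rely on it):

<?php
// Sketch of a crawler check for Yandex, analogous to isSlurp(): reverse DNS
// lookup, host name pattern "spider<number>.yandex.ru", forward confirmation.
function isYandexbot() {
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    if ($ip === '') return FALSE;
    $host = gethostbyaddr($ip);                           // reverse DNS
    if (!$host || !preg_match('/^spider\d+\.yandex\.ru$/i', $host)) return FALSE;
    return gethostbyname($host) === $ip;                  // forward confirmation
}
?>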

Test

Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits, you can use the “Fetch as Googlebot” thingy in your webmaster console.

Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “.search.live.com” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “inktomisearch.com” host name.

Don’t rely on my crawler detection routines. They were dumped from memory in a hurry; I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.

Launch

Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.

The output of a smart robots.txt

When you download a smart robots.txt with your browser, wget, or any other tool that comes with user agent spoofing, you’ll see a 403 or something like:


HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain;charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: *
Disallow: /
(the contact form URI above doesn’t exist)

whilst a real search engine crawler like Googlebot gets slightly different contents:


HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: Googlebot
Allow: /
Disallow:

Sitemap: http://sebastians-pamphlets.com/sitemap.xml

User-agent: *
Disallow: /

Search engines hide important information from webmasters

Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked Google doesn’t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection webmasters do need such information.

I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even announce properly that they’ve changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.

Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.

 

Enjoy.

 

Changelog:

2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.




URI canonicalization with an X-Canonical-URI HTTP header

X-Canonical-URI HTTP header

Dear search engines, you owe me one for persistently nagging you about your bugs, flaws and faults. In other words, I’m desperately in need of a good reason to praise your wisdom and whatnot. From this year’s x-mas wish list:

All search engines obey the X-Canonical-URI HTTP header

The rel=canonical link element is a great tool, at least if applied properly, but sometimes it’s a royal pain in the ass.

Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there’s no way to get a hand on the source code to change them.

Also, lots of resources can’t be stuffed with HTML’s link elements, for example dynamically created PDFs, plain text files, or images.

It’s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren’t a doable option, for example when the destination URI is #5 in a redirect chain that can’t get shortened because the redirects are performed by a 3rd party that doesn’t cooperate.

This one, on the other hand, is elegant and scalable:

if (messedUp($_SERVER["REQUEST_URI"])) {
    header("X-Canonical-URI: $canonicalUri");
}

Or:
header("Link: <http://example.com/canonical-uri/>; rel=canonical");

Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.

By the way, having all URI canonicalization routines in one piece of code is way more transparent, and way better manageable, than a bazillion of isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.
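
A rough sketch of such a central routine, say hooked in via Apache’s auto_prepend_file directive so it runs before your scripts (it won’t cover static files served directly by Apache, and messedUp() / canonicalUriFor() are hypothetical helpers you’d fill with your own URI rules):

<?php
// canonical-prepend.php — one central place for URI canonicalization.
// messedUp() and canonicalUriFor() are placeholders for your own logic.
$requestUri = $_SERVER['REQUEST_URI'];
if (messedUp($requestUri)) {
    $canonicalUri = canonicalUriFor($requestUri);
    // Until search engines obey X-Canonical-URI (see wish list above), announce
    // the canonical via the Link header, or 301 right away where that's an option:
    header("Link: <$canonicalUri>; rel=canonical");
    // header("Location: $canonicalUri", TRUE, 301); exit;
}
?>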

red crab blackmailing search engines

Dear search engines, if you make that happen, I promise not to tweet your products with a “#crap” hashtag for the whole rest of this year. Deal?

And yes, I know I’m somewhat late, two days before x-mas, but you’ve got smart developers, haven’t you? So please, go get your ‘code monkeys’ to work and surprise me. Thanks.




The anatomy of a deceptive Tweet spamming Google Real-Time Search

Google real time search spammed and abused

Minutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to making creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

  • 1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
  • 1 landing page that makes the punter pull out the plastic in no time.
  • 1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
  • 1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.

Preparation

Create a new text file and name it hot-topic.php or so. Then code:
<?php
$landingPageUri = "http://affiliate-program.com/?your-aff-id";
$trustedPageUri = "http://google.com/something.py";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
    header("HTTP/1.1 307 Here you go today", TRUE, 307);
    header("Location: $trustedPageUri");
}
else {
    header("HTTP/1.1 301 Happy shopping", TRUE, 301);
    header("Location: $landingPageUri");
}
exit;
?>

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

  • Login to Twitter and submit lots of nicely crafted, not overly keyword-stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girl friend’s romantic moments’.
  • On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
  • Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
  • Be happy and get rich by end of the week.

Google removes links to untrusted destinations; that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!




The perfect robots.txt for News Corp

News Corp kicking out Google News

I appreciate Google’s brand new News User Agent. It is, however, not a perfect solution, because it doesn’t distinguish between indexing and crawling.

Disallow is a crawler directive that simply tells web robots “do not fetch my content”. It doesn’t prevent content from getting indexed. That means search engines can index content they’re not allowed to fetch from the source, and send free traffic to disallow’ed URIs. In case of news, there are enough 3rd party signals (links, anchor text, quotes, …) out there to create a neat title and snippet on the SERPs.

Fortunately, Google’s REP implementation allows news sites to refine the suggested robots.txt syntax below. Google supports noindex in robots.txt.

Below I’ve edited the robots.txt syntax suggested by Google (source).

Include pages in Google web search, but not in News:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-News
Disallow: /
Noindex: /

This robots.txt file says that no files are disallowed from Google’s general web crawler, called Googlebot, but the user agent “Googlebot-News” is blocked from all files on the website. The “Noindex” directive makes sure that Google News cannot use forbidden stuff indexed from 3rd party signals.

User-agent: Googlebot
Disallow: /
Noindex: /

User-agent: Googlebot-News
Disallow:

When parsing a robots.txt file, Google obeys the most specific directive. The first two lines tell us that Googlebot (the user agent for Google’s web index) is blocked from crawling any pages from the site. The next directive, which applies to the more specific user agent for Google News, overrides the blocking of Googlebot and gives permission for Google News to crawl pages from the website. The “Noindex” directive makes sure that Google Web Search cannot use forbidden stuff indexed from 3rd party signals.

Of course other search engines might handle this differently. So it is obviously a good idea to add indexer directives on page level, too. The most elegant way to do that is a noindex,noarchive,nosnippet X-Robots-Tag in the HTTP header, because images, videos, PDFs etc. can’t be stuffed with HTML’s META elements.
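
For example, a script that serves a PDF could send the directives like this (a minimal sketch; the file path is made up):

<?php
// Indexer directives in the HTTP header for a resource that can't carry
// META elements, e.g. a dynamically served PDF.
header('X-Robots-Tag: noindex, noarchive, nosnippet');
header('Content-Type: application/pdf');
readfile('/path/to/whitepaper.pdf');  // placeholder path
exit;
?>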

See how this works neatly with Web standards? There’s no need for ACrAP!




Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed
&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

GA Kraken

I’ve bolded the canonical URI; everything after the question mark is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source, utm_medium, utm_campaign, utm_content, utm_term), the amount of cluttered URIs pointing to the same piece of content can sum up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo haven’t indexed GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or the end of the URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurish extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless of whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I’d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, respectively now acting counterproductive with regard to GA reporting. Google’s UTM clutter has impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds –I really mean that, literally!–, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion of webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.
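
A sketch of such a redirect script (purely hypothetical, obviously not Google code; storeFeedClick() is a made-up helper, and you’d restrict the destination to your own URIs so the thing can’t be abused as an open redirect):

<?php
// track.php?uri=... — store the tracking data server-side, then 301 to the
// clean canonical URI so nobody ever sees utm_ clutter.
$canonicalUri = isset($_GET['uri']) ? $_GET['uri'] : '';
if (parse_url($canonicalUri, PHP_URL_HOST) !== 'example.com') {  // your own host only
    header('HTTP/1.1 400 Bad Request');
    exit;
}
storeFeedClick(time(), $_SERVER['REMOTE_ADDR'], $_GET);  // hypothetical helper
header("Location: $canonicalUri", TRUE, 301);
exit;
?>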

Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update: 2010 SEMMY Nominee

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.
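
In case you want to roll your own, the core of such a tool boils down to a few lines (a sketch; fragments and other edge cases are ignored):

<?php
// Strip all utm_* parameters and collapse stray '?' delimiters into one
// proper query string.
function killUtm($uri) {
    $parts = explode('?', $uri, 2);
    if (count($parts) < 2) return $uri;
    parse_str(str_replace('?', '&', $parts[1]), $params);  // treat extra '?' as '&'
    foreach (array_keys($params) as $name) {
        if (stripos($name, 'utm_') === 0) unset($params[$name]);
    }
    $query = http_build_query($params);
    return $parts[0] . ($query !== '' ? '?' . $query : '');
}

// Example: returns the clean blogspot URI from the cluttered one quoted above.
echo killUtm('http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed');
?>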

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!




How to borrow relevance from authority pages with 307 redirects

Every once in a while I switch to Dr Evil mode. That’s a “do more evil” type of pamphlet. Don’t bother reading the disclaimer, just spam away …

Content theft with 307 redirects

Why the heck should you invest valuable time into crafting compelling content, when there’s a shortcut?

There are so many awesome Web pages out there, just pick some and steal their content. You say “duplicate content issues”, I say “don’t worry”. You say “copyright violation”, I say “be happy”. Below I explain the setup.

This somewhat shady IM technique is for you when you’re shy of automated content generation.

Register a new (short!) domain and create a tiny site with a few pages of totally unique and somewhat interesting content. Write opinion pieces, academic papers or whatnot, just don’t use content generators or anything that cannot pass a human bullshit detector. No advertising. No questionable links. Instead, link out to authority pages. No SEO stuff like nofollow’ed links to imprints or so.

Launch with a few links from clean pages. Every now and then drop a deep link in relevant discussions on forums or social media sites. Let the search engines become familiar with your site. That’ll attract even a few natural inbound links, at least if your content is linkworthy.

Use Google’s Webmaster Console (GWC) to monitor your progress. Once all URIs from your sitemap are indexed and show in [site:yourwebspam.com] searches, begin to expand your site’s menu and change outgoing links to authority pages embedded in your content.

Create short URIs (20 characters or less!) that point to authority pages. Serve search engine crawlers a 307, and human surfers a 301 redirect. Build deep links to those URIs, for example in tweets. Once you’ve gathered 1,000+ inbounds, you’ll receive SERP traffic. By the way, don’t buy the sandbox myths.

Watch the keywords page in your GWC account. It gets populated with keywords that appear only in content of pages you’ve hijacked with redirects. Watch your [site:yourwebspam.com] SERPs. Usually the top 10 keywords listed in the GWC report will originate from pages listed on the first [site:yourwebspam.com] SERPs, provided you’ve hijacked awesome content.

Add (new) keywords from pages that appear both in redirect destinations listed within the first 20 [site:yourwebspam.com] search results and in the first 20 listed keywords, to articles you actually serve on your domain.

Detect SERP referrers (human surfers who’ve clicked your URIs on search result pages) and redirect those to sales pitches. That goes for content pages as well as for redirecting URIs (mimicking shortened URIs). Laugh all the way to the bank.

Search engines rarely will discover your scam. Of course shit happens, though. Once the domain is burned, just block crawlers, redirect everything else to your sponsors, and let the domain expire.

History: Content theft with 307 redirects

Disclaimer: Google has put an end to most 307 spam tactics. That’s why I’m publishing all this crap. Because watching decreasing traffic to spammy sites is frustrating. Deceptive 307′ing URIs won’t rank any more. Slowly, very slowly actually, GWC reports follow suit.

What can we learn? Do not believe in the truth of search engine reports. Just because Google’s webmaster console tells you that Google thinks a keyword is highly relevant to your site, that doesn’t mean you’ll rank for it on their SERPs. Most probably GWC is not the average search engine spammer’s tool of the trade.




As if sloppy social media users ain’t bad enough … search engines support traffic theft

Prepare for a dose of techy tin foil hattery. [Skip rant] Again, I’m going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it’s URI, not URL).

don't get seduced by URI shorteners

Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential visitors and customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of tinyurl.com’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There are still way too many playgrounds for deceptive tactics involving shortened URIs.

How should a search engine handle a shortened URI?

Handling an URI as a shortened URL requires a bulletproof method to detect shortened URIs. That’s a breeze.

  • Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads on the top frame, and I’ve seen even a few of them making use of meta refreshes and client-sided redirects. Search engines can detect all those procedures.
  • Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
  • Seed: trusted sources like LongURL.org provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like su.pr …).
  • Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
  • URI length: the length of an URI string, if 20 characters or less, is an indicator at most, because some URI shortening services offer keyword rich short URIs, and many sites provide natural URIs this short.

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)
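
Resolving such a chain is trivial, by the way. A sketch with PHP and cURL (HTTP redirects only; framed redirects and meta refreshes need extra handling, as noted above):

<?php
// Follow the complete chain of HTTP redirects and return the final
// destination URI. Requires the cURL extension.
function resolveShortUri($uri, $maxHops = 10) {
    $ch = curl_init($uri);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => TRUE,      // headers are all we need
        CURLOPT_FOLLOWLOCATION => TRUE,      // follow 301/302/307 hops
        CURLOPT_MAXREDIRS      => $maxHops,
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_USERAGENT      => 'ShortUriResolver/0.1',  // made-up UA
    ));
    curl_exec($ch);
    $destination = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $destination;
}
?>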

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless the method of redirection. All the link juice belongs to the destination page.

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Bing 301 messed up SERPs

Oh what a mess. The candidate from Redmond fails totally on understanding the HTTP protocol. Their search index is flooded with a bazillion of URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. Also, you’ll find URIs that do a permanent redirect and have nothing to do with URI shortening in their index, too.

I can’t be bothered with checking what Bing does in response to other redirects, since the 301 test fails so badly. Clicking on their first results for [site:tinyurl.com], I’ve noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI-SERPs, please learn more under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo 301 somewhat Ok

Yahoo has done a better job. They index short URIs and show the destination page, at least via their site explorer. When I search for a tinyURL, the SERP link points to the URI shortener; that could be improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short-URIs totally right (I will not elaborate on this issue, so please don’t ask for detailed information if you’re not a SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvements

Google fails with merits

As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.

As for the somewhat evil tactics I’ve applied in my tests so far, Google fell in love with some abusive short-URIs. Google –under particular circumstances– indexes shortened URIs that game Googlebot, having sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I’ve begun to deploy sneaky sURLs, Google greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

A long time ago, Tim Berners-Lee told you that fucking with URIs is a very bad habit. Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped into a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

fight evil URI shorteners

Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us get rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive these utterly useless scumbags of their livelihood.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. — and that addresses all services that make use of or transport shortened URIs — put an end to shortened URIs. Now!




Full disclosure @ FTC

Protecting WHOM exactly?

Trying to avoid an $11,000 fine in the Federal Trade Commission’s war on bloggers:

When I praise search engines, that’s totally paid-for, because I’ve received free search results upfront.



