The perfect robots.txt for News Corp

News Copy kicking out Google NewsI appreciate Google’s brand new News User Agent. It is, however, not a perfect solution, because it doesn’t distinguish indexing and crawling.

Disallow is a crawler directive, that simply tells web robots “do not fetch my content”. It doesn’t prevent contents from indexing. That means, search engines can index content they’re not allowed to fetch from the source, and send free traffic to disallow’ed URIs. In case of news, there are enough 3rd party signals (links, anchor text, quotes, …) out there to create a neat title and snippet on the SERPs.

Fortunately, Google’s REP implementation allows news sites to refine the suggested robots.txt syntax below. Google supports noindex in robots.txt.

Below I’ve edited the robots.txt syntax suggested by Google (source).

Include pages in Google web search, but not in News:

User-agent: Googlebot

User-agent: Googlebot-News
Disallow: /
Noindex: /

This robots.txt file says that no files are disallowed from Google’s general web crawler, called Googlebot, but the user agent “Googlebot-News” is blocked from all files on the website. The “Noindex” directive makes sure that Google News cannot use forbidden stuff indexed from 3rd party signals.

User-agent: Googlebot
Disallow: /
Noindex: /

User-agent: Googlebot-News

When parsing a robots.txt file, Google obeys the most specific directive. The first two lines tell us that Googlebot (the user agent for Google’s web index) is blocked from crawling any pages from the site. The next directive, which applies to the more specific user agent for Google News, overrides the blocking of Googlebot and gives permission for Google News to crawl pages from the website. The “Noindex” directive makes sure that Google Web Search cannot use forbidden stuff indexed from 3rd party signals.

Of course other search engines might handle this differently. So it is obviously a good idea to add indexer directives on page level, too. The most elegant way to do that is a noindex,noarchive,nosnippet X-Robots-Tag in the HTTP header, because images, videos, PDFs etc. can’t be stuffed with HTML’s META elements.

See how this works neatly with Web standards? There’s no need for ACrAP!

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

GA KrakenI’ve bolded the canonical URI, everything after the questionmark is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source utm_medium utm_campaign utm_content utm_term) the amount of cluttered URIs pointing to the same piece of content can sum up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo didn’t index GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Lets forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless high sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, lets look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI:

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last questionmark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurish extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparision can fail and the trackback gets deleted on arrival, without notice. If I’d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools feeded by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, respecively now acting counterproductive with regard to GA reporting. Google’s UTM clutter has impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds –I really mean that, literally!–, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion of webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.

Perhaps this procedure would be even more accurate, because link drops can no longer mimick feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:2010 SEMMY Nominee

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

How to borrow relevance from authority pages with 307 redirects

Every once in a while I switch to Dr Evil mode. That’s a “do more evil” type of pamphlet. Don’t bother reading the disclaimer, just spam away …

Content theft with 307 redirectsWhy the heck should you invest valuable time into crafting out compelling content, when there’s a shortcut?

There are so many awesome Web pages out there, just pick some and steal their content. You say “duplicate content issues”, I say “don’t worry”. You say “copyright violation”, I say “be happy”. Below I explain the setup.

This somewhat shady IM technique is for you when you’re shy of automatted content generation.

Register a new (short!) domain and create a tiny site with a few pages of totally unique and somewhat interesting content. Write opinion pieces, academic papers or whatnot, just don’t use content generators or anything that cannot pass a human bullshit detector. No advertising. No questionable links. Instead, link out to authority pages. No SEO stuff like nofollow’ed links to imprints or so.

Launch with a few links from clean pages. Every now and then drop a deep link in relevant discussions on forums or social media sites. Let the search engines become familiar with your site. That’ll attract even a few natural inbound links, at least if your content is linkworthy.

Use Google’s Webmaster Console (GWC) to monitor your progress. Once all URIs from your sitemap are indexed and show in [] searches, begin to expand your site’s menu and change outgoing links to authority pages embedded in your content.

Create short URIs (LE 20 characters!) that point to authority pages. Serve search engine crawlers a 307, and human surfers a 301 redirect. Build deep links to those URIs, for example in tweets. Once you’ve gathered 1,000+ inbounds, you’ll receive SERP traffic. By the way, don’t buy the sandbox myths.

Watch the keywords page in you GWC account. It gets populated with keywords that appear only in content of pages you’ve hijacked with redirects. Watch your [] SERPs. Usually the top 10 keywords listed in the GWC report will originate from pages listed on the first [] SERPs, provided you’ve hijacked awesome content.

Add (new) keywords from pages that appear both in redirect destinations listed within the first 20 [] search results, as well as in the first 20 listed keywords, to articles you actually serve on your domain.

Detect SERP referrers (human surfers who’ve clicked your URIs on search result pages) and redirect those to sales pitches. That goes for content pages as well as for redirecting URIs (mimiking shortened URIs). Laugh all the way to the bank.

Search engines rarely will discover your scam. Of course shit happens, though. Once the domain is burned, just block crawlers, redirect everything else to your sponsors, and let the domain expire.

History: Content theft with 307 redirectsDisclaimer: Google has put an end to most 307 spam tactics. That’s why I’m publishing all this crap. Because watching decreasing traffic to spammy sites is frustrating. Deceptive 307′ing URIs won’t rank any more. Slowly, actually very slow, GWC reports follow suit.

What can we learn? Do not believe in the truth of search engine reports. Just because Google’s webmaster console tells you that Google thinks a keyword is highly relevant to your site, that doesn’t mean you’ll rank for it on their SERPs. Most probably GWC is not the average search engine spammer’s tool of the trade.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Handle your (UGC) feeds with care!

When you run a website that deals with user generated content (UGC), this pamphlet is for you. #bad_news Otherwise you might enjoy it. #malicious_joy

Spam in your feeds

Jump station

Not that the recent –and ongoing– paradigm shift shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I’ll blather about pitfalls, leaving chances unmentioned.


For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.

More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.

Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.

Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That’s way cheaper, and faster, than polling a gazillion of sources for updates 24/7/365.

Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.

Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).

Traditional discovery crawling, indexing, and ranking doesn’t exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion of computers to this task. Also, submission based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML-sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.

Nothing is like it used to be. Change happens.

Why does this paradigm shift puts your site at risk?

Spam puts your feeds at riskAs a matter of fact, when you publish user generated content, you will get spammed. Of course, that’s bad news of yesterday. Probably you’re confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot it’s spam that slipped through your filters.

Bad news is, your medieval palisade won’t protect you from a 21th century tank attack with air support. Why not? Because you’ve secured your HTML presentation layer, but not your feeds. There’s no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.

Feed items come with a high risk. Once they’re released, they’re immortal and multiply themselves like rabbits. That’s bad enough in case a pissed employee ‘accidently’ publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, and be it only for a few milliseconds.

If your content management system (CMS) creates a feed item on submission, search engines –in good company with legions of Web services– will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.

Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed reader flooded with PPC spam), and search engines as well (you send them weird spam signals that rise all sorts of red flags). Also, it’s not desirable to make social media services –that you rely on for marketing purposes– too suspicious (trigger happy anti-spam algos might lock away your site’s base URI in an escape-proof dungeon).

So what can you do to prevent your feeds from unwanted content?

Protect your feedsBefore I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might haven’t considered yet:

  • No nay never use integers as IDs before you’re dead sure that a piece of submitted content is floral white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren’t predictable and therefore not that vulnerable. Once a content submission is finally approved, you can donate it a nice –maybe even meaningful– URI.
  • No nay never use titles, subjects or so in URIs, not even converted text from submissions (e.g. ‘My PPC spam’ ==> ‘my_ppc_spam’). Why not? See above. And you don’t really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can’t fetch, or which they can’t refetch, at least for a while.
  • Before the final approval, serve submitted content with a “noindex,nofollow,noarchive,nosnippet” X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section. Don’t rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide that they should crawl the link’s destination anyway.
  • Consider serving social media bots requesting a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don’t obey REP directives, that means they fetch and process your stuff regardless whether you yell “noindex” at them or not.
  • For all burned (disapproved) URIs that were in use ensure that your server returns a 410-Gone HTTP status code, respectively perform a 301 redirect to a policy page or so to rescue link love that would get wasted otherwise.
  • Your Web forms for content submissions should be totally AJAX’ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk throughs, for example by asking for something unusual when a step gets completed in a too short period of time. With AJAX calls that’s painless for the legit user. Do not accept content submissions via standard GET or POST requests.
  • Serve link builders coming from SERPs for [URL|story|link submit|submission your site’s topic] etc. your policy page, not the actual Web form.
  • There’s more. With the above said, I’ve just begun to scrape the surface of a savvy spammer’s technical portfolio. There’s next to nothing a clever programmed bot can’t mimick. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.

Having said that, lets proceed to feed protection tactics. Actually, there’s just one principle set in stone:

Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!

The interesting question is: what the heck is a “final approval”? Well, that depends on your needs. Ideally, that’s you releasing each and every piece of submitted content. Since this approach doesn’t scale, think of ways to semi-automate the process. Don’t fully automate it, there’s no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.

Each and every content submission must survive a probation period, whereas it will not be included in your site’s feeds. Regardless who contributed it. Stick with the four-eye principle. Here are a few generic procedures you could adapt, respectively ideas which could inspire you:

  • Queue submissions. Possible queues are Blocked, Quarantaine, Suspect, Probation, and finally Released. Define simple rules and procedures that anyone involved can follow. SOPs lack work arounds and loopholes by design.
  • Stuck content submissions from new users who didn’t participate in other ways in quarantaine. Moderate this queue and only manually release into the probation queue what passes the moderator’s heuristics. Signup-submit-and-forget is a typical spammer behavior.
  • Maintain black lists of domain names, IPs, countries, user agent names, unwanted buzzwords and so on. Use filters to arrest submissions that contain keywords you wouldn’t expect to match your site’s theme in the Blocked or Quarantaine queue.
  • On submission fetch the link’s content and analyze it, don’t stick with heuristic checks of URIs, titles and descriptions. Don’t use methods like PHP’s file_get_contents that don’t return HTTP response codes. You need to know whether a requested URI is the first one of a redirect chain, for example. Double check with a second request from another IP, preferably owned by a widely used ISP, with a standard browser’s user agent string, that provides an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission’s suggested anchor text. If the returned content differs too much, set a red flag.
  • Maintain white lists, too. That’s a great way to reduce the amount of inavoidable false positives.
  • If you have editorial staff or moderators, they should get a Release to Feed  button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like “release to feed if mod-release = true and num-trusted-votes > 10″.
  • Categorize your user’s reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.
  • Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.
  • With all automatted releases, for example based on votings, oops, especially based on votings, implement at least one additional sanity check. For example discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votings, and so on.

Disclaimer: That’s just some food for thoughts. I want to make absolutely clear that I can’t provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

The most sexy browsers screw your analytics

Chrome and Safari fuck with the HTTP_REFERERNow that IE is quite unusable due to the lack of websites that support its non-standard rendering, and the current FireFox version suffers from various maladies, more and more users switch to browsers that are supposed to comply to Web standards, such as Chrome, Safari, or Opera.

Those sexy user agents execute client sided scripts in lightning speed, making surfers addicted to nifty rounded corners very very happy. Of course they come with massive memory leaks, but surfers who shut down their browser every once in a while won’t notice such geeky details.

Why is that bad news for Internet marketers? Because Chrome and Safari screw your analytics. Your stats are useless with regard to bookmarkers and type-in traffic. Your referrer stats lack all hits from Chrome/Safari users who have opened your landing page in a new tab or window.

Google’s Chrome and Apple’s Safari do not provide an HTTP_REFERER. (The typo is standardized, too.)

This bug was reported in September 2008. It’s not yet fixed. Not even in beta versions.

Guess from which (optional) HTTP header line your preferred stats tool compiles the search terms to create all the cool keyword statistics? Yup, that’s the HTTP_REFERER’s query string when the visitor came from a search result page (SERP). Especially on SERPs many users open links in new tabs. That means with every searcher switching to a sexy browser your keyword analysis becomes more useless.

That’s not only an analytics issue. Many sites provide sensible functionality based on the referrer (the Web page a user came from), for example default search terms for site-search facilities gathered from SERP-referrers. Many sites evaluate the HTTP_REFERER to prevent themselves from hotlinking, so their users can’t view the content they’ve paid for when they open a link in a new tab or window.

Passing a blank HTTP_REFERER when this information is available to the user agent is plain evil. Of course lots of so-called Internet security apps do this by default, but just because others do evil that doesn’t mean a top-notch Web browser like Safari or Chrome can get away with crap like this for months and years to come.

Please nudge the developers!

Here you go. Post in this thread why you want them to fix this bug asap. Tell the developers that you can’t live with screwed analytics, and that your site’s users rely on reliable HTTP_REFERERs. Even if you don’t run a website yourself, tell them that your favorite porn site bothers you with countless error messages instead of delivering smut, just because WebKit browsers are buggy.

You can test whether your browser passes the HTTP_REFERER or not: Go to this Google SERP. On the link to this post chose “Open link in new tab” (or window) in the context menu (right click over the link). Scroll down.

Your browser passed this HTTP_REFERER: None

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

As if sloppy social media users ain’t bad enough … search engines support traffic theft

Prepare for a dose of techy tin foil hattery. [Skip rant] Again, I’m going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it’s URI, not URL).

don't get seduced by URI shortenersRecap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential visitors customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There’re still way too much playgrounds for deceptive tactics involving shortened URIs

How should a search engine handle a shortened URI?

Handling an URI as shortened URL requires a bullet proof method to detect shortened URIs. That’s a breeze.

  • Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads on the top frame, and I’ve seen even a few of them making use of meta refreshs and client sided redirects. Search engines can detect all those procedures.
  • Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
  • Seed: trusted sources like provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like …).
  • Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
  • URI length: the length of an URI string, if less or equal 20 characters, is an indicator at most, because some URI shortening services offer keyword rich short URIs, and many sites provide natural URIs this short.

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [] SERPs, but on a [ shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless the method of redirection. All the link juice belongs to the destination page.

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Bing 301 messed up SERPsOh what a mess. The candidate from Redmond fails totally on understanding the HTTP protocol. Their search index is flooded with a bazillion of URI-only listings that all do a 301 redirect, more than 200,000 from alone. Also, you’ll find URIs that do a permanent redirect and have nothing to do with URI shortening in their index, too.

I can’t be bothered with checking what Bing does in response to other redirects, since the 301 test fails so badly. Clicking on their first results for [], I’ve noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI-SERPs, please learn more under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo 301 somewhat OkYahoo has done a better job. They index short URIs and show the destination page, at least via their site explorer. When I search for a tinyURL, the SERP link points to the URI shortener, that could get improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short-URIs totally right (I will not elaborate on this issue, so please don’t ask for detailled information if you’re not a SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvements

Google fails with meritsAs for tinyURLs, Google indexes only pages on the domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.

As for the somewhat evil tactis I’ve applied in my tests so far, Google fell in love with some abusive short-URIs. Google –under particular circumstances– indexes shortened URIs that game Googlebot, having sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I’ve begun to deploy sneaky sURLs, Google greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

Long time ago Tim Berners-Lee told you that URI shorteners are evil fucking with URIs is a very bad habit. Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped into a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

fight evil URI shortenersBesides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us getting rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive the livelihood of these utterly useless scumbags.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. — and that addresses all services that make use of or transport shortened URIs — put and end to shortened URIs. Now!

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Derek Powazek outed himself big-mouthed and ignorant, and why that’s a pity

Derek PowazekWith childish attacks on his colleagues, Derek Powazek didn’t do professional Web development –as an industry– a favor. As a matter of fact, Derek Powazek insulted savvy Web developers, Web designers, even search engine staff, as well as useability experts and search engine specialists, who team-up in countless projects helping small and large Web sites succeed.

I seriously can’t understand how Derek Powazek “has survived 13 years in the web biz” (source) without detailled knowledge of how things get done in Web projects. I mean, if a developer really has worked 13 years in the Web biz, he should know that the task of optimizing a Web site’s findability, crawlability, and accessibility for all user agents out there (SEO) is usually not performed by “spammers evildoers and opportunists”, but by highly professional experts who just master Web development better than the average designer, copy-writer, publisher, developer or marketing guy.

Boy, what an ego. Derek Powazek truly believes that if “[all SEO/SEM techniques are] not obvious to you, and you make websites, you need to get informed” (source). That translates to “if you aren’t 100% perfect in all aspects of Web development and Internet marketing, don’t bother making Web sites — go get a life”.

Derek PowazekWell, I consider very few folks capable of mastering everything in Web development and Internet marketing. Clearly, Derek Powazek is not a member of this elite. With one clueless, uninformed and way too offensive rant he has ruined his reputation in a single day. Shortly after his first thoughtless blog post libelling fellow developers and consultants, Google’s search result page for [Derek Powazek] is flooded with reasonable reactions revealing that Derek Powazek’s pathetic calls for ego food are factually wrong.

Of course calm and knowledgable experts in the field setting the records straight, like Danny Sullivan (search result #1 and #4 for [Derek Powazek] today) and Peter da Vanzo (SERP position #9), can outrank a widely unknown guy like Derek Powazek at all major search engines. Now, for the rest of his presence on this planet, Derek Powazek has to live with search results that tell the world what kind of an “expert” he really is (example ).

He should have read Susan Moskwa’s very informative article about reputation management on Google’s Blog a day earlier. Not that reputation management doesn’t count as an SEO skill … actually, that’s SEO basics (as well as URI canonicalization).

Dear Derek Powazek, guess what all the bright folks you’ve bashed so cleverly will do when you ask them to take down their responses to your uncalled-for dirty talk?

So what can we learn from this gratuitous debacle? Do not piss in someone’s roses when

  • you suffer from an oversized ego,
  • you’ve not the slightest clue what you’re talking about,
  • you can’t make a point with proven facts, so you’ve to use false pretences and clueless assumptions,
  • you tend to insult people when you’re out of valid arguments,
  • willy whacking is not for you, because your dick is, well, somewhat undersized.

Ok, it’s Friday evening, so I’m supposed to enjoy TGIF’s. Why the fuck am I wasting my valuable spare time writing this pamphlet? Here’s why:

Having worked in, led, and coached WebDev teams on crawlability and best practices with regard to search engine crawling and indexing for ages now, I was faced with brain amputated wannabe geniuses more than once. Such assclowns are able to shipwreck great projects. From my experience the one and only way to keep teams sane and productive is sacking troublemakers at the moment you realize they’re unconvinceable. This Powazek dude has perfectly proven that his ignorance is persistent, and that his anti-social attitude is irreversible. He’s the prime example of a guy I’d never hire (except if I’d work for my worst enemy). Go figure.

Update 2009-10-19: I consider this a lame excuse. Actually, it’s even more pathetic than the malicious slamming of many good folks in his previous posts. If Derek Powazek really didn’t know what “SEO” means in the first place, his brain farts attacking something he didn’t understand at the time of publishing his rants are indefensible, provided he was anything south of sane then. Danny Sullivan doesn’t agree, and he’s right when he says that every industry has some black sheep, but as much as I dislike comment spammers, I dislike bullshit and baseness.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Full disclosure @ FTC

Protecting WHOM exactly?Trying to avoid an $11,000 fine in the Federal Trade Commission’s war on bloggers:

When I write about praise search engines, that’s totally paid-for because I’ve received free search results upfront.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

Search engines should make shortened URIs somewhat persistent

URI shorteners are crap. Each and every shortened URI expresses a design flaw. All –or at least most– public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to “put traffic at risk”. Not to speak of link love (PageRank, Google juice, link popularity) lost forever.

SEs could rescue tiny URLsSearch engines could provide a way out of the sURL dilemma that Twitter & Co created with their crappy, thoughtless and shortsighted software designs. Here’s how:

Most browsers support search queries in the address bar, as well as suggestions (aka search results) on DNS errors, and sometimes even 404s or other HTTP response codes other than 200/3x. That means browsers “ask a search engine” when an HTTP request fails.

When a TLD is out of service, search engines could have crawled a 301 or meta refresh from a page formerly living on a .yu domain for example. They know the new address and can lead the user to this (working) URI.

The same goes for shortened URIs created ages ago by URI shortening services that died in the meantime. Search engines have transferred all the link juice from the shortened URI to the destination page already, so why not point users that request a dead short URI to the right destination?

Search engines have all the data required for rescuing short URIs that are out of service in their datebases. Not de-indexing “outdated” URIs belonging to URI shorteners would be a minor tweak. At least Google has stored attributes and behavior of all links on the Web since the past century, and most probably other search engines are operated by data rats too.

URI shorteners can be identified by simple patterns. They gather tons of inbound links from foreign domains that get redirected (not always using a 301!) to URIs on other 3rd party domains. Of course that applies to some AdServers too, but rest assured search engines do know the differences.

So why the heck didn’t Google, Yahoo/MSN Bing, and Ask offer such a service yet? I thought it’s all about users, but I might have misread something. Sigh.

By the way, I’ve recorded search engine misbehavior with regard to shortened URIs that could arouse Jack The Ripper, but that’s a completely other story.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

The “just create compelling and useful content” lie

The compelling content lieI’m so sick of the universal answer to all SEO questions. Each and every search engine rep keeps telling me that “creating a great and useful site with compelling content will gain me all the rankings I deserve”. What a pile of bullshit. Nothing ranks without strong links. I deserve more, and certainly involving less work.

Honestly, why the heck should I invest any amount of time and even money to make others happy? It’s totally absurd to put up a great site with compelling content that’s easy to navigate and all that, just to please an ungrateful crowd of anonymous users! What a crappy concept. I don’t buy it.

I create websites for the sole purpose of making money, and lots of green. It’s all about ME! Here I said it. Now pull out the plastic. ;-)

And there’s another statement that really annoys me: “make sites for users, not for search engines”. Again, with self-serving commandments like this one search engine quality guidelines do insult my intelligence. Why should I make even a single Web page for search engines? Google, Yahoo Microsoft and Ask staff might all be wealthy folks, but from my experience they don’t purchase much porn on-line.

I create and publish Web contents to laugh all the way to the bank at the end of the day. For no other reason. I swear. To finally end the ridiculous discussion of utterly useless and totally misleading search engine guidelines, I’ll guide you step by step through a few elements of a successful website, explaining why and for whom I perform whatever, why it’s totally selfish, and what it’s worth.

In some cases that’ll be quite a bit geeky, so just skip the technical stuff if you’re a plain Internet marketer. Also, I don’t do everything by the book on Web development, so please read the invisible fine print carefully .


This gatekeeper prevents my sites from useless bot traffic. That goes for behaving bots at least. Others might meet a script that handles them with care. I’d rather serve human users bigger ads than wasting bandwidth for useless bots requesting shitloads of content.


I’m a big fan of the “one URI = one piece of content” principle. I consider gazillions of URI variations serving similar contents avoidable crap. That’s because I can’t remember complex URIs stuffed with tracking parameters and other superfluous clutter. Like the average bookmarking surfer, I prefer short and meaningful URIs. With a few simple .htaccess directives I make sure that everybody gets the requested piece of content advertising under the canonical URI.


Error handling is important. Before I throw a 404-Not-found error to a human visitor, I analyze the request’s context (e.g. REQUEST_URI, HTTP_REFERER). If possible, I redirect the user to the page s/he should have requested then, or at least to a page with related links banner ads. Bouncing punters don’t make me any money.

HTTP headers

There’s more than HTTP response codes doable with creative headers. For example cost savings. In some cases a single line of text in an HTTP header tells the user agent more than a bunch of markup.

Sensible page titles and summaries

There’s nothing better to instantly catch the user’s interest than a prominent page title, followed by a short and nicely formatted summary that the average reader can skim in few seconds. Fast loading and eye-catching graphics can do wonders, too. Just in case someone’s scraping bookmarking my stuff, I duplicate my titles into usually invisible TITLE elements, and provide seducing calls for action in descriptive META elements. Keeping the visitor interested for more than a few seconds results in monetary opportunities.

Unique content

Writing unique, compelling, and maybe even witty product descriptions increases my sales. Those are even more attractive when I add neat stuff that’s not available from the vendor’s data feed (I don’t mean free shipping, that’s plain silly). A good (linkworthy) product page comes with practical use cases and/or otherwise well presented, not too longish outlined, USPs. Producers as well as distributors do suck at this task.

User generated content

Besides faked testimonials and the usual stuff, asking vistors questions like “Just in case you buy product X here, what will you actually do with it? How will you use it? Whom will you give it to?” leads to unique text snippets creating needs. Of course all user generated content gets moderated.

Ajax’ed “Buy now” widgets

Most probably a punter who has clicked the “add to shopping cart” link on a page nicely gathering quite a few products will not buy another one of them, if the mouse click invokes a POST request of the shopping cart script requiring a full round trip to the server. Out of sight, out of mind.


Both static as well as dynamically themed sitemap pages funnel a fair amount of vistors to appropriate landing pages. Dumping major parts of a site’s structure in XML format attracts traffic, too. There’s no such thing as bad traffic, just weak upselling procedures.

Ok, that’s enough examples to bring my point home. Probably I’ve bored you to death anyway. You see, whatever I do as a site owner, I do it only for myself (inevitably accepting collateral damage like satisfied punters and compliance to search engine quality guidelines). Recap: I don’t need no stinkin’ advice restrictions from search engines.

Seriously, since you’re still awake and following my long-winded gobbledygook, here’s a goodie:

The Number One SEO Secret

Think like a search engine engineer. Why? Because search engines are as selfish as you are, or at least as you should be.

In order to make money from advertising, search engines need boatloads of traffic every second, 24/7/365. Since there are only so many searchers populating this planet, search engines rely on recurring traffic. Sounds like a pretty good reason to provide relevant search results, doesn’t it?

That’s why search engines develop high sophisticated algorithms that try to emulate human surfers, supposed to extract the most useful content from the Web’s vast litter boxes. Their engineers tweak those algos on a daily basis, factoring in judgements of more and more signals as communities, services, opportunities, behavior, and techniques used on the Web evolve.

They try really hard to provide their users with the cream of the crop. However, SE engineers are just humans like you and me, therefore their awesome algos –as well as the underlying concepts– can fail. So don’t analyze too many SERPs to figure out how a SE engineer might tick, just be pragmatic and imagine you’ve got enough SE shares to temporarily throw away your Internet marketer hat.

Anytime when you have to make a desicion on content, design, navigation or whatever, switch to search engine engineer mode and ask yourself questions like “does this enhance the visitor’s surfing experience?”, “would I as a user appreciate this?” or simply “would I buy this?”. (Of course, instead of emulating an imaginary SE engineer, you also could switch to plain user mode. The downside of the latter brain emulation is, unfortunately, that especially geeks tend to trust the former more easily.)

Bear in mind that even if search engines don’t cover particular (optimization) techniques today, they might do so tomorrow. The same goes for newish forms of content presentation etc. Eventually search engines will find a way to work with any signal. Most of our neat little tricks bypassing today’s spam filters will stop working some day.

After completing this sanity check, heal your schizophrenia and evaluate whether it will make you money or not. Eventually, that’s the goal.

By the way, the above said doean’t mean that only so-called ‘purely white hat’ traffic optimization techniques work with search engines. Actually, SEO = hex’53454F’, and that’s a pretty dark gray. :-)

Related thoughts: Optimize for user experience, but keep the engines in mind. Misleading some folks just like this pamphlet.

Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments

« Previous Page  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28  Next Page »