Google went belly-up: SERPs sneakily redirect to FPAs

I’m pissed. I do know I shouldn’t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.

What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google’s webmaster guidelines explicitly forbid such black hat tactics:

“Don’t use cloaking or sneaky redirects.” Google just did the latter with its very own SERPs. The search interface, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That’s a helpful redirect for some IE6 users who don’t suffer from an IT department stuck with this outdated browser, but it’s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, UA cloaking is the only way to heal this sneaky behavior.

“Don’t create pages with malicious behavior.” Google’s guilty, too. Instead of checking for the user’s browser, redirecting only IE6 requests from Google’s discontinued IE6 support (IE6 toolbar …) to the IE8 advertisement, whilst all other user agents get their desired search box, respectively their SERPs, under a… URI, Google performs an unconditional redirect to a page that’s utterly useless and also totally unexpected for many searchers. I consider misleading redirects malicious.

“Avoid links to web spammers or ‘bad neighborhoods’ on the web.” I consider the propaganda for IE that Google displays instead of the search results I’d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous workarounds. (Ok, ok, ok … Google’s lack of geekiness doesn’t exactly count as violation of their webmaster guidelines, but it sounds good, doesn’t it?)

Hey Matt Cutts, about time to ban! Click to tweet that

Google’s very best search interface is history. Here is what you got under

Google's famous minimalistic search UI

And here is where Google sneakily redirects you to when you load the SERP link above (even with Chrome!):

Google's sneaky IE8 propaganda

It’s sad that a browser vendor like Google (and yes, Google Chrome is my favorite browser) feels the need to mislead its users with propaganda for a competing browser that’s slower and doesn’t render everything as it should render it. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that’s a scandal. So, if you agree, please submit a spam report to Google:

Hey Matt Cutts, about time to ban! #spam-report Tweet Your Spam Report

2010-05-17 I’ve updated this pamphlet because it didn’t explain the “sneakiness” clearly enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and only stubborn IE6 users ads for a somewhat better browser.

2010-05-18 Q: You’re pissed solely because your SERP scraping scripts broke. A: Glad you’ve asked. Yes, I’ve scraped Google’s /ie search too. Not because I’m a privacy nazi like Daniel Brandt. I’ve just checked (my) rankings. However, when I spotted the redirects I didn’t even remember the location of the scripts that scraped this service, because I didn’t look at ranking reports for years. I’m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold “no”. I don’t give a fucking dead rat’s ass what ranking reports based on scraped SERPs could tell.


Get yourself a smart robots.txt

Crawlers and other Web robots are the plague of today’s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don’t. Behaving or not, most bots just steal your content. You don’t appreciate that, so block them.

This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines –that bring you nice traffic– while keeping the nasty (or useless, traffic-wise) bots out of the game.

The basic idea is that blocking all bots –with very few exceptions– makes more sense than maintaining kinda Web robots who’s who in your robots.txt file. You decide whether a bot, respectively the service it crawls for, does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold Disallow: /.

Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.

Even bots that respect the REP can harm your search engine rankings, or reveal sensitive information to your competitors. Every minute a new bot turns up. You can’t manage all of them, and you can’t trust any (behaving) bot. Or, as the master of bot control explains: “That’s the only thing I’m concerned with: what do I get in return. If it’s nothing, it’s blocked”.

Also, large robots.txt files handling tons of bots are error-prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If you on the other hand verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don’t enlighten your competitors.

If you’re a smart webmaster agreeing with this approach, here’s your ToDo-List:
• Grab the code
• Install
• Customize
• Test
• Implement.
On error read further.

The anatomy of a smart robots.txt

Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.


First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:
<FilesMatch "^robots\.txt$">
SetHandler application/x-httpd-php
</FilesMatch>

As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.

Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!

For testing purposes you can use the logRequest() function. Probably it’s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don’t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.


Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write an isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, respectively /robots.txt once you’ve launched your smart robots.txt.

Let’s look at /smart_robots.txt. First of all, it sets the canonical server name, change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!) it includes /smart_robots_inc.php.

Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.

As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can login to GWC and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.

When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.

Next up are the search engine specific crawler directives. You put them as
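if (isGooglebot()) {
    // Section served only to a verified Googlebot; replace Allow: / with your own directives.
    $content .= "
User-agent: Googlebot
Allow: /

";
}
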
If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)

Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output
$content .= "User-agent: *
Disallow: /
";

Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.
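To make the whitelist-plus-fallback idea concrete, here is a minimal sketch in Python (not the PHP from the download). The names WHITELIST, FALLBACK and build_robots_txt() are made up for illustration, and the directives are placeholders. It assumes crawler verification has already happened elsewhere.

```python
# Hypothetical per-crawler sections; adapt the directives to your own site.
WHITELIST = {
    "Googlebot": "User-agent: Googlebot\nAllow: /\n",
    "Slurp": "User-agent: Slurp\nAllow: /\n",
}

# Fallback section that every requestor gets, verified or not.
FALLBACK = "User-agent: *\nDisallow: /\n"


def build_robots_txt(verified_crawler=None):
    """Return the robots.txt body for a single request.

    verified_crawler is the name of a crawler that already passed
    verification, or None for everybody else.
    """
    sections = []
    if verified_crawler in WHITELIST:
        # Only the requesting crawler sees its own section.
        sections.append(WHITELIST[verified_crawler])
    sections.append(FALLBACK)
    # Joining with a newline leaves one empty line between sections.
    return "\n".join(sections)
```

A verified Googlebot gets its own section followed by the fallback, everybody else gets only the Disallow: / block, so nobody learns which directives another crawler receives.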

Before $content is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte –for example leading whitespaces, a debug message, or a #comment– before the print $content; statement.

Blocking rogue bots is important. If you discover a rogue bot –for example a scraper that pretends to be Googlebot– during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!
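A rough Python sketch of that rule, assuming an in-memory set instead of the database table a real blockIp() should write to. RogueBotBlocklist and http_status() are hypothetical names, not functions from the download.

```python
class RogueBotBlocklist:
    """In-memory stand-in for a database table of blocked requestors."""

    def __init__(self):
        self._blocked = set()

    def block(self, ip, user_agent):
        # Remember the combination of IP and user agent string.
        self._blocked.add((ip, user_agent))

    def is_blocked(self, ip, user_agent):
        return (ip, user_agent) in self._blocked


def http_status(ip, user_agent, claims_crawler, verified, blocklist):
    """Return 403 for rogue or previously blocked requestors, else 200."""
    if claims_crawler and not verified:
        # The user agent claims to be a crawler but failed verification.
        blocklist.block(ip, user_agent)
    if blocklist.is_blocked(ip, user_agent):
        return 403
    return 200
```

The same blocklist lookup can then run in front of every other resource you serve, not only robots.txt.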

Bear in mind that each and every piece of content served from your site should implement rogue bot detection, that’s doable even with non-HTML resources like images or PDFs.

Finally we deliver the user agent specific robots.txt and terminate the connection.

Now let’s look at /smart_robots_inc.php. Don’t fuck-up the variable definitions and routines that populate them or deal with the requestor’s IP addy.

Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).

/smart_robots_inc.php comes with functions that detect Googlebot, MSNbot, and Slurp.

Most search engines tell how you can verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example to add Yandex, test the host name for a leading “spider” and trailing “” string and in between an integer, like in the isSlurp() function.
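The usual verification recipe is a reverse DNS lookup on the requesting IP, a check of the resulting host name, and a forward lookup that must lead back to the same IP. Below is a minimal Python sketch of that check, not the PHP from /smart_robots_inc.php. The function name is_verified_crawler() and the suffix list are assumptions; take the real host name patterns from each search engine's documentation. The lookups are injectable so the logic can be tested without network access, and the default forward lookup handles IPv4 only.

```python
import socket

# Assumed host name suffixes for Googlebot; verify against Google's docs.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, suffixes, reverse_lookup=None, forward_lookup=None):
    """Reverse lookup, host name check, forward lookup back to the same IP."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        if not host.lower().endswith(tuple(suffixes)):
            return False
        # The host name must resolve back to the requesting IP.
        return ip in forward_lookup(host)
    except OSError:
        # No PTR record or failed resolution: treat as not verified.
        return False
```

Usage would look like is_verified_crawler(remote_ip, GOOGLEBOT_SUFFIXES), called only when the user agent string claims to be Googlebot. Caching the result per IP saves DNS lookups.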


Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits, you can use the “Fetch as Googlebot” thingy in your webmaster console.

Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “” host name.

Don’t rely on my crawler detection routines. They’re dumped from memory in a hurry, I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.


Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.

The output of a smart robots.txt

When you download a smart robots.txt with your browser, wget, or any other tool that comes with user agent spoofing, you’ll see a 403 or something like:

HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain;charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via

User-agent: *
Disallow: /
(the contact form URI above doesn’t exist)

whilst a real search engine crawler like Googlebot gets slightly different contents:

HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via

User-agent: Googlebot
Allow: /


User-agent: *
Disallow: /

Search engines hide important information from webmasters

Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked Google didn’t even mention the Googlebot-News user agent in their help files, nor did they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection webmasters do need such information.

I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even announce properly that they’ve changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.

Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.





2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.


SEO Bullshit: Mimicking a file system in URIs

Way back in the WWW’s early Jurassic, micro computer based Web development tools sneakily began poisoning the formerly ideal world of the Internet. All of a sudden we saw ‘.htm’ URIs, because CP/M and later on PC-DOS file extensions were limited to 3 characters. Truncating the ‘language’ part of HTML was bad enough. Actually, fucking with well established naming conventions wasn’t just a malady, but a symptom of a worse world wide pandemic.

Unfortunately, in order to bring Web publishing to the mere mortals (folks who could afford a micro computer), software developers invented DOS-like restrictions the Web wasn’t designed for. Web design tools maintained files on DOS file systems. FTP clients managed to convert backslashes originating from DOS file systems to slashes on UNIX servers, and vice versa (long before NT 3.51 and IIS). Directory names / file names equalled URIs. Most Web sites were static.

None of those cheap but fancy PC based Web design tools came with a mapping of objects (locally stored as files back then) to URIs pointing to Web resources, despite Tim Berners-Lee’s warnings (like “It is the duty of a Webmaster to allocate URIs which you will be able to stand by in 2 years, in 20 years, in 200 years. This needs thought, and organization, and commitment.”). The technology used to create a resource named its unique identifier (URI). That’s as absurd as wearing diapers a whole life long.

Newbie Web designers grew up with this flawed concept, and never bothered to research the Web’s fundamentals. In their limited view of the Web, a URI was a mirrored version of a file name and its location on their local machine, and everything served from /cgi-bin/ had to be blocked in robots.txt, because all dynamic stuff was evil.

Today, those former newbies consider themselves oldtimers. Actually, they’re still greenhorns, because they’ve never learned that URIs have nothing to do with files, directories, or a Web resource’s (current) underlying technology (as in .php3 for PHP version 3.x, .shtml for SSI, …).

Technology evolves, even changes, but (valuable) contents tend to stay. URIs should solely address a piece of content, they must not change when the technology used to serve those contents changes. That means strings like ‘.html’ or folder names must not be used in URIs.

Many of those notorious greenhorns offer their equally ignorant clients Web development and SEO services today. They might have managed to handle dynamic contents by now (thanks to osCommerce, WordPress and other CMSs), but they’re still stuck with ancient paradigms that were never meant to exist on the Internet.

They might have discovered that search engines are capable of crawling and indexing dynamic contents (URIs with query strings) nowadays, but they still treat them as dumb bots — as if Googlebot or Slurp weren’t more sophisticated than Altavista’s Scooter of 1998.

They might even develop trendy crap (version 2.0 with nifty rounded corners) today, but they still don’t get IT. Whatever IT is, it doesn’t deserve a URI like /category/vendor/product/color/size/crap.htm.

Why hierarchical URIs (expressing breadcrumbs or whatnot) are utter crap (SEO-wise as well as from a developer’s POV) is explained here:

SEO Toxin


I’ve published my rant “Directory-Like URI Structures Are SEO Bullshit” on SEO Bullshit dot com for a reason.

You should keep an eye on this new blog. Subscribe to its RSS feed. Watch its Twitter account.

If it’s about SEO and it’s there, it’s most probably bullshit. If it’s bullshit, avoid it.

If you plan to spam the SEO blogosphere with your half-assed newbie thoughts (especially when you’re an unconvinceable ‘oldtimer’), consider obeying this rule of thumb:

The top minus one reason to publish SEO stupidity is: You’ll end up here.

Of course that doesn’t mean newbies shouldn’t speak out. I’m just sick of newbies who sell their half-assed brain farts as SEO advice to anyone. Noobs should read, ask, listen, learn, practice, evolve. Until they become pros. As a plain Web developer, I can tell from my own experience that listening to SEO professionals is worth every minute of your time.


How do Majestic and LinkScape get their raw data?

Does your built-in bullshit detector cry in agony when you read announcements of link analysis tools claiming to have crawled Web pages in the trillions? Can a tiny SEO shop, or a remote search engine in its early stages running on donated equipment, build an index of that size? It took Google a decade to reach these figures, and Google’s webspam team alone outnumbers the staff of SEOmoz and Majestic, not to speak of infrastructure.

Well, it’s not as shady as you might think, although there’s some serious bragging and willy whacking involved.

First of all, both SEOmoz and Majestic do not own an indexed copy of the Web. They process markup just to extract hyperlinks. That means they parse Web resources, mostly HTML pages, to store linkage data. Once each link and its attributes (HREF and REL values, anchor text, …) are stored under a Web page’s URI, the markup gets discarded. That’s why you can’t search these indexes for keywords. There’s no full text index necessary to compute link graphs.

The storage requirements for the Web’s link graph are way smaller than for a full text index that major search engines have to handle. In other words, it’s plausible.

Majestic clearly describes this process, and openly tells that they index only links.

With SEOmoz that’s a completely different story. They obfuscate information about the technology behind LinkScape to a level that could be described as near-snake-oil. Of course one could argue that they might be totally clueless, but I don’t buy that. You can’t create a tool like LinkScape being a moron with an IQ slightly below an amoeba. As a matter of fact, I do know that LinkScape was developed by extremely bright folks, so we’re dealing with a misleading sales pitch:

Linkscape index size

Let’s throw in a comment at Sphinn, where a SEOmoz rep posted “Our bots, our crawl, our index”.

Of course that’s utter bullshit. SEOmoz does not have the resources to accomplish such a task. In other words, if –and that’s a big IF– they do work as described above, they’re operating something extremely sneaky that breaks Web standards and my understanding of fairness and honesty. Actually, that’s not so, but because it is not so, LinkScape and OpenSiteExplorer in its current shape must die (see below why).

They do insult your intelligence as well as mine, and that’s obviously not the right thing to do, but I assume they do it solely for marketing purposes. Not that they need to cover up their operation with a smokescreen like that. LinkScape could succeed with all facts on the table. I’d call it a neat SEO tool, if it just would be legit.

So what’s wrong with SEOmoz’s statements above, and LinkScape at all?

Let’s start with “Crawled in the past 45 days: 700 billion links, 55 billion URLs, 63 million root domains”. That translates to “crawled … 55 billion Web pages, including 63 million root index pages, carrying 700 billion links”. 13 links per page is plausible. Crawling 55 billion URIs requires sending out HTTP GET requests to fetch 55 billion Web resources within 45 days, that’s roughly 30 terabyte per day. Plausible? Perhaps.
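The back-of-envelope numbers can be checked in a few lines of Python. This sketch only rearranges the figures quoted above; the 30 terabyte per day is taken from the text as given, and the function name crawl_estimates() is made up for illustration.

```python
def crawl_estimates(urls=55e9, links=700e9, days=45, terabytes_per_day=30):
    """Derive per-page and per-second figures from the quoted totals."""
    urls_per_day = urls / days
    return {
        # Average number of links on a fetched page.
        "links_per_page": links / urls,
        # Sustained fetch rate needed to finish within the period.
        "requests_per_second": urls_per_day / 86400,
        # Average transfer size per fetched resource, in kilobytes.
        "kilobytes_per_page": terabytes_per_day * 1e12 / urls_per_day / 1e3,
    }
```

With the quoted totals that works out to just under 13 links per page, roughly 14,000 requests per second around the clock, and an average of about 25 kilobytes per fetched resource.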

True? Not as is. Making up numbers like “crawled 700 billion links” suggests a comprehensive index of 700 billion URIs. I highly doubt that SEOmoz did ‘crawl’ 700 billion URIs.

If SEOmoz really crawled the Web, they’d have to respect Web standards like the Robots Exclusion Protocol (REP). You would find their crawler in your logs. An organization crawling the Web must

  • do that with a user agent that identifies itself as crawler, for example “Mozilla/5.0 (compatible; Seomozbot/1.0; +”,
  • fetch robots.txt at least daily,
  • provide a method to block their crawler with robots.txt,
  • respect indexer directives like “noindex” or “nofollow” both in META elements as well as in HTTP response headers.
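What the robots.txt part of these requirements means for a crawler can be sketched with Python's standard urllib.robotparser module. The robots.txt content and the “Seomozbot” name below are illustrative only (the name is the hypothetical one from the list above), and may_fetch() is a made-up helper. Note that the parser matches on the product token, so pass “Seomozbot” rather than the full “Mozilla/5.0 (compatible; …)” string.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that locks out one crawler and allows the rest.
ROBOTS_TXT = """\
User-agent: Seomozbot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt, product_token, url):
    """Return True if robots_txt permits product_token to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(product_token, url)
```

A behaving crawler would call something like may_fetch() before every request and re-fetch robots.txt regularly. Indexer directives such as “noindex” are a separate check on the fetched response.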

SEOmoz obeys only <META NAME="SEOMOZ" CONTENT="NOINDEX" />, according to their sources page. And exactly this page reveals that they purchase their data from various services, including search engines. They do not crawl a single Web page.

Savvy SEOs should know that crawling, parsing, and indexing are different processes. Why does SEOmoz insist on the term “crawling”, taking all the flak they can get, when they obviously don’t crawl anything?

Two claims out of three in “Our bots, our crawl, our index” are blatant lies. If SEOmoz performs any crawling, in addition to processing bought data, without following and communicating the procedure outlined above, that would be sneaky. I really hope that’s not happening.

As a matter of fact, I’d like to see SEOmoz crawling. I’d be very, very happy if they would not purchase a single byte of 3rd party crawler results. Why? Because I could block them in robots.txt. If they don’t access my content, I don’t have to worry whether they obey my indexer directives (robots meta ‘tag’) or not.

As a side note, requiring a “SEOMOZ” robots META element to opt out of their link analysis is plain theft. Adding such code bloat to my pages takes a lot of time, and that’s expensive. Also, serving an additional line of code in each and every HEAD section sums up to a lot of wasted bandwidth –$$!– over time. Am I supposed to invest my hard earned bucks just to prevent me from revealing my outgoing links to my competitors? For that reason alone I should report SEOmoz to the FTC requesting them to shut LinkScape down asap.

They don’t obey the X-Robots-Tag (“noindex”/“nofollow”/… in the HTTP header) for a reason. Working with purchased data from various sources they can’t guarantee that they even get those headers. Also, why the fuck should I serve MSNbot, Slurp or Googlebot an HTTP header addressing SEOmoz? This could put my search engine visibility at risk.

If they’d crawl themselves, serving their user agent a “noindex” X-Robots-Tag and a 403 might be doable, at least when they pay for my efforts. With their current setup that’s technically impossible. They could switch to completely, that’ll solve the problem, provided 80legs works 100% by the REP and crawls as “SEOmozBot” or so.

With MajesticSEO that’s not an issue, because I can block their crawler with
User-agent: MJ12bot
Disallow: /

Yahoo’s site explorer also delivers too much data. I can’t block it without losing search engine traffic. Since it will probably die when Microsoft takes over, I don’t rant much about it. Google and Bing don’t reveal my linkage data to everyone.

I have an issue with SEOmoz’s LinkScape, and OpenSiteExplorer as well. It’s serious enough that I say they have to close it, if they’re not willing to change their architecture. And that has nothing to do with misleading sales pitches, or arrogant behavior, or sympathy (respectively, a possible lack of sympathy).

The competitive link analysis OpenSiteExplorer/LinkScape provides, without giving me a real chance to opt out, puts my business at risk. As much as I appreciate an opportunity to analyze my competitors, vice versa it’s downright evil. Hence just kill it.

Is my take too extreme? Please enlighten me in the comments.

Update: A follow-up post from Michael VanDeMar and its Sphinn discussion, the first LinkScape thread at Sphinn, and Sphinn comments to this pamphlet.


How brain-amputated developers created the social media plague

The bot playground commonly referred to as “social media” is responsible for shitloads of absurd cretinism.

For example Twitter, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate 27% valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand for cheap viagra).

Bazillions of other bots [type C] retweet bot [type B] generated crap and create lists of bots [type A, B, C]. In rare cases when a non-bot tries to participate in Twitter, the uber-bot [type T] prevents the whole bot network from negative impacts by serving a 503 error to the homunculus’ browser.

This pamphlet is about the idiocy of a particular subclass of bots [type S] that sneakily work in the underground stealing money from content producers, and about their criminal (though brain-dead) creators. May they catch the swine flu, or at least pox or cholera, for the pest they’ve brought to us.

The Twitter pest that costs you hard earned money

WTF am I ranting about? The technically savvy reader, familiar with my attitude, has already figured out that I’ve read way too many raw logs. For the sake of a common denominator, I encourage you to perform a tiny real-world experiment:

  • Publish a great and linkworthy piece of content.
  • Tweet its URI (not shortened - message incl. URI ≤ 139 characters!) with a compelling call for action.
  • Watch your server logs.
  • Puke. Vomit increases with every retweet.

So what happens on your server? A greedy horde of bots pounces on every tweet containing a link, requesting its content. That’s because on Twitter all URIs are suspected to be shortened (learn why Twitter makes you eat shit). This uncalled-for –IOW abusive– bot traffic burns your resources, and (with a cheap hosting plan) it can hinder your followers from reading your awesome article and prevent them from clicking on your carefully selected ads.

Those crappy bots not only cost you money because they keep your server busy and increase your bandwidth bill, they actively decrease your advertising revenue because your visitors hit the back button when your page isn’t responsive due to the heavy bot traffic. Even if you’ve great hosting, you probably don’t want to burn money, not even pennies, right?

Bogus Twitter apps and their modus operandi

If only every Twitter&Crap-mashup would lookup each URI once, that wouldn’t be such a mess. Actually, some of these crappy bots request your stuff 10+ times per tweet, and again for each and every retweet. That means, the more popular your content becomes, the more bot traffic it attracts.

Most of these bots don’t obey robots.txt, that means you can’t even block them applying Web standards (learn how to block rogue bots). Topsy, for example, does respect the content producer, so morons using “Python-urllib/1.17” or “AppEngine-Google; (+; appid: mapthislink)” could obey the Robots Exclusion Protocol (REP), too. Their developers are just too fucking lazy to understand such protocols that every respected service on the Web (search engines…) obeys.

Some of these bots even provide an HTTP_REFERER to lure you into viewing the website operated by their shithead of developer when you’re viewing your referrer stats. Others fake Web browsers in their user agent string, just in case you’re not smart enough to smell shit that really stinks (IOW browser-like requests that don’t fetch images, CSS files, and so on).

One of the worst offenders is outing itself as “ThingFetcher” in the user agent string. It’s hosted by Rackspace, which is a hosting service that obviously doesn’t care much about its reputation. Otherwise these guys would have reacted to my various complaints WRT “ThingFetcher”. By the way, Robert Scoble represents Rackspace, you could drop him a line if ThingFetcher annoys you, too.

ThingFetcher sometimes requests a (shortened) URI 30 times per second, from different IPs. It can get worse when a URI gets retweeted often. This malicious piece of code doesn’t obey robots.txt, and doesn’t cache results. Also, it’s too dumb to follow chained redirects, by the way. It doesn’t even publish its results anywhere, at least I couldn’t find the fancy URIs I’ve fed it with in Google’s search index.

In ThingFetcher’s defense, its developer might say that it performs only HEAD requests. Well, it’s true that HEAD requests provoke only an HTTP response header. But: the script invoked gets completely processed, just the output is trashed.

That means, the Web server has to deal with the same load as with a GET request, it just deletes the content portion (the completely formatted HTML page) when responding, after counting its size to send the Content-Length response header. Do you really believe that I don’t care about machine time? For each of your utterly useless bogus requests I could have my server deliver ads to a human visitor, who pulls the plastic if I’m upselling the right way (I do, usually).

Unfortunately, ThingFetcher is not the only bot that does a lookup for each URI embedded in a tweet, per tweet processed. Probably the overall number of URIs that appear only once is bigger than the number of URIs that appear quite often while a retweet campaign lasts. That means that doing HTTP requests is cheaper for the bot’s owner, but on the other hand that’s way more expensive for the content producer, and the URI shortening services involved as well.

ThingFetcher update: The owners of ThingFetcher are now aware of the problem, and will try to fix it asap (more information). Now that I know who’s operating the Twitter app owning ThingFetcher, I’ve removed some insults from above, because they’d no longer address an anonymous developer, but bright folks who’ve just failed once. Too sad that Brizzly didn’t reply earlier to my attempts to identify ThingFetcher’s owner.

As a content producer I don’t care about the costs of any Twitter application that processes Tweets to deliver anything to its users. I care about my costs, and I can perfectly live without such a crappy service. Liberally, I can allow one single access per (shortened) URI to figure out its final destination, but I can’t tolerate such thoughtless abuse of my resources.

Every Twitter related “service” that does multiple requests per (shortened) URI embedded in a tweet is guilty of theft and pilferage. Actually, that’s an understatement, because these raids cost publishers an enormous sum across the Web.

These fancy apps shall maintain a database table storing the destination of each redirect (chain) accessible by its short URI. Or leave the Web, respectively pay the publishers. And by the way, Twitter should finally end URI shortening. Not only does it break the Internet, it’s way too expensive for all of us.
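For the lazy folks among the bot developers, here’s roughly what such a lookup table boils down to. It’s a minimal sketch, not anybody’s production code: resolveOnce(), the $cache array (standing in for the database table) and the $fetchLocation callback (one HEAD request, returning the Location header or NULL) are all made-up names.

```php
<?php
/**
 * Resolve a short URI at most once and remember its final destination.
 * $fetchLocation is a hypothetical callback that performs ONE HEAD request
 * and returns the Location header's value, or NULL when there is no redirect.
 * $cache stands in for a database table keyed by short URI.
 */
function resolveOnce(string $shortUri, array &$cache, callable $fetchLocation, int $maxHops = 5): string {
    if (isset($cache[$shortUri])) {
        return $cache[$shortUri]; // already known: no HTTP request at all
    }
    $uri = $shortUri;
    for ($hop = 0; $hop < $maxHops; $hop++) {
        $next = $fetchLocation($uri);
        if ($next === NULL) {
            break; // reached the end of the redirect chain
        }
        $uri = $next; // follow chained redirects
    }
    $cache[$shortUri] = $uri; // store the destination for every later tweet
    return $uri;
}
```

The second, third, … retweet of the same short URI then costs the publisher nothing, because the function returns the stored destination without touching the network.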

A few more bots that need a revamp, or at least minor tweaks

I’ve added this section to express that besides my prominent example above, there’s more than one Twitter related app running not exactly squeaky clean bots. That’s not a “worst offenders” list, it’s not complete (I don’t want to reprint Twitter’s yellow pages), and bots are listed in no particular order (compiled from requests following the link in a test tweet, evaluating only a snapshot of less than 5 minutes, backed by historized logs.)

Tweetmeme’s TweetmemeBot coming from doesn’t fetch robots.txt. On their site they don’t explain why they don’t respect the robots exclusion protocol (REP). Apart from that it behaves.

OneRiot’s bot OneRiot/1.0 totally proves that this real time search engine has chosen a great name for itself. Performing 5+ GET as well as HEAD requests per link in a tweet (sometimes more) certainly counts as rioting. Requests for content come from different IPs, the host name pattern is flx1-ppp*, e.g. From the same IPs comes another bot: Me.dium/1.0, redirects to OneRiot doesn’t respect the REP.

Microsoft/Bing runs abusive bots following links in tweets, too. They fake browsers in the user agent, make use of IPs that don’t obviously point to Microsoft (no host name, e.g., …), send multiple GET requests per processed tweet, and don’t respect the REP. If you need more information, I’ve ranted about deceptive M$-bots before. Just a remark in case you’re going to block abusive MSN bot traffic:

MSN/Bing reps ask you not to block their spam bots when you’d like to stay included in their search index (that goes for real time search, too), but who really wants that? Their search index is tiny –compared to other search engines like Yahoo and Google–, their discovery crawling sucks –to get indexed you need to submit your URIs at their webmaster forum–, and in most niches you can count your yearly Bing SERP referrers using not even all fingers of your right hand. If your stats show more than that, check your raw logs. You’ll soon figure out that MSN/Bing spam bots fake SERP traffic in the HTTP_REFERER (guess where their “impressive” market share comes from).

FriendFeed’s bot FriendFeedBot/0.1 is well explained, and behaves. Its bot page even lists all its IPs, and provides you with an email addy for complaints (I never had a reason to use it). The FriendFeedBot made it on this list just because of its lack of REP support.

PostRank’s bot PostRank/2.0 comes from Amazon IPs. It doesn’t respect the REP, and does more than one request per URI found in one single tweet.

MarkMonitor operates a bot faking browser requests, coming from * (,, …). Multiple requests per URI, no REP support.

Cuil’s bot provides an empty user agent name when following links in tweets, but fetches robots.txt like Cuil’s official crawler Twiceler. I didn’t bother to test whether this Twitter bot can be blocked following Cuil’s instructions for webmasters or not. It got included in this list for the suppressed user agent.

Twingly’s bot Twingly Recon coming from * doesn’t respect the REP, doesn’t name its owner, but does only few HEAD requests.

Many bots mimicking browsers come from Amazon, Rackspace, and other cloudy environments, so you can’t get hold of their owners without submitting a report-abuse form. You can identify such bots by sorting your access logs by IP addy. Those “browsers” which don’t request your images, CSS files, and so on, are most certainly bots. Of course, a human visitor having cached your images and CSS matches this pattern, too. So block only IPs that solely request your HTML output over a longer period of time (problematic with bots using DSL providers, AOL, …).
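If you’d rather not eyeball sorted logs, the check can be sketched in a few lines. Everything here is an assumption for illustration: the function name htmlOnlyIps(), the input format (a list of [ip, path] pairs you’ve already parsed out of your access log), the list of asset extensions, and the threshold of three page requests.

```php
<?php
/**
 * Return the IPs that requested pages but never a single asset.
 * $requests is a list of [ip, path] pairs taken from an access log;
 * the list of asset extensions is an assumption, adjust it to your site.
 */
function htmlOnlyIps(array $requests, int $minRequests = 3): array {
    $assetPattern = '/\.(css|js|gif|jpe?g|png|ico)(\?.*)?$/i';
    $pages = [];  // ip => number of non-asset requests
    $assets = []; // ip => TRUE once the IP fetched any asset
    foreach ($requests as [$ip, $path]) {
        if (preg_match($assetPattern, $path)) {
            $assets[$ip] = TRUE;
        } else {
            $pages[$ip] = ($pages[$ip] ?? 0) + 1;
        }
    }
    $suspects = [];
    foreach ($pages as $ip => $count) {
        if ($count >= $minRequests && !isset($assets[$ip])) {
            $suspects[] = $ip;
        }
    }
    sort($suspects);
    return $suspects;
}
```

Feed it logs covering a longer period, for the reason given above: a human with a warm browser cache looks exactly like a bot in a five minute snapshot.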

Blocking requests (with IPs belonging to consumer ISPs, or from Amazon and other dynamic hosting environments) with a user agent name like “LWP::Simple/5.808”, “PycURL/7.18.2”, “my6sense/1.0”, “Firefox” (just these 7 characters), “Java/1.6.0_16” or “libwww-perl/5.816” is sound advice. By the way, these requests sum up to an amount that would lead a “worst offenders” listing.

Then there are students doing research. I’m not sure I want to waste my resources on requests from Moscow’s “Institute for System Programming RAS”, which fakes unnecessary loads of human traffic (from,, …), for example.

When you analyze bot traffic following a tweet with many retweets, you’ll gather a way longer list of misbehaving bots. That’s because you’ll catch more 3rd party Twitter UIs when many Twitter users view their timeline. Not all Twitter apps route their short URI evaluation through their servers, so you might miss out on abusive requests coming from real users via client sided scripts.

Developers might argue that such requests “on behalf of the user” are neither abusive, nor count as bot traffic. I assure you, that’s crap, regardless a particular Twitter app’s architecture, when you count more than one evaluation request per (shortened) URI. For example Googlebot acts on behalf of search engine users too, but it doesn’t overload your server. It fetches each URI embedded in tweets only once. And yes, it processes all tweets out there.

How to do it the right way

Here is what a site owner can expect from a Twitter app’s Web robot:

A meaningful user agent

A Web robot must provide a user agent name that fulfills at least these requirements:

  • A unique string that identifies the bot. The unique part of this string must not change when the version changes (“somebot/1.0”, “somebot/2.0”, …).
  • A URI pointing to a page that explains what the bot is all about, names the owner, and tells how it can be blocked in robots.txt (like this or that).
  • A hint on the rendering engine used, for example “Mozilla/5.0 (compatible; …”.

A method to verify the bot

All IP addresses used by a bot should resolve to server names having a unique pattern. For example Googlebot comes only from servers named "crawl" + "-" + replace($IP, ".", "-") + "", e.g. “”. All major search engines follow this standard that enables crawler detection not solely relying on the easily spoofable user agent name.

Obeying the robots.txt standard

Webmasters must be able to steer a bot with crawler directives in robots.txt like “Disallow:”. A Web robot should fetch a site’s /robots.txt file before it launches a request for content, when it doesn’t have a cached version from the same day.

Obeying REP indexer directives

Indexer directives like “nofollow”, “noindex” et cetera must be obeyed. That goes for HEAD requests just chasing for a 301/302/307 redirect response code and a “location” header, too.

Indexer directives can be served in the HTTP response header with an X-Robots-Tag, and/or in META elements like the robots meta tag, as well as in LINK elements like rel=canonical and its corresponding headers.
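Since some bot developers claim that reading a response header is rocket science, here’s a rough sketch of the X-Robots-Tag part. The function names robotsDirectives() and mayFollow() are mine, and the sketch assumes you already have the raw header lines of the response at hand; META and LINK elements aren’t covered.

```php
<?php
/**
 * Collect indexer directives from raw HTTP response header lines.
 * Handles the generic form "X-Robots-Tag: noindex, nofollow". A token that
 * contains a colon (a bot-specific prefix such as "googlebot: noindex") is
 * dropped; any remaining directives are treated as applying to everyone,
 * which errs on the side of obeying.
 */
function robotsDirectives(array $headerLines): array {
    $directives = [];
    foreach ($headerLines as $line) {
        if (!preg_match('/^X-Robots-Tag:\s*(.*)$/i', trim($line), $m)) {
            continue; // some other header
        }
        foreach (explode(",", $m[1]) as $token) {
            $token = strtolower(trim($token));
            if ($token !== "" && strpos($token, ":") === FALSE) {
                $directives[$token] = TRUE;
            }
        }
    }
    return array_keys($directives);
}

/** TRUE when nothing in the headers forbids following the link. */
function mayFollow(array $headerLines): bool {
    $d = robotsDirectives($headerLines);
    return !in_array("nofollow", $d, TRUE) && !in_array("none", $d, TRUE);
}
```

A bot that only does HEAD requests gets these header lines for free, so there’s no excuse for ignoring them.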

Responsible behavior

As outlined above, requesting the same resources over and over doesn’t count as responsible behavior. Fetching or “HEAD’ing” a resource no more than once a day should suffice for every Twitter app’s needs.

Reprinting a page’s content, or just large quotes, doesn’t count as fair use. It’s Ok to grab the page title and a summary from a META element like “description” (or up to 250 characters from an article’s first paragraph) to craft links, for example - but not more! Also, showing images or embedding videos from the crawled page violates copyrights.

Conclusion, and call for action

If you suffer from rogue Twitter bot traffic, use the medium those bots live in to make their sins public knowledge. Identify the bogus bot’s owners and tweet the crap out of them. Look up their hosting services, find the report-abuse form, and submit your complaints. Most of these apps make use of the Twitter-API, and there are many spam report forms you can creatively use to ruin their reputation at Twitter. If you’ve an account at such a bogus Twitter app, then cancel it and encourage your friends to follow suit.

Don’t let the assclowns of the Twitter universe get away with theft!

I’d like to hear about particular offenders you’re dealing with, and your defense tactics as well, in the comments. Don’t be shy. Go rant away. Thanks in advance!


How to disagree on Twitter, machine-readable

With standard hyperlinks you can add a rel="crap nofollow" attribute to your A elements. But how do you tell search engine crawlers and other Web robots that you disagree with a link’s content, when you post the URI at Twitter or elsewhere?

You cannot rely on the HTML presentation layer of social media sites. Despite the fact that most of them add a condom to all UGC links, crawlers do follow those links. Nowadays crawlers grab tweets and their embedded links long before they bother to fetch the HTML pages. They fatten their indexers with contents scraped from feeds. That means indexers don’t (really) take the implicit disagreement into account.

As long as you operate your own URI shortener, there’s a solution.

Condomize URIs, not A elements

Here’s how to nofollow a plain link drop, where you’ve no control over link attributes like rel-nofollow:

  • Prerequisite: understanding the anatomy of a URI shortener.
  • Add an attribute like shortUri.suriNofollowed, boolean, default=false, to your shortened URIs database table. In the Web form where you create and edit short URIs, add a corresponding checkbox and update your affected scripts.
  • Make sure your search engine crawler detection is up-to-date.
  • Change the piece of code that redirects to the original URI:
    if ($isCrawler && $suriNofollowed) {
        header("HTTP/1.1 403 Forbidden redirect target", TRUE, 403);
        print "<html><head><title>This link is condomized!</title></head><body><p>Search engines are not allowed to follow this link: <code>$suriUri</code></p></body></html>";
        exit;
    }
    else {
        header("HTTP/1.1 301 Here you go", TRUE, 301);
        header("Location: $suriUri");
        exit;
    }

Here’s an example: This shortened URI takes you to a Bing SEO tip. Search engine crawlers get bagged in a 403 link condom.

Since you can’t test it yourself (user agent spoofing doesn’t work), here’s a header reported by Googlebot (requesting the condomized URI above) today:

HTTP/1.1 403 Forbidden
Date: Thu, 07 Jan 2010 10:19:16 GMT
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

The error page just says:
Title + H1: Link is nofollow'ed
P: Sorry, this shortened URI must not get followed by search engines.

If you can’t roll your own, feel free to make use of my URI Condomizer. Have fun condomizing crappy links on Twitter.


If you check “Nofollow” your URI gets condomized. That means, search engines can’t request it from the shortened URI, but users and other Web robots get redirected.


Sanitize links in your content feeds

Here’s a WordPress plug-in that sanitizes relative links and on-the-page links in your content feeds: feedLinkSanitizer. Why do you need it?

Because you end up with invalid links like if you don’t use it. Once the post phases out of the main page, the link points to nowhere in feedreaders and reprints.

In feeds, absolute links are mandatory. Make sure that not a single on-the-page link or relative link slips out of your site.

Relative links

When you put all links to your own stuff as /perma-link/ instead of http://your-blog/permalink/ you can serve your blog’s content from a different server / base URI (dev, move, …) without editing all internal links.

The downside is, that for various very good reasons (scrapers, search engines, whatnot) thou must not have relative links in your HTML. You might disagree, but read on.

The simple solution is: store relative links in your WordPress database, but output absolute links. Follow the hint in feedLinkSanitizer.txt to activate link sanitizing in your HTML. By default it changes only feed contents.

The plug-in changes /perma-link/ to in your posts, using the blog URI provided in your WordPress settings. It takes the current server name if this value is missing.

Fragment links

You can link to any DOM-ID in an HTML page, for example <a href="#tos">Table of contents</a> where ‘tos’ is the DOM-ID of an HTML element like <h2 id="tos">Table of contents</h2>. These on-the-page links even come with some SEO value, just in case you don’t care much about usability.

The plug-in changes #tos to in your posts. If you’ve set $sanitizeAllLinks = TRUE; in the plugin-code, an on-the-page link clicked on the blog’s main page will open the post, positioning to the DOM-ID.
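To make both rewrites tangible, here’s a minimal sketch of the idea. It is not the plug-in’s code: absolutizeLinks() is a made-up name, the caller has to pass in the blog URI (without trailing slash) and the post’s permalink, and only double-quoted href attributes are handled.

```php
<?php
/**
 * Rewrite root-relative and fragment-only href values to absolute URIs.
 * $blogUri (no trailing slash) and $permalink are supplied by the caller;
 * this is not the plug-in's actual code, just the idea behind it.
 */
function absolutizeLinks(string $html, string $blogUri, string $permalink): string {
    // href="/perma-link/" gets the blog URI prepended;
    // protocol-relative links (href="//host/...") are left alone
    $html = preg_replace_callback(
        '/\bhref="\/(?!\/)/i',
        fn($m) => 'href="' . $blogUri . '/',
        $html
    );
    // href="#tos" gets the post's permalink prepended
    $html = preg_replace_callback(
        '/\bhref="#/i',
        fn($m) => 'href="' . $permalink . '#',
        $html
    );
    return $html;
}
```

Links that are already absolute pass through untouched, so running the function twice over the same content does no harm.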

Download feedLinkSanitizer

I’m a launch-early kind of guy, so test it yourself. And: Use at your own risk. No warranty expressed or implied is provided.

If you use another CMS, download the plug-in anyway and adapt its code.

Credits for previous work go to Jon Thysell and Gerd Riesselmann.


How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily understand it as a part of my ongoing holy war on URI shorteners.

Assuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At first sight, searching for a free URI shortener script to implement it on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.

So if you’re not running and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the ability to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example when you operate a WordPress blog, you could code nearly everything below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

To provide enough context to get the advantages of a fully integrated URI shortener, vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:


There’s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If
exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path add a slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* Customize that, maybe you've to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = $_SERVER["HTTP_REFERER"] ?? ""; /* not every request carries a referrer */
$requestUserAgent = $_SERVER["HTTP_USER_AGENT"] ?? ""; /* some bots send no user agent */
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $_SERVER["PHP_SELF"]; /* the bare $PHP_SELF relied on register_globals */
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas how to block rogue (crappy, not behaving, …) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as search engine crawler: */
if ($isCrawler !== TRUE) {
    $crawlerName = NULL;
    $crawlerHost = NULL;
    $crawlerServer = NULL;
    if (stristr($requestUserAgent, "Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = "";}
    if (stristr($requestUserAgent, "Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = "";}
    if ($crawlerName != NULL) {
        /* reverse lookup: the host name must match the crawler's server name pattern */
        $reverseIp = @gethostbyaddr($requestIp);
        if (!stristr($reverseIp, $crawlerServer)) {
            $isCrawler = FALSE;
        }
        if ("$reverseIp" == "$requestIp") {
            $isCrawler = FALSE;
        }
        if ($isCrawler !== FALSE) {
            /* forward lookup: the host name must resolve back to the requesting IP */
            $chkIpAddyRev = @gethostbyname($reverseIp);
            if ("$chkIpAddyRev" == "$requestIp") {
                $isCrawler = TRUE;
                $crawlerHost = $reverseIp;
                // store the newly discovered crawler IP
            }
        }
    }
}

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler && "$crawlerName" == "Baiduspider") {
    $isRogueBot = TRUE;
}

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve such scams, block incoming! from the IP address range to ( to …) when the referrer is a Bing SERP. With this method you occasionally might block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.
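The notSafeAtWork() call above is just a name without a body. A minimal sketch could look like this; the three keywords are placeholders I made up, in real life you’d pull the list from a database table and feed it with whatever shows up in your referrer stats.

```php
<?php
/**
 * Hypothetical body for the notSafeAtWork() helper used above.
 * The keyword list is a placeholder; in practice you'd load it from a table.
 */
function notSafeAtWork(?string $referrer, array $keywords = ["porn", "viagra", "debt-consolidation"]): bool {
    if ($referrer === NULL || $referrer === "") {
        return FALSE; // no referrer, nothing to judge
    }
    foreach ($keywords as $keyword) {
        // case-insensitive substring match against the whole referrer URI
        if (stripos($referrer, $keyword) !== FALSE) {
            return TRUE;
        }
    }
    return FALSE;
}
```

Plain substring matching is crude and will produce false positives, which is exactly why a site in one of those niches needs something smarter.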

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from roughly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) ||
    stristr($requestUserAgent, "ThingFetcher")) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {
    header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);
    exit;
}

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if ("$serverName" != "$canonicalServerName") {
    header("HTTP/1.1 301 Please use the canonical URI", TRUE, 301);
    header("Location: $absoluteUri");
    header("X-Canonical-URI: $absoluteUri");
    header("Link: <$absoluteUri>; rel=canonical"); // experimental
    exit;
}

Subdomains are so 1999, also 2010 is the year of non-’.www’ URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = ""; statement. I’ll appreciate the traffic. ;)

When the server name is Ok, you should add some basic URI canonicalization routines here. For example add trailing slashes –if necessary–, and remove clutter from query strings.

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$testForUriClutter = $absoluteUri;
if (!empty($_GET)) {
    foreach ($_GET as $var => $crap) {
        if (stristr($var, "utm_")) {
            /* str_replace() expects search, replace, subject */
            $testForUriClutter = str_replace("&$var=$crap", "", $testForUriClutter);
            $testForUriClutter = str_replace("&amp;$var=$crap", "", $testForUriClutter);
            unset($_GET[$var]);
        }
    }
}
$uriPartsSanitized = parse_url($testForUriClutter);
$qs = $uriPartsSanitized["query"] ?? "";
$qs = str_replace("?", "", $qs);
if ("$qs" != ($uriParts["query"] ?? "")) {
    $canonicalUri = $scheme .$canonicalServerName .$requestScript;
    if (!empty($qs)) {
        $canonicalUri .= "?" .$qs;
    }
    if (!empty($uriParts["fragment"])) {
        $canonicalUri .= "#" .$uriParts["fragment"];
    }
    header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
    header("Location: $canonicalUri");
    exit;
}

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.

Before you’re going to serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts["query"]) && empty($uriParts["fragment"]) && is_file("$rootPath$requestUri")) {
    header("Content-type: " .getContentType("$rootPath$requestUri"), TRUE);
    /* getContentType($filename) returns a MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */
    readfile("$rootPath$requestUri");
    exit;
}
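In case you wonder what getContentType() might look like: here’s a minimal sketch that maps file extensions to MIME media types. The extension table and the application/octet-stream fallback are my assumptions, extend the list with whatever your site actually serves.

```php
<?php
/**
 * Hypothetical body for getContentType(): map a file extension to a MIME type.
 * The table is deliberately short; unknown extensions fall back to
 * application/octet-stream, so the result is never an empty string.
 */
function getContentType(string $filename): string {
    $types = [
        "jpg" => "image/jpeg", "jpeg" => "image/jpeg", "gif" => "image/gif",
        "png" => "image/png", "pdf" => "application/pdf", "txt" => "text/plain",
        "css" => "text/css", "js" => "application/javascript", "html" => "text/html",
    ];
    // pathinfo() returns "" when the file name has no extension
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    return $types[$ext] ?? "application/octet-stream";
}
```

A lookup by extension is cheap and predictable, which matters when every single request for an image passes through this code.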

If your dynamic stuff mimics static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
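if (substr($requestUri, 0, 6) == "/blog/") {
    /* assumption: WordPress lives in /blog/ and index.php is its front controller */
    require("$rootPath/blog/index.php");
    exit;
}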

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = "404";
$errSrc = "WordPress";
$errMsg = "The blog couldn’t make sense out of this request.";

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should look up the shortUri table for a matching $requestUri before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the balls and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update.

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned build the short URI on-the-fly $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites).

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus
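Two of the methods from the excursus (create a shortcut, retrieve the original URI with a broader query as fallback) can be sketched as follows. Take it with a grain of salt: makeShortcut() and resolveShortcut() are made-up names, and the $table array (suriShortcut => suriUri) merely stands in for the shortUri database table. With a real database the broader query would be a LIKE 'xxy%' on the indexed suriShortcut column.

```php
<?php
/** Derive a shortcut from the auto-increment primary key, as described above. */
function makeShortcut(int $suriId): string {
    return base_convert((string) $suriId, 10, 36);
}

/**
 * Look up a possibly mangled shortcut. $table maps suriShortcut => suriUri and
 * stands in for the shortUri table. Returns the original URI, NULL when nothing
 * matches, or FALSE when the broader query is ambiguous.
 */
function resolveShortcut(string $requested, array $table) {
    // drop query string and fragment, the leading slash, and trailing dots
    $path = (string) parse_url($requested, PHP_URL_PATH);
    $shortcut = rtrim(ltrim($path, "/"), ".");
    if ($shortcut === "") {
        return NULL;
    }
    if (isset($table[$shortcut])) {
        return $table[$shortcut]; // exact hit
    }
    // broader query: every stored shortcut that starts with the fragment
    $hits = [];
    foreach ($table as $stored => $uri) {
        if (strpos((string) $stored, $shortcut) === 0) {
            $hits[] = $uri;
        }
    }
    if (count($hits) === 1) {
        return $hits[0];
    }
    return count($hits) === 0 ? NULL : FALSE;
}
```

For example, makeShortcut(12345) gives "9ix". An ambiguous fragment returns FALSE, so the caller can show an error page with suggestions instead of guessing.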

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
    $httpResponseCode = "404";
    $errSrc = "sUri";
    $errMsg = "Invalid short URI. Shortcut resolves to more than one result.";
}
if (!empty($suriUri)) {
    if (!stristr($suriUri, $canonicalServerName)) {
        header("HTTP/1.1 301 Here you go", TRUE, 301);
        header("Location: $suriUri");
        exit;
    }
}

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resource from a lookup in this mapping table, then do a 301 redirect to the canonical URI.

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.
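Put together, the 301-vs-307 decision is a pair of table lookups. The sketch below uses two plain arrays in place of the mapping tables; pickRedirect() and both array names are invented for illustration.

```php
<?php
/**
 * Decide how to answer a request that matches no real resource.
 * Both arrays stand in for database tables (names are made up for this sketch):
 * $screwedUris maps known-bad URIs to their canonical URI (permanent, 301),
 * $transientUris maps stable "featured" URIs to today's target (temporary, 307).
 * Returns [statusCode, location] or NULL when no redirect applies.
 */
function pickRedirect(string $requestUri, array $screwedUris, array $transientUris): ?array {
    if (isset($screwedUris[$requestUri])) {
        return [301, $screwedUris[$requestUri]];
    }
    if (isset($transientUris[$requestUri])) {
        return [307, $transientUris[$requestUri]];
    }
    return NULL;
}
```

The caller then sends header("Location: " .$location, TRUE, $statusCode) and exits, or moves on to guessing the destination when the function returns NULL.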

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.
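
Extracting the search query could look roughly like the sketch below. It assumes the search engine passes the query in a q or p parameter, and the list of stripped operators is my guess at what you would want to remove; extractSearchQuery() is a hypothetical name:

```php
<?php
// Extracts the search phrase from a search engine referrer and strips
// operators such as site: and inurl:. Returns "" when nothing usable is found.
// Assumption: the query sits in a "q" or "p" parameter.
function extractSearchQuery(string $referer): string
{
    $queryString = parse_url($referer, PHP_URL_QUERY);
    if (!is_string($queryString)) {
        return "";
    }
    parse_str($queryString, $params); // also URL-decodes the values
    $query = "";
    foreach (["q", "p"] as $name) {
        if (isset($params[$name]) && is_string($params[$name])) {
            $query = $params[$name];
            break;
        }
    }
    // Remove operator:value tokens like site:example.com or inurl:foo.
    $query = (string) preg_replace('/\b(?:site|inurl|intitle):\S*/i', ' ', $query);
    // Collapse the whitespace left behind by the removal.
    return trim((string) preg_replace('/\s+/', ' ', $query));
}
```

An empty result means there is nothing to hand over to the site search, so fall through to the error page.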

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example, don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occurred, but your error page’s prominent message should consist of hints how the visitor can reach the expected content.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has already output something to the user. With a previous output of just one single byte you can’t send an HTTP response header. Hence prefix the header() statement with a ‘@’ to suppress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
    $httpResponseCode = "404";
}
if (empty($httpResponseCode)) {
    $httpResponseCode = "501"; // log for debugging
}
@header("HTTP/1.1 $httpResponseCode Shit happens", TRUE, intval($httpResponseCode));
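
As an alternative to the ‘@’ prefix, PHP’s headers_sent() tells you whether it’s already too late for headers, and where the premature output came from. A minimal sketch; sendErrorStatus() and its default reason phrase are hypothetical:

```php
<?php
// Sends the status line only if no output has been sent yet.
// Returns TRUE when the header was set, FALSE when it was too late.
function sendErrorStatus(string $httpResponseCode, string $reason = "Not Found"): bool
{
    if (headers_sent($file, $line)) {
        // Too late for headers: log where the premature output came from.
        error_log("err.php: output already started in $file on line $line");
        return false;
    }
    header("HTTP/1.1 $httpResponseCode $reason", TRUE, intval($httpResponseCode));
    return true;
}
```

The log entry points you to the script that needs fixing, which a silenced header() call can’t do.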

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).

All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.


URI canonicalization with an X-Canonical-URI HTTP header

Dear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I’m desperately in need of a good reason to praise your wisdom and whatnot. From this year’s x-mas wish list:

All search engines obey the X-Canonical-URI HTTP header

The rel=canonical link element is a great tool, at least if applied properly, but sometimes it’s a royal pain in the ass.

Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there’s no way to get a hand on the source code to change them.

Also, lots of resources can’t be stuffed with HTML’s link elements, for example dynamically created PDFs, plain text files, or images.

It’s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren’t a doable option, for example when the destination URI is #5 in a redirect chain that can’t get shortened because the redirects are performed by a 3rd party that doesn’t cooperate.

This one, on the other hand, is elegant and scalable:

if (messedUp($_SERVER["REQUEST_URI"])) {
    header("X-Canonical-URI: $canonicalUri");
}

header("Link: <>; rel=canonical");

Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.

By the way, having all URI canonicalization routines in one piece of code is way more transparent, and way better manageable, than a bazillion of isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.
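
To illustrate the “one piece of code” idea, here’s a rough sketch of such a central routine, using the Link header form shown above. Everything in it is an assumption made for illustration: the function names, the example.com host, and the rule that sid and utm_source parameters are the clutter to drop. Your own canonicalization rules will differ:

```php
<?php
// Hypothetical canonicalization rule: drop session and tracking parameters.
function canonicalUri(string $requestUri, string $host = "example.com"): string
{
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    parse_str((string) parse_url($requestUri, PHP_URL_QUERY), $params);
    unset($params["sid"], $params["utm_source"]);
    $query = http_build_query($params);
    return "http://$host$path" . ($query !== "" ? "?$query" : "");
}

// Returns the header line to send, or NULL when the request is already canonical.
function canonicalLinkHeader(string $requestUri, string $host = "example.com"): ?string
{
    $canonical = canonicalUri($requestUri, $host);
    if ($canonical === "http://$host$requestUri") {
        return null;
    }
    return "Link: <$canonical>; rel=\"canonical\"";
}

// Usage in the request handler, before any script or file is served
// (not executed here):
// $linkHeader = canonicalLinkHeader($_SERVER["REQUEST_URI"]);
// if ($linkHeader !== null) {
//     header($linkHeader);
// }
```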

Dear search engines, if you make that happen, I promise that I don’t tweet your products with a “#crap” hashtag for the whole rest of this year. Deal?

And yes, I know I’m somewhat late, two days before x-mas, but you’ve got smart developers, haven’t you? So please, go get your ‘code monkeys’ to work and surprise me. Thanks.


The anatomy of a deceptive Tweet spamming Google Real-Time Search

Minutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:


  • 1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
  • 1 landing page that makes the punter pull out the plastic in no time.
  • 1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
  • 1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.


Create a new text file and name it hot-topic.php or so. Then code:
$landingPageUri = "";
$trustedPageUri = "";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
    header("HTTP/1.1 307 Here you go today", TRUE, 307);
    header("Location: $trustedPageUri");
}
else {
    header("HTTP/1.1 301 Happy shopping", TRUE, 301);
    header("Location: $landingPageUri");
}

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.


  • Log in to Twitter and submit lots of nicely crafted, not too much keyword stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girlfriend’s romantic moments’.
  • On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
  • Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
  • Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!
