Archived posts from the 'Crap' Category

My Top 10 Predictions for 2012

  1. SEO refuses to die. In other words, SEOs who grasped HTML5 might survive. Flash on the other hand, died earlier.
  2. Google delays the launch of their mind-reading-search-implant (beta) after Altavista threatens to give away their babelfish earpiece for free.
  3. Yahoo launches a huge comment link spam attack in order to boost the ranking of its few remaining Web facilities at Bing.
  4. Earth becomes flat, at least on-line, after the presidential elections.
  5. Counting is overrated.

While working hard on tomorrow’s hangover, I remembered that posting drunk ain’t good for unknown reasons.

If you’ve nothing better to do, feel free to complete this post in the comments. Don’t. I hate experienced optimists.



 

Get the Google cop outta my shopping cart!

So now Google ranks my shopping SERPs by its opinion of customer service quality?

Do not want!

I’m perfectly satisfied with shopping search results ordered by relevancy and (link) popularity. I do not want Google to decide where I have to buy my stuff, just because an assclown treating his customers like shit got coverage in the NYT.

If I’m old enough to have free access to the Internet and a credit card, then I’m capable of checking out a Web shop before I buy. I don’t need to be extremely Web savvy to fire up a search for [XXX sucks] before I click on “add to cart”. Hey, even my 13yo son applies way more sophisticated methods. Google cannot and never will be able to create anything more reliable than my built-in bullshit-detector.

Of course, it’s Google’s search engine. Matt’s right when he states “two different court cases have held that our search results are our opinion and protected under 1st amendment”. The problem is, sometimes I disagree with Google’s opinions.

Expressing an opinion about a site’s customer service by not showing it on the SERPs that more than 60% of this planet’s population use to find stuff is a slippery slope. A very slippery slope. It means that for example I cannot buy a pair of shoes for $40 (time of delivery 10 days, free shipping), because Google only points me to shops that sell the same pair of shoes for $100 (plus fedex overnight fees). Since when did Google’s mission statement change to “organize the world’s shopping expeditions”? Maybe I didn’t get an important memo.

Not only that. Google is well known for producing heavy collateral damage when applying changes to commercial rankings. A simple software glitch could wipe out the best deals on the Web, or ruin totally legit businesses suffering from fraudulent review spam spread by their competitors.

And finally, cross your heart, do you trust a search engine that far? Do you really expect Google to sort out the Web for you, not even asking how much of Google’s opinion you want to get applied when it comes to judging what appears on your personal search results? Not that Google will ever implement a slider where you can tell how much of your common sense you’re willing to invest vs. Google’s choice of goog, er, good customer service …

Well, I could live with a warning put as an anchor text like “show what boatloads of ripped-off customers told Googlebot about XXX” or so, but I do want to get the whole picture, uncensored.

End of rant.

Let’s look at the algo change from a technical point of view:

Credit where credit is due, developing and deploying a filter that catches a fraudulent Web shop “gaming Google” out of billions of indexed pages within a few days is not trivial (which translates to ‘awesome job’ coming from a geek).

It’s not so astonishing that this filter also picked 100 clones of the jerk mentioned by the New York Times for Google’s newish shitlist. Of course it didn’t catch just another fishy site, same SOP, owned by the same guy. That makes it kinda hand job, just executed by an algorithm. Explained in my Twitter stream: “@DaveWiner I read that Google post as ‘We realize there is a problem that we can’t solve yet. We have a short term fix for this jerk.’”, or “so yeah, I stand by my statement: it’s a hand job to manipulate the press and keep the stock from moving.”

And that’s good news, at least for today’s shape of Google’s Web search. It means that Google does not yet rank the results of each and every search with commercial intent by Google’s rough estimate of the shop’s customer service quality.

Google’s ranking is still based on link popularity, so negative links are still a vote of confidence.

There are only so many not-totally-weak signals out there, and Google’s not to blame for heavily relying on one of the better ones: links. I don’t believe they’ll lower the importance of links anytime soon, at least not significantly. And why should they? I surely don’t want that. And I doubt it makes much sense, plus I doubt that Google can do that.

As for the meaning of links, well, I just hope that Google doesn’t try to guess intentions out of plain A elements and their context. That’s a must-fail project. I’ve developed some faith in the sanity and smartness of Google’s engineers over the years. I hope they won’t disappoint me now.

Of course one can express a link’s intention in a machine-readable way. For example with a microformat like VoteLinks. Unfortunately, nobody cares enough to actually make use of it.

Google’s very own misconception, er, microformat rel-nofollow, is even less reliable. Imagine a dead tired and overworked algo in the cellar of building 43 trying to figure out whether a particular link’s rel=”nofollow” was set

  • to mark a paid link
  • because the SEO next door said PageRank® hoarding is cool
  • because at the webmaster’s preferred hangout nofollow’ing links was the topic of week 53/2005
  • because the webmaster bought Google’s FUD and castrates all links except those leading to google.com just in case Google could penalize him for a badass one
  • to express that the link’s destination is a 404 page, so that the “PageRank™ leak”, er, link isn’t worth any link juice
  • because the author thankfully links back to a leading Web resource in his industry that linked to him as an honest recommendation, but is afraid of a reciprocal link penalty
  • because the author agrees with the linked page’s message, but doesn’t like the foul language used over there
  • because the author disagrees with the discussed, and therefore linked, destination page
  • just because some crappy CMS condomizes every 3rd link automatically for reasons not known to man

Well, not even all Googlers like it. In fact, some teams decided to ignore it because of its weakness and widespread abuse.

The above applies only to links embedded in markup that allows machine-readable tagging of links. Even if such tags were reliable, they wouldn’t cover all references, aka hyperlinks, on the Web. Think of PDF, Flash, some client-side scripting, … and what about the gazillions of un-tagged links out there, put by folks who never heard of microformats?

Also, nobody links out anymore. We paste URIs into tiny textareas limited to 140 characters that don’t have room for meta data like microformats at all. And since Bing as well as Google use links in tweets for ranking purposes (Web search and news), how the fuck could even a smartass algo decide whether a tweet’s link points to crap or gold? Go figure.

And please don’t get me started on a possible use of sentiment analysis in rankings. To summarize, “FAIL” is printed in big bold letters all over Google’s (or any search engine for that matter) approach to rank search results by the quality of customer service based on signals scraped from unstructured data crawled on the Interwebs. So please, for the sake of my thin wallet, DEAR GOOGLE DON’T EVEN TRY IT! Thanks in advance.



 

While doing evil, reluctantly: Size, er trust matters.

These Interwebs are a mess. One can’t trust anyone. Especially not link drops, since Twitter decided to break the Web by raping all of its URIs. Twitter’s sloppy URI gangbang became the Web’s biggest and most disgusting clusterfuck in no time.

I still can’t agree to the friggin’ “N” in SNAFU when it comes to URI shortening. Every time I’m doing evil myself at sites like bit.ly, I’m literally vomiting all over the ‘net, in Swahili, er, base36 pidgin.

Besides the fact that each and every shortened URI manifests a felonious design flaw, the major concern is that most –if not all– URI shorteners will die before the last URI they’ve shortened is irrevocably dead. And yes, shit happens all day long: RIP tr.im et al.

Letting shit happen is by no means a dogma. We shouldn’t throw away common sense and best practices when it comes to URI management, which, besides avoiding as many redirects as possible, includes risk management:

What if the great chief of Libya all of a sudden decides that gazillions of bit.ly-URIs redirecting punters to their desired smut aren’t exactly compatible with the Qur’an? All your bit.ly URIs will be defunct overnight, and because you rely on traffic from places you’ve spammed with your shortened URIs, you’ll be forced to downgrade your expensive hosting plan to a shitty freehost account that displays huge Al-Qaeda or even Weight-Watchers banners above the fold of your pathetic Web pages.

In related news, even the almighty Google just pestered the Interwebs with just another URI shortener’s website: Goo.gl. It promises stability, security, and speed.

Well, on the day it launched, I broke it with recursive chains of redirects, and meanwhile creative folks like Dave Naylor perhaps wrote a guide on “hacking goo.gl for fun and profit”. #abuse

Of course there are bugs in a brand new product. But Google is a company iterating code way faster than most Internet companies, and due to their huge user base and continuous testing under operating conditions they’re aware of most of their bugs. They’ll fix them eventually, and soon goo.gl –as promised– will be “the stablest, most secure, and fastest URL shortener on the Web”.

So, just based on the size of Google’s infrastructure, it seems goo.gl is going to be the most reliable one out of all evil URI shorteners. Kinda queen of all royal PITAs. But is this a good enough reason to actually use goo.gl? Not quite enough, yet.

Go ask a Googler “Can you guarantee that goo.gl will outlive the Internet?”. I got answers like “I agree with your concern. I thought about it myself. But I’m confident Google will try its very best to preserve that”. From an engineer’s perspective, all of them agree with my statement “URI shortening totally sucks ass”. But IRL the Interwebs are flooded with crappy shortURLs, and that’s not acceptable. They figured that URI shortening can’t be eliminated, so it had to be enhanced by a more reliable procedure. Hence bright folks like Muthu Muthusrinivasan, Devin Mullins, Ben D’Angelo et al created goo.gl, with mixed feelings.

That’s why I recommend the lesser evil. Not because Google is huge, has the better infrastructure, picked a better domain, and the whole shebang. I do trust these software engineers, because they think and act like me. Plus, they’ve got the resources.

I’m going goo.gl.
I’ll dump bit.ly etc.

Fineprint: However, I won’t throw away my very own URI shortener, because this evil piece of crap can do things the mainstream URI shorteners –including goo.gl– are still dreaming of, like preventing search engine crawlers from spotting affiliate links and such stuff. Shortening links alone doesn’t equal cloaking fishy links professionally.



 

Ditch the spam on SERPs, pretty please?

Say there’s a search engine that tries very hard to serve relevant results for long tail search queries. Maybe it even accepted that an algo change –supposed to wipe out shitloads of thin pages from its long tail search result pages (SERPs)– is referred to as #MayDay. One should think that this search engine isn’t exactly eager to annoy its users with crappy mash-up pages consisting of shabby stuff scraped from all known sources of duplicate content on the whole InterWebs.

Wrong.

Prominent SE spammers like Mahalo still flood the visible part of search indexes with boatloads of crap that never should be able to cheat its way onto any SERP, not even via a [site:spam.com] search. Learn more from Aaron and Michael, who’ve both invested their valuable time to craft detailed spam reports, to no avail.

Frustrating.

Wait. Why does a bunch of spammy Web pages create such a fuss? Because they’re findable in the search index. Of course a search engine must crawl all the WebSpam out there, and its indexer has to judge the value of all the content it gets fed with. But there’s absolutely no need to bother the query engine, which gathers and ranks the stuff presented on the SERPs, with crap like that.

Dear Google, why do you annoy your users with spam created by “a scheme that your automated system handles quite well” at all? Those awesome spam filters should just flag crappy pages as not-SERP-worthy, so that they can never see the daylight at google.com/search. I mean, why should any searcher be at risk of pulling useless search results from your index? Hopefully not because these misled searchers tend to click on lots of Google ads on said pages, right?

I’d rather enjoy an empty SERP for an exotic search query, than suffer from a single link to a useless page plastered with huge ads, even if it comes with a tiny portion of stolen content that might be helpful if pointing to the source.

Do you feel like me? Speak out!

Hey Google, I dislike spam on your SERPs! #spam-report Tweet Your Plea For Clean SERPs!



 

Google went belly-up: SERPs sneakily redirect to FPAs

I’m pissed. I do know I shouldn’t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.

What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google’s webmaster guidelines explicitly forbid such black hat tactics:

“Don’t use cloaking or sneaky redirects.” Google just did the latter with its very own SERPs. The search interface google.com/ie, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That’s a helpful redirect for some IE6 users who don’t suffer from an IT department stuck with this outdated browser, but it’s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, UA cloaking is the only way to heal this sneaky behavior.

“Don’t create pages with malicious behavior.” Google’s guilty, too. Instead of checking for the user’s browser, redirecting only IE6 requests from Google’s discontinued IE6 support (IE6 toolbar …) to the IE8 advertisement, whilst all other user agents get their desired search box, respectively their SERPs, under a google.com/search?output=ie&… URI, Google performs an unconditional redirect to a page that’s utterly useless and also totally unexpected for many searchers. I consider misleading redirects malicious.

“Avoid links to web spammers or ‘bad neighborhoods’ on the web.” I consider the propaganda for IE that Google displays instead of the search results I’d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous workarounds. (Ok, ok, ok … Google’s lack of geekiness doesn’t exactly count as violation of their webmaster guidelines, but it sounds good, doesn’t it?)
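For the geeks: the conditional redirect I’m asking for could look roughly like the sketch below. It’s a guess at the logic, not Google’s code. The function name is made up, and I assume IE6 can be spotted by the "MSIE 6." token in its user agent string. The two paths are the ones mentioned in this post, with the query string left out.

```python
# Paths taken from this post; everything else is a hypothetical sketch.
IE8_AD = "/toolbar/ie8/sidebar.html"
SEARCH_UI = "/search?output=ie"

def target_for(user_agent: str) -> str:
    """Pick where a request for /ie should end up, based on the browser."""
    # Assumption: only IE6 sends the token "MSIE 6." in its user agent.
    if "MSIE 6." in user_agent:
        return IE8_AD      # stubborn IE6 users get the upgrade ad
    return SEARCH_UI       # everybody else gets the search box / SERP

print(target_for("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
print(target_for("Mozilla/5.0 (Windows; U; Windows NT 6.1) Chrome/4.1"))
```

Yes, that’s user agent sniffing, which is exactly why I said UA cloaking is the only cure here.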

Hey Matt Cutts, about time to ban google.com/ie! Click to tweet that

Google’s very best search interface is history. Here is what you got under http://www.google.com/ie?num=100&hl=en&safe=off&q=minimalistic:

(Screenshot: Google's famous minimalistic search UI)

And here is where Google sneakily redirects you to when you load the SERP link above (even with Chrome!), namely http://www.google.com/toolbar/ie8/sidebar.html:

(Screenshot: Google's sneaky IE8 propaganda)

It’s sad that a browser vendor like Google (and yes, Google Chrome is my favorite browser) feels the need to mislead its users with propaganda for a competing browser that’s slower and doesn’t render everything as it should render it. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that’s a scandal. So, if you agree, please submit a spam report to Google:

Hey Matt Cutts, about time to ban google.com/ie! #spam-report Tweet Your Spam Report

2010-05-17 I’ve updated this pamphlet because it didn’t explain the “sneakiness” clearly enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and only stubborn IE6 users ads for a somewhat better browser.

2010-05-18 Q: You’re pissed solely because your SERP scraping scripts broke. A: Glad you’ve asked. Yes, I’ve scraped Google’s /ie search too. Not because I’m a privacy nazi like Daniel Brandt. I’ve just checked (my) rankings. However, when I spotted the redirects I didn’t even remember the location of the scripts that scraped this service, because I hadn’t looked at ranking reports for years. I’m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold “no”. I don’t give a fucking dead rat’s ass what ranking reports based on scraped SERPs could tell.



 

SEO Bullshit: Mimicking a file system in URIs

Way back in the WWW’s early Jurassic, micro computer based Web development tools sneakily began poisoning the formerly ideal world of the Internet. All of a sudden we saw ‘.htm’ URIs, because CP/M and later on PC-DOS file extensions were limited to 3 characters. Truncating the ‘language’ part of HTML was bad enough. Actually, fucking with well established naming conventions wasn’t just a malady, but a symptom of a worse world wide pandemic.

Unfortunately, in order to bring Web publishing to the mere mortals (folks who could afford a micro computer), software developers invented DOS-like restrictions the Web wasn’t designed for. Web design tools maintained files on DOS file systems. FTP clients managed to convert backslashes originating from DOS file systems to slashes on UNIX servers, and vice versa (long before NT 3.51 and IIS). Directory names / file names equalled URIs. Most Web sites were static.

None of those cheap but fancy PC based Web design tools came with a mapping of objects (locally stored as files back then) to URIs pointing to Web resources. Despite Tim Berners-Lee’s warnings (like “It is the duty of a Webmaster to allocate URIs which you will be able to stand by in 2 years, in 20 years, in 200 years. This needs thought, and organization, and commitment.”), the technology used to create a resource named its unique identifier (URI). That’s as absurd as wearing diapers a whole life long.

Newbie Web designers grew up with this flawed concept, and never bothered to research the Web’s fundamentals. In their limited view of the Web, a URI was a mirrored version of a file name and its location on their local machine, and everything served from /cgi-bin/ had to be blocked in robots.txt, because all dynamic stuff was evil.

Today, those former newbies consider themselves oldtimers. Actually, they’re still greenhorns, because they’ve never learned that URIs have nothing to do with files, directories, or a Web resource’s (current) underlying technology (as in .php3 for PHP version 3.x, .shtml for SSI, …).

Technology evolves, even changes, but (valuable) contents tend to stay. URIs should solely address a piece of content, they must not change when the technology used to serve those contents changes. That means strings like ‘.html’ or folder names must not be used in URIs.
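To make the point less abstract, here’s a minimal sketch of the missing piece: a lookup that maps a stable URI to a piece of content, no matter which technology renders it. All names, URIs and IDs are invented for illustration. The 301 for legacy file-system-like URIs is my assumption of how you’d clean up an existing mess, not something the argument above depends on.

```python
# Hypothetical lookup table: stable URI -> internal content ID.
CONTENT_BY_URI = {"/red-widget": 42}

# Hypothetical legacy URIs that leaked folder names and technology.
MOVED_PERMANENTLY = {"/shop/widgets/red-widget.php3": "/red-widget"}

def route(path: str):
    """Return (HTTP status, content ID or redirect target) for a request path."""
    if path in MOVED_PERMANENTLY:
        # Old file-system-like URI: point clients to the stable one.
        return 301, MOVED_PERMANENTLY[path]
    if path in CONTENT_BY_URI:
        # The URI names the content; how it gets rendered is nobody's business.
        return 200, CONTENT_BY_URI[path]
    return 404, None

print(route("/red-widget"))
print(route("/shop/widgets/red-widget.php3"))
```

Swap PHP for whatever is trendy next year, and /red-widget stays /red-widget.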

Many of those notorious greenhorns offer their equally ignorant clients Web development and SEO services today. They might have managed to handle dynamic contents by now (thanks to osCommerce, WordPress and other CMSs), but they’re still stuck with ancient paradigms that were never meant to exist on the Internet.

They might have discovered that search engines are capable of crawling and indexing dynamic contents (URIs with query strings) nowadays, but they still treat them as dumb bots — as if Googlebot or Slurp weren’t more sophisticated than Altavista’s Scooter of 1998.

They might even develop trendy crap (version 2.0 with nifty rounded corners) today, but they still don’t get IT. Whatever IT is, it doesn’t deserve a URI like /category/vendor/product/color/size/crap.htm.

Why hierarchical URIs (expressing breadcrumbs or whatnot) are utter crap (SEO-wise as well as from a developer’s POV) is explained here:

SEO Toxin

 

I’ve published my rant “Directory-Like URI Structures Are SEO Bullshit” on SEO Bullshit dot com for a reason.

You should keep an eye on this new blog. Subscribe to its RSS feed. Watch its Twitter account.

If it’s about SEO and it’s there, it’s most probably bullshit. If it’s bullshit, avoid it.

If you plan to spam the SEO blogosphere with your half-assed newbie thoughts (especially when you’re an unconvinceable ‘oldtimer’), consider obeying this rule of thumb:

The top minus one reason to publish SEO stupidity is: You’ll end up here.

Of course that doesn’t mean newbies shouldn’t speak out. I’m just sick of newbies who sell their half-assed brain farts as SEO advice to anyone. Noobs should read, ask, listen, learn, practice, evolve. Until they become pros. As a plain Web developer, I can tell from my own experience that listening to SEO professionals is worth every minute of your time.



 

How brain-amputated developers created the social media plague

The bot playground commonly referred to as “social media” is responsible for shitloads of absurd cretinism.

For example Twitter, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate 27% valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand for cheap viagra).

Bazillions of other bots [type C] retweet bot [type B] generated crap and create lists of bots [type A, B, C]. In rare cases when a non-bot tries to participate in Twitter, the uber-bot [type T] prevents the whole bot network from negative impacts by serving a 503 error to the homunculus’ browser.

This pamphlet is about the idiocy of a particular subclass of bots [type S] that sneakily work in the underground stealing money from content producers, and about their criminal (though brain-dead) creators. May they catch the swine flu, or at least pox or cholera, for the pest they’ve brought to us.

The Twitter pest that costs you hard earned money

WTF am I ranting about? The technically savvy reader, familiar with my attitude, has already figured out that I’ve read way too many raw logs. For the sake of a common denominator, I encourage you to perform a tiny real-world experiment:

  • Publish a great and linkworthy piece of content.
  • Tweet its URI (not shortened - message incl. URI ≤ 139 characters!) with a compelling call for action.
  • Watch your server logs.
  • Puke. Vomit increases with every retweet.

So what happens on your server? A greedy horde of bots pounces on every tweet containing a link, requesting its content. That’s because on Twitter all URIs are suspected to be shortened (learn why Twitter makes you eat shit). This uncalled-for –IOW abusive– bot traffic burns your resources, and (with a cheap hosting plan) it can keep your followers from reading your awesome article and prevent them from clicking on your carefully selected ads.

Those crappy bots not only cost you money because they keep your server busy and increase your bandwidth bill, they actively decrease your advertising revenue because your visitors hit the back button when your page isn’t responsive due to the heavy bot traffic. Even if you’ve great hosting, you probably don’t want to burn money, not even pennies, right?

Bogus Twitter apps and their modus operandi

If every Twitter&Crap-mashup looked up each URI only once, it wouldn’t be such a mess. Actually, some of these crappy bots request your stuff 10+ times per tweet, and again for each and every retweet. That means, the more popular your content becomes, the more bot traffic it attracts.

Most of these bots don’t obey robots.txt, that means you can’t even block them applying Web standards (learn how to block rogue bots). Topsy, for example, does respect the content producer — so morons using “Python-urllib/1.17” or “AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)” could obey the Robots Exclusion Protocol (REP), too. Their developers are just too fucking lazy to understand such protocols that every respected service on the Web (search engines…) obeys.
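For the record, obeying the REP takes a handful of lines, even with Python’s standard library. The sketch below checks a URI against robots.txt rules using urllib.robotparser. The bot name "ExampleTweetBot" and the example.com URIs are invented, and a real bot would fetch /robots.txt from the target host (once, cached) instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Tell whether robots.txt allows this user agent to request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt of a publisher who is fed up with one particular bot.
ROBOTS_TXT = """\
User-agent: ExampleTweetBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(may_fetch(ROBOTS_TXT, "ExampleTweetBot/1.0", "http://example.com/article"))
print(may_fetch(ROBOTS_TXT, "SomeOtherBot/2.0", "http://example.com/article"))
```

A bot that gets a "no" here simply doesn’t send the request. That’s the whole protocol.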

Some of these bots even provide an HTTP_REFERER to lure you into viewing the website operated by their shithead of developer when you’re viewing your referrer stats. Others fake Web browsers in their user agent string, just in case you’re not smart enough to smell shit that really stinks (IOW browser-like requests that don’t fetch images, CSS files, and so on).

One of the worst offenders is outing itself as “ThingFetcher” in the user agent string. It’s hosted by Rackspace, which is a hosting service that obviously doesn’t care much about its reputation. Otherwise these guys would have reacted to my various complaints WRT “ThingFetcher”. By the way, Robert Scoble represents Rackspace, you could drop him a line if ThingFetcher annoys you, too.

ThingFetcher sometimes requests a (shortened) URI 30 times per second, from different IPs. It can get worse when a URI gets retweeted often. This malicious piece of code doesn’t obey robots.txt, and doesn’t cache results. Also, it’s too dumb to follow chained redirects, by the way. It doesn’t even publish its results anywhere, at least I couldn’t find the fancy URIs I’ve fed it with in Google’s search index.

In ThingFetcher’s defense, its developer might say that it performs only HEAD requests. Well, it’s true that HEAD requests provoke only an HTTP response header. But: the script invoked gets completely processed, just the output is trashed.

That means, the Web server has to deal with the same load as with a GET request, it just deletes the content portion (the completely formatted HTML page) when responding, after counting its size to send the Content-Length response header. Do you really believe that I don’t care about machine time? For each of your utterly useless bogus requests I could have my server deliver ads to a human visitor, who pulls the plastic if I’m upselling the right way (I do, usually).
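If you don’t believe the HEAD story, here’s a toy model of a dynamic page handler. It’s a sketch of the behavior described above, not the code of any particular Web server; render_page() is a made-up stand-in for templates, database queries and ad selection.

```python
STATS = {"renders": 0}

def render_page() -> bytes:
    """Stand-in for the expensive part of serving a dynamic page."""
    STATS["renders"] += 1
    return b"<html><body>My awesome article</body></html>"

def handle(method: str):
    """Return (headers, body) for a GET or HEAD request."""
    body = render_page()  # runs for HEAD too, the size is needed below
    headers = {"Content-Length": str(len(body))}
    if method == "HEAD":
        body = b""        # the completely rendered page gets trashed
    return headers, body

get_headers, get_body = handle("GET")
head_headers, head_body = handle("HEAD")
print(STATS["renders"], head_headers == get_headers, len(head_body))
```

Two requests, two full renders, and one of them delivers nothing to anybody.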

Unfortunately, ThingFetcher is not the only bot that does a lookup for each URI embedded in a tweet, per tweet processed. Probably the overall number of URIs that appear only once is bigger than the number of URIs that appear quite often while a retweet campaign lasts. That means that doing HTTP requests is cheaper for the bot’s owner, but on the other hand that’s way more expensive for the content producer, and the URI shortening services involved as well.

ThingFetcher update: The owners of ThingFetcher are now aware of the problem, and will try to fix it asap (more information). Now that I know who’s operating the Twitter app owning ThingFetcher, I’ve removed some insults from above, because they’d no longer address an anonymous developer, but bright folks who’ve just failed once. Too sad that Brizzly didn’t reply earlier to my attempts to identify ThingFetcher’s owner.

As a content producer I don’t care about the costs of any Twitter application that processes Tweets to deliver anything to its users. I care about my costs, and I can perfectly live without such a crappy service. Liberally, I can allow one single access per (shortened) URI to figure out its final destination, but I can’t tolerate such thoughtless abuse of my resources.

Every Twitter related “service” that does multiple requests per (shortened) URI embedded in a tweet is guilty of theft and pilferage. Actually, that’s an understatement, because these raids cost publishers an enormous sum across the Web.

These fancy apps shall maintain a database table storing the destination of each redirect (chain) accessible by its short URI. Or leave the Web, respectively pay the publishers. And by the way, Twitter should finally end URI shortening. Not only does it break the Internet, it’s way too expensive for all of us.
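Such a table isn’t rocket science. Below is a minimal sketch using SQLite from Python’s standard library. The class name, the table layout and the injected resolve function are my assumptions. A real app would plug in an HTTP client that follows the redirect chain, and would probably expire entries at some point.

```python
import sqlite3
from typing import Callable

class RedirectCache:
    """Remember the final destination of each short URI, so it gets resolved only once."""

    def __init__(self, resolve: Callable[[str], str], db_path: str = ":memory:"):
        self._resolve = resolve  # the expensive lookup that hits publishers' servers
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS redirects ("
            "short_uri TEXT PRIMARY KEY, final_uri TEXT NOT NULL)"
        )

    def final_destination(self, short_uri: str) -> str:
        row = self._db.execute(
            "SELECT final_uri FROM redirects WHERE short_uri = ?", (short_uri,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no request leaves the building
        final_uri = self._resolve(short_uri)
        self._db.execute(
            "INSERT INTO redirects (short_uri, final_uri) VALUES (?, ?)",
            (short_uri, final_uri),
        )
        self._db.commit()
        return final_uri
```

With that in place, a URI retweeted a thousand times costs the publisher one request instead of a thousand (or ten thousand).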

A few more bots that need a revamp, or at least minor tweaks

I’ve added this section to express that besides my prominent example above, there’s more than one Twitter related app running not exactly squeaky clean bots. That’s not a “worst offenders” list, it’s not complete (I don’t want to reprint Twitter’s yellow pages), and bots are listed in no particular order (compiled from requests following the link in a test tweet, evaluating only a snapshot of less than 5 minutes, backed by historized logs.)


Tweetmeme’s TweetmemeBot coming from eagle.favsys.net doesn’t fetch robots.txt. On their site they don’t explain why they don’t respect the robots exclusion protocol (REP). Apart from that it behaves.

OneRiot’s bot OneRiot/1.0 totally proves that this real time search engine has chosen a great name for itself. Performing 5+ GET as well as HEAD requests per link in a tweet (sometimes more) certainly counts as rioting. Requests for content come from different IPs, the host name pattern is flx1-ppp*.lvdi.net, e.g. flx1-ppp47.lvdi.net. From the same IPs comes another bot: Me.dium/1.0, me.dium.com redirects to oneriot.com. OneRiot doesn’t respect the REP.

Microsoft/Bing runs abusive bots following links in tweets, too. They fake browsers in the user agent, make use of IPs that don’t obviously point to Microsoft (no host name, e.g. 65.52.19.122, 70.37.70.228 …), send multiple GET requests per processed tweet, and don’t respect the REP. If you need more information, I’ve ranted about deceptive M$-bots before. Just a remark in case you’re going to block abusive MSN bot traffic:

MSN/Bing reps ask you not to block their spam bots when you’d like to stay included in their search index (that goes for real time search, too), but who really wants that? Their search index is tiny –compared to other search engines like Yahoo and Google–, their discovery crawling sucks –to get indexed you need to submit your URIs at their webmaster forum–, and in most niches you can count your yearly Bing SERP referrers using not even all fingers of your right hand. If your stats show more than that, check your raw logs. You’ll soon figure out that MSN/Bing spam bots fake SERP traffic in the HTTP_REFERER (guess where their “impressive” market share comes from).

FriendFeed’s bot FriendFeedBot/0.1 is well explained, and behaves. Its bot page even lists all its IPs, and provides you with an email addy for complaints (I never had a reason to use it). The FriendFeedBot made it on this list just because of its lack of REP support.

PostRank’s bot PostRank/2.0 comes from Amazon IPs. It doesn’t respect the REP, and does more than one request per URI found in one single tweet.

MarkMonitor operates a bot faking browser requests, coming from *.embarqhsd.net (va-71-53-201-211.dhcp.embarqhsd.net, va-67-233-115-66.dhcp.embarqhsd.net, …). Multiple requests per URI, no REP support.

Cuil’s bot provides an empty user agent name when following links in tweets, but fetches robots.txt like Cuil’s official crawler Twiceler. I didn’t bother to test whether this Twitter bot can be blocked following Cuil’s instructions for webmasters or not. It got included in this list for the suppressed user agent.

Twingly’s bot Twingly Recon coming from *.serverhotell.net doesn’t respect the REP, doesn’t name its owner, but does only few HEAD requests.

Many bots mimicking browsers come from Amazon, Rackspace, and other cloudy environments, so you can’t get hold of their owners without submitting a report-abuse form. You can identify such bots by sorting your access logs by IP addy. Those “browsers” which don’t request your images, CSS files, and so on, are most certainly bots. Of course, a human visitor having cached your images and CSS matches this pattern, too. So block only IPs that solely request your HTML output over a longer period of time (problematic with bots using DSL providers, AOL, …).
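The log check described above can be sketched in a few lines of Python. This is only a rough illustration: it assumes Apache’s “combined” log format, and the asset extension list and the `min_requests` threshold are my own placeholders, not values taken from any real setup.

```python
import re
from collections import defaultdict

# Assumption: Apache "combined" log format. Adjust the pattern for your server.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)[^"]*" \d{3}')

# Hypothetical list of file extensions that count as page assets.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def html_only_ips(log_lines, min_requests=3):
    """Return IPs that sent at least min_requests requests and never asked for an asset."""
    total = defaultdict(int)
    assets = defaultdict(int)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that don't look like access log entries
        ip, path = match.groups()
        total[ip] += 1
        # Ignore the query string when looking at the file extension.
        if path.split("?", 1)[0].lower().endswith(ASSET_EXTENSIONS):
            assets[ip] += 1
    return sorted(
        ip for ip, count in total.items()
        if count >= min_requests and assets[ip] == 0
    )
```

Feed it a log that covers a longer period, otherwise human visitors with a warm browser cache end up on the list, too.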

Blocking requests (with IPs belonging to consumer ISPs, or from Amazon and other dynamic hosting environments) with a user agent name like “LWP::Simple/5.808”, “PycURL/7.18.2”, “my6sense/1.0”, “Firefox” (just these 7 characters), “Java/1.6.0_16” or “libwww-perl/5.816” is sound advice. By the way, these requests sum up to an amount that would lead a “worst offenders” listing.
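A user agent check along these lines might look like the Python sketch below. Matching the library names by prefix, so that other version numbers are caught as well, is my assumption; the bare “Firefox” has to match exactly, because real Firefox user agents are much longer. The IP range part of the rule is left out here.

```python
# Assumption: any version of these HTTP libraries and tools is unwanted.
LIBRARY_PREFIXES = ("LWP::Simple/", "PycURL/", "my6sense/", "Java/", "libwww-perl/")

# User agent names that are only suspicious when nothing else is sent.
BARE_NAMES = {"Firefox"}


def is_suspicious_agent(user_agent):
    """True for the bare 'Firefox' name or a known library prefix."""
    agent = (user_agent or "").strip()
    return agent in BARE_NAMES or agent.startswith(LIBRARY_PREFIXES)
```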

Then there are students doing research. I’m not sure I want to waste my resources on requests from Moscow’s “Institute for System Programming RAS”, which fakes unnecessary loads of human traffic (from efrate.ispras.ru, narva.ispras.ru, dvina.ispras.ru …), for example.

When you analyze bot traffic following a tweet with many retweets, you’ll gather a way longer list of misbehaving bots. That’s because you’ll catch more 3rd party Twitter UIs when many Twitter users view their timeline. Not all Twitter apps route their short URI evaluation through their servers, so you might miss out on abusive requests coming from real users via client sided scripts.

Developers might argue that such requests “on behalf of the user” are neither abusive, nor count as bot traffic. I assure you, that’s crap, regardless of a particular Twitter app’s architecture, when you count more than one evaluation request per (shortened) URI. For example Googlebot acts on behalf of search engine users too, but it doesn’t overload your server. It fetches each URI embedded in tweets only once. And yes, it processes all tweets out there.

How to do it the right way

Here is what a site owner can expect from a Twitter app’s Web robot:

A meaningful user agent

A Web robot must provide a user agent name that fulfills at least these requirements:

  • A unique string that identifies the bot. The unique part of this string must not change when the version changes (“somebot/1.0”, “somebot/2.0”, …).
  • A URI pointing to a page that explains what the bot is all about, names the owner, and tells how it can be blocked in robots.txt (like this or that).
  • A hint on the rendering engine used, for example “Mozilla/5.0 (compatible; …”.
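Put together, the three requirements above could produce a user agent string like the one built in this Python sketch. The layout with a “+” in front of the info URI follows what Googlebot sends; the function names and the example URI are made up for illustration.

```python
import re


def build_user_agent(bot_name, version, info_uri):
    """Compose a user agent with engine hint, stable bot name, and info page."""
    return f"Mozilla/5.0 (compatible; {bot_name}/{version}; +{info_uri})"


def bot_token(user_agent):
    """Extract the version-independent bot name, or None if there is none."""
    match = re.search(r"compatible; ([^/;]+)/", user_agent)
    return match.group(1) if match else None
```

A webmaster can then block “somebot” once in robots.txt, and the rule keeps working when version 2.0 ships.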

A method to verify the bot

All IP addresses used by a bot should resolve to server names having a unique pattern. For example Googlebot comes only from servers named "crawl" + "-" + replace($IP, ".", "-") + ".googlebot.com", e.g. “crawl-66-249-71-135.googlebot.com”. All major search engines follow this standard that enables crawler detection not solely relying on the easily spoofable user agent name.
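Here’s a minimal Python sketch of such a check, using the Googlebot host name pattern quoted above. The forward lookup that confirms the host name resolves back to the same IP is my addition. The two lookup parameters are hypothetical hooks so the function can be tried without touching DNS.

```python
import socket


def expected_googlebot_host(ip):
    """Build the host name from the pattern: crawl-<IP with dashes>.googlebot.com"""
    return "crawl-" + ip.replace(".", "-") + ".googlebot.com"


def verify_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, compare with the pattern, then confirm forward."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or socket.gethostbyname
    try:
        host = reverse_lookup(ip)
        if host != expected_googlebot_host(ip):
            return False
        return forward_lookup(host) == ip
    except OSError:
        # No PTR record, or the host name doesn't resolve: not verified.
        return False
```

A request that claims to be Googlebot in its user agent but fails this check is most probably a fake.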

Obeying the robots.txt standard

Webmasters must be able to steer a bot with crawler directives in robots.txt like “Disallow:”. A Web robot should fetch a site’s /robots.txt file before it launches a request for content, when it doesn’t have a cached version from the same day.
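A bot could implement this rule roughly like the Python sketch below, which relies on the standard library’s robots.txt parser. The class name and the `fetch_text` callable (host in, robots.txt body out) are hypothetical; a real bot would plug in its HTTP client there.

```python
import datetime
from urllib import robotparser


class RobotsCache:
    """Keeps one parsed robots.txt per host and refreshes it once a day."""

    def __init__(self, fetch_text, today=datetime.date.today):
        self._fetch_text = fetch_text  # hypothetical callable: host -> robots.txt text
        self._today = today
        self._cache = {}  # host -> (date fetched, parser)

    def allowed(self, bot_name, host, path):
        cached = self._cache.get(host)
        if cached is None or cached[0] != self._today():
            # No copy from today: fetch robots.txt before requesting content.
            parser = robotparser.RobotFileParser()
            parser.parse(self._fetch_text(host).splitlines())
            cached = (self._today(), parser)
            self._cache[host] = cached
        return cached[1].can_fetch(bot_name, path)
```

The bot asks `allowed()` before every content request, so a “Disallow:” line added today is honored today.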

Obeying REP indexer directives

Indexer directives like “nofollow”, “noindex” et cetera must be obeyed. That goes for HEAD requests just chasing for a 301/302/307 redirect response code and a “location” header, too.

Indexer directives can be served in the HTTP response header with an X-Robots-Tag, and/or in META elements like the robots meta tag, as well as in LINK elements like rel=canonical and its corresponding headers.
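To illustrate, this Python sketch collects indexer directives from both places, the X-Robots-Tag header and the robots meta tag. It’s deliberately simplified: it ignores bot-specific prefixes in the header, bot-specific META names, and rel=canonical. The function names are my own.

```python
from html.parser import HTMLParser


def _split(value):
    """Turn 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.directives |= _split(attributes.get("content") or "")


def indexer_directives(headers, html=""):
    """Merge directives from the response headers and the HTML source."""
    directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives |= _split(value)
    parser = _RobotsMetaParser()
    parser.feed(html)
    return directives | parser.directives
```

Note that a HEAD request only ever sees the header part, which is why the X-Robots-Tag matters for bots chasing redirects.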

Responsible behavior

As outlined above, requesting the same resources over and over doesn’t count as responsible behavior. Fetching or “HEAD’ing” a resource no more than once a day should suffice for every Twitter app’s needs.

Reprinting a page’s content, or just large quotes, doesn’t count as fair use. It’s Ok to grab the page title and a summary from a META element like “description” (or up to 250 characters from an article’s first paragraph) to craft links, for example - but not more! Also, showing images or embedding videos from the crawled page violates copyrights.

Conclusion, and call for action

If you suffer from rogue Twitter bot traffic, use the medium those bots live in to make their sins public knowledge. Identify the bogus bot’s owners and tweet the crap out of them. Look up their hosting services, find the report-abuse form, and submit your complaints. Most of these apps make use of the Twitter API, and there are many spam report forms you can creatively use to ruin their reputation at Twitter. If you’ve an account at such a bogus Twitter app, then cancel it and encourage your friends to follow suit.

Don’t let the assclowns of the Twitter universe get away with theft!

I’d like to hear about particular offenders you’re dealing with, and your defense tactics as well, in the comments. Don’t be shy. Go rant away. Thanks in advance!




The anatomy of a deceptive Tweet spamming Google Real-Time Search

Minutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

  • 1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
  • 1 landing page that makes the punter pull out the plastic in no time.
  • 1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
  • 1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.

Preparation

Create a new text file and name it hot-topic.php or so. Then code:
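<?php
// Cloaking demo: Googlebot gets the trusted page, everybody else the landing page.
$landingPageUri = "http://affiliate-program.com/?your-aff-id";
$trustedPageUri = "http://google.com/something.py";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
    // Crawler: temporary redirect to the authority page
    header("HTTP/1.1 307 Here you go today", TRUE, 307);
    header("Location: $trustedPageUri");
}
else {
    // Human visitor: permanent redirect to the sales pitch
    header("HTTP/1.1 301 Happy shopping", TRUE, 301);
    header("Location: $landingPageUri");
}
exit;
?>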

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

  • Login to Twitter and submit lots of nicely crafted, not too much keyword stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girl friend’s romantic moments’.
  • On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
  • Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
  • Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!




Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

The canonical URI ends at “.html”; everything after the question mark is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source utm_medium utm_campaign utm_content utm_term) the number of cluttered URIs pointing to the same piece of content can add up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo didn’t index GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurishly extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.
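To see what that means in practice, here’s a small Python sketch that parses the example URI from above in two ways: once splitting at the first “?” (what standard URI parsers do), and once splitting at the last “?” (what the scripts described above do). The function names are mine.

```python
from urllib.parse import parse_qs, urlsplit

URI = ("http://www.example.com/search.aspx?Query=musical+mobile"
       "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies")


def params_from_first_delimiter(uri):
    """Standard parsing: the query string starts at the first '?'."""
    return parse_qs(urlsplit(uri).query)


def params_from_last_delimiter(uri):
    """Naive parsing: the query string starts at the last '?'."""
    return parse_qs(uri.rsplit("?", 1)[1])


# The search term gets polluted with tracking clutter:
# {'Query': ['musical mobile?utm_source=Referral'],
#  'utm_medium': ['Internet'], 'utm_campaign': ['celebritybabies']}
print(params_from_first_delimiter(URI))

# The search term is gone completely:
# {'utm_source': ['Referral'], 'utm_medium': ['Internet'],
#  'utm_campaign': ['celebritybabies']}
print(params_from_last_delimiter(URI))
```

Either way the landing page searches for something the user never asked for, or for nothing at all.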

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless of whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I’d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, or rather now counterproductive with regard to GA reporting. Google’s UTM clutter has an impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds –I really mean that, literally!–, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.
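Such a redirect script could be as simple as the following Python sketch. Everything in it is hypothetical: the function name, the lookup table from feed item to canonical URI, and the click log. I have no idea how Feedburner stores its data; this only shows the principle.

```python
import time


def feed_click_redirect(item_id, client_ip, canonical_uris, click_log, now=time.time):
    """Record the click, then answer with a 301 to the untouched canonical URI."""
    target = canonical_uris.get(item_id)
    if target is None:
        return 404, {}  # unknown feed item, nothing to redirect to
    # Store what the analytics script needs; the URI itself stays clean.
    click_log.append({"item": item_id, "ip": client_ip, "timestamp": now()})
    return 301, {"Location": target}
```

The tracking data lives on the server, and the URI that ends up in the browser’s address bar (and in copy+paste links) is the canonical one.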

Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.
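For the curious, the cleanup can be approximated in a few lines of Python. This is not the code behind the tool, just a sketch of the same idea. It assumes that every “?” after the first one is a misplaced delimiter, which would mangle URIs that legitimately carry an unencoded “?” in a parameter value.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_utm(uri):
    """Remove all utm_* parameters and surplus '?' delimiters from a URI."""
    parts = urlsplit(uri)
    # Treat stray "?" delimiters inside the query string like "&".
    pairs = [pair for pair in parts.query.replace("?", "&").split("&") if pair]
    kept = [pair for pair in pairs if not pair.lower().startswith("utm_")]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "&".join(kept), parts.fragment)
    )
```

Parameters that don’t start with “utm_” survive untouched, in their original order and encoding.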

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!




The most sexy browsers screw your analytics

Now that IE is quite unusable due to the lack of websites that support its non-standard rendering, and the current Firefox version suffers from various maladies, more and more users switch to browsers that are supposed to comply with Web standards, such as Chrome, Safari, or Opera.

Those sexy user agents execute client sided scripts in lightning speed, making surfers addicted to nifty rounded corners very very happy. Of course they come with massive memory leaks, but surfers who shut down their browser every once in a while won’t notice such geeky details.

Why is that bad news for Internet marketers? Because Chrome and Safari screw your analytics. Your stats are useless with regard to bookmarkers and type-in traffic. Your referrer stats lack all hits from Chrome/Safari users who have opened your landing page in a new tab or window.

Google’s Chrome and Apple’s Safari do not provide an HTTP_REFERER. (The typo is standardized, too.)

This bug was reported in September 2008. It’s not yet fixed. Not even in beta versions.

Guess from which (optional) HTTP header line your preferred stats tool compiles the search terms to create all the cool keyword statistics? Yup, that’s the HTTP_REFERER’s query string when the visitor came from a search result page (SERP). Especially on SERPs many users open links in new tabs. That means with every searcher switching to a sexy browser your keyword analysis becomes more useless.
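What a stats tool does with the referrer boils down to something like this Python sketch. The parameter name “q” is what Google’s SERP URIs use; other engines use different names, so treat the default as an assumption. The function name is made up.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(referrer, parameter="q"):
    """Return the search phrase from a SERP referrer, or None if there is none."""
    if not referrer:
        return None  # blank HTTP_REFERER: nothing to analyze
    values = parse_qs(urlsplit(referrer).query).get(parameter)
    return values[0] if values else None
```

With a blank HTTP_REFERER the function has nothing to work with, and the visit shows up as type-in traffic without any keywords.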

That’s not only an analytics issue. Many sites provide sensible functionality based on the referrer (the Web page a user came from), for example default search terms for site-search facilities gathered from SERP-referrers. Many sites evaluate the HTTP_REFERER to protect themselves from hotlinking, so their users can’t view the content they’ve paid for when they open a link in a new tab or window.

Passing a blank HTTP_REFERER when this information is available to the user agent is plain evil. Of course lots of so-called Internet security apps do this by default, but just because others do evil that doesn’t mean a top-notch Web browser like Safari or Chrome can get away with crap like this for months and years to come.

Please nudge the developers!

Here you go. Post in this thread why you want them to fix this bug asap. Tell the developers that you can’t live with screwed analytics, and that your site’s users rely on reliable HTTP_REFERERs. Even if you don’t run a website yourself, tell them that your favorite porn site bothers you with countless error messages instead of delivering smut, just because WebKit browsers are buggy.


You can test whether your browser passes the HTTP_REFERER or not: Go to this Google SERP. On the link to this post choose “Open link in new tab” (or window) in the context menu (right click over the link). Scroll down.

Your browser passed this HTTP_REFERER: None


