The bot playground commonly refered to as “social media” is responsible for shitloads of absurd cretinism.
For example Twitter, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate 27% valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand for cheap viagra).
Bazillions of other bots [type C] retweet bot [type B] generated crap and create lists of bots [type A, B, C]. In rare cases when a non-bot tries to participate in Twitter, the uber-bot [type T] prevents the whole bot network from negative impacts by serving a 503 error to the homunculus’ browser.
This pamphlet is about the idiocy of a particular subclass of bots [type S] that sneakily work in the underground stealing money from content producers, and about their criminal (though brain-dead) creators. May they catch the swine flu, or at least pox or cholera, for the pest they’ve brought to us.
The Twitter pest that costs you hard earned money
WTF I’m ranting about? The technically savvy reader, familiar with my attitude, has already figured out that I’ve read way too many raw logs. For the sake of a common denominator, I encourage you to perform a tiny real-world experiment:
- Publish a great and linkworthy piece of content.
- Tweet its URI (not shortened - message incl. URI ≤ 139 characters!) with a compelling call for action.
- Watch your server logs.
- Puke. Vomit increases with every retweet.
So what happens on your server? A greedy horde of bots pounces on every tweet containing a link, requesting its content. That’s because on Twitter all URIs are suspected to be shortened (learn why Twitter makes you eat shit). This uncalled-for –IOW abusive– bot traffic burns your resources, and (with a cheap hosting plan) it can hinder your followers to read your awesome article and prevent them from clicking on your carefully selected ads.
Those crappy bots not only cost you money because they keep your server busy and increase your bandwidth bill, they actively decrease your advertising revenue because your visitors hit the back button when your page isn’t responsive due to the heavy bot traffic. Even if you’ve great hosting, you probably don’t want to burn money, not even pennies, right?
Bogus Twitter apps and their modus operandi
If only every Twitter&Crap-mashup would lookup each URI once, that wouldn’t be such a mess. Actually, some of these crappy bots request your stuff 10+ times per tweet, and again for each and every retweet. That means, as more popular your content becomes, as more bot traffic it attracts.
Most of these bots don’t obey robots.txt, that means you can’t even block them applying Web standards (learn how to block rogue bots). Topsy, for example, does respect the content producer — so morons using “Python-urllib/1.17″ or “AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)” could obey the Robots Exclusion Protocol (REP), too. Their developers are just too fucking lazy to understand such protocols that every respected service on the Web (search engines…) obeys.
Some of these bots even provide an HTTP_REFERER to lure you into viewing the website operated by their shithead of developer when you’re viewing your referrer stats. Others fake Web browsers in their user agent string, just in case you’re not smart enough to smell shit that really stinks (IOW browser-like requests that don’t fetch images, CSS files, and so on).
One of the worst offenders is outing itself as “ThingFetcher” in the user agent string. It’s hosted by Rackspace, which is a hosting service that obviously doesn’t care much about its reputation. Otherwise these guys would have reacted to my various complaints WRT “ThingFetcher”. By the way, Robert Scoble represents Rackspace, you could drop him a line if ThingFetcher annoys you, too.
ThingFetcher sometimes requests a (shortened) URI 30 times per second, from different IPs. It can get worse when a URI gets retweeted often. This malicious piece of code doesn’t obey robots.txt, and doesn’t cache results. Also, it’s too dumb to follow chained redirects, by the way. It doesn’t even publish its results anywhere, at least I couldn’t find the fancy URIs I’ve feeded it with in Google’s search index.
In ThingFetcher’s defense, its developer might say that it performs only HEAD requests. Well, it’s true that HEAD request provoke only an HTTP response header. But: the script invoked gets completely processed, just the output is trashed.
That means, the Web server has to deal with the same load as with a GET request, it just deletes the content portion (the compelety formatted HTML page) when responding, after counting its size to send the
Content-Length response header. Do you really believe that I don’t care about machine time? For each of your
utterly useless requests I could have my server deliver ads to a human visitor, who pulls the plastic if I’m upselling the right way (I do, usually).
Unfortunately, ThingFetcher is not the only bot that does a lookup for each URI embedded in a tweet, per tweet processed. Probably the overall number of URIs that appear only once is bigger than the number of URIs that appear quite often while a retweet campaign lasts. That means that doing HTTP requests is cheaper for the bot’s owner, but on the other hand that’s way more expensive for the content producer, and the URI shortening services involved as well.
ThingFetcher update: The owners of ThingFetcher are now aware of the problem, and will try to fix it asap (more information). Now that I know who’s operating the Twitter app owning ThingFetcher,
I take back the insults above . Too sad that Brizzly didn’t reply earlier to my attempts to identify ThingFetcher’s owner.
As a content producer I don’t care about the costs of any Twitter application that processes Tweets to deliver anything to its users. I care about my costs, and I can perfecly live without such a crappy service. Liberally, I can allow one single access per (shortened) URI to figure out its final destination, but I can’t tolerate such thoughtless abuse of my resources.
Every Twitter related “service” that does multiple requests per (shortened) URI embedded in a tweet is guilty of theft and pilferage. Actually, that’s an understatement, because these raids cost publishers an enormous sum across the Web.
These fancy apps shall maintain a database table storing the destination of each redirect (chain) acessible by its short URI. Or leave the Web, respectively pay the publishers. And by the way, Twitter should finally end URI shortening. Not only it breaks the Internet, it’s way too expensive for all of us.
A few more bots that need a revamp, or at least minor tweaks
I’ve added this section to express that besides my prominent example above, there’s more than one Twitter related app running not exactly squeaky clean bots. That’s not a “worst offenders” list, it’s not complete (I don’t want to reprint Twitter’s yellow pages), and bots are listed in no particular order (compiled from requests following the link in a test tweet, evaluating only a snapshot of less than 5 minutes, backed by historized logs.)
Tweetmeme’s TweetmemeBot coming from eagle.favsys.net doesn’t fetch robots.txt. On their site they don’t explain why they don’t respect the robots exclusion protocol (REP). Apart from that it behaves.
OneRiot’s bot OneRiot/1.0 totally proves that this real time search engine has chosen a great name for itself. Performing 5+ GET as well as HEAD requests per link in a tweet (sometimes more) certainly counts as rioting. Requests for content come from different IPs, the host name pattern is
flx1-ppp*.lvdi.net, e.g. flx1-ppp47.lvdi.net. From the same IPs comes another bot: Me.dium/1.0, me.dium.com redirects to oneriot.com. OneRiot doesn’t respect the REP.
Microsoft/Bing runs abusive bots following links in tweets, too. They fake browsers in the user agent, make use of IPs that don’t obviously point to Microsoft (no host name, e.g. 126.96.36.199, 188.8.131.52 …), send multiple GET requests per processed tweet, and don’t respect the REP. If you need more information, I’ve ranted about deceptive M$-bots before. Just a remark in case you’re going to block abusive MSN bot traffic:
MSN/Bing reps ask you not to block their spam bots when you’d like to stay included in their search index (that goes for real time search, too), but who really wants that? Their search index is tiny –compared to other search engines like Yahoo and Google–, their discovery crawling sucks –to get indexed you need to submit your URIs at their webmaster forum–, and in most niches you can count your yearly Bing SERP referrers using not even all fingers of your right hand. If your stats show more than that, check your raw logs. You’ll soon figure out that MSN/Bing spam bots fake SERP traffic in the HTTP_REFERER (guess where their “impressive” market share comes from).
FriendFeed’s bot FriendFeedBot/0.1 is well explained, and behaves. Its bot page even lists all its IPs, and provides you with an email addy for complaints (I never had a reason to use it). The FriendFeedBot made it on this list just because of its lack of REP support.
PostRank’s bot PostRank/2.0 comes from Amazon IPs. It doesn’t respect the REP, and does more than one request per URI found in one single tweet.
MarkMonitor operates a bot faking browser requests, coming from *.embarqhsd.net (va-71-53-201-211.dhcp.embarqhsd.net, va-67-233-115-66.dhcp.embarqhsd.net, …). Multiple requests per URI, no REP support.
Cuil’s bot provides an empty user agent name when following links in tweets, but fetches robots.txt like Cuil’s offical crawler Twiceler. I didn’t bother to test whether this Twitter bot can be blocked following Cuil’s instructions for webmasters or not. It got included in this list for the supressed user agent.
Twingly’s bot Twingly Recon coming from *.serverhotell.net doesn’t respect the REP, doesn’t name its owner, but does only few HEAD requests.
Many bots mimicking browsers come from Amazon, Rackspace, and other cloudy environments, so you can’t get hold of their owners without submitting a report-abuse form. You can identify such bots by sorting your access logs by IP addy. Those “browsers” which don’t request your images, CSS files, and so on, are most certainly bots. Of course, a human visitor having cached your images and CSS matches this pattern, too. So block only IPs that solely request your HTML output over a longer period of time (problematic with bots using DSL providers, AOL, …).
Blocking requests (with IPs belonging to consumer ISPs, or from Amazon and other dynamic hosting environments) with a user agent name like “LWP::Simple/5.808″, “PycURL/7.18.2″, “my6sense/1.0″, “Firefox” (just these 7 characters), “Java/1.6.0_16″ or “libwww-perl/5.816″ is sound advice. By the way, these requests sum up to an amount that would lead a “worst offenders” listing.
Then there are students doing research. I’m not sure I want to waste my resources on requests from Moscow’s “Institute for System Programming RAS”, which fakes unnecessary loads of human traffic (from efrate.ispras.ru, narva.ispras.ru, dvina.ispras.ru …), for example.
When you analyze bot traffic following a tweet with many retweets, you’ll gather a way longer list of misbehaving bots. That’s because you’ll catch more 3rd party Twitter UIs when many Twitter users view their timeline. Not all Twitter apps route their short URI evaluation through their servers, so you might miss out on abusive requests coming from real users via client sided scripts.
Developers might argue that such requests “on behalf of the user” are neither abusive, nor count as bot traffic. I assure you, that’s crap, regardless a particular Twitter app’s architecture, when you count more than one evaluation request per (shortened) URI. For example Googlebot acts on behalf of search engine users too, but it doesn’t overload your server. It fetches each URI embedded in tweets only once. And yes, it processes all tweets out there.
How to do it the right way
Here is what a site owner can expect from a Twitter app’s Web robot:
A meaningful user agent
A Web robot must provide a user agent name that fulfills at least these requirements:
- A unique string that identifies the bot. The unique part of this string must not change when the version changes (”somebot/1.0″, “somebot/2.0″, …).
- A URI pointing to a page that explains what the bot is all about, names the owner, and tells how it can be blocked in robots.txt (like this or that).
- A hint on the rendering engine used, for example “Mozilla/5.0 (compatible; …”.
A method to verify the bot
All IP addresses used by a bot should resolve to server names having a unique pattern. For example Googlebot comes only from servers named
"crawl" + "-" + replace($IP, ".", "-") + ".googlebot.com", e.g. “crawl-66-249-71-135.googlebot.com”. All major search engines follow this standard that enables crawler detection not solely relying on the easily spoofable user agent name.
Obeying the robots.txt standard
Webmasters must be able to steer a bot with crawler directives in robots.txt like “Disallow:”. A Web robot should fetch a site’s /robots.txt file before it launches a request for content, when it doesn’t have a cached version from the same day.
Obeying REP indexer directives
Indexer directives like “nofollow”, “noindex” et cetera must be obeyed. That goes for HEAD requests just chasing for a 301/302/307 redirect response code and a “location” header, too.
Indexer directives can be served in the HTTP response header with an X-Robots-Tag, and/or in META elements like the robots meta tag, as well as in LINK elements like rel=canonical and its corresponding headers.
As outlined above, requesting the same resources over and over doesn’t count as responsible behavior. Fetching or “HEAD’ing” a resource no more than once a day should suffice for every Twitter app’s needs.
Reprinting a page’s content, or just large quotes, doesn’t count as fair use. It’s Ok to grab the page title and a summary from a META element like “description” (or up to 250 characters from an article’s first paragraph) to craft links, for example - but not more! Also, showing images or embedding videos from the crawled page violates copyrights.
Conclusion, and call for action
If you suffer from rogue Twitter bot traffic, use the medium those bots live in to make their sins public knowledge. Identify the bogus bot’s owners and tweet the crap out of them. Lookup their hosting services, find the report-abuse form, and submit your complaints. Most of these apps make use of the Twitter-API, there are many spam report forms you can creatively use to ruin their reputation at Twitter. If you’ve an account at such a bogus Twitter app, then cancel it and encourage your friends to follow suit.
Don’t let the assclowns of the Twitter universe get away with theft!
I’d like to hear about particular offenders you’re dealing with, and your defense tactics as well, in the comments. Don’t be shy. Go rant away. Thanks in advance!
Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to Entries Comments All Comments