Archived posts from the 'Web development' Category

Cloaking is good for you. Just ignore Bing’s/Google’s guidelines.

Summary first: If you feel the need to cloak, just do it within reason. Don’t cloak because you can, but because it’s technically the most elegant procedure to accomplish a Web development task. Bing and Google can’t detect your (in no way deceptive) intend algorithmically. Don’t spam away, though, because you might leave trails besides cloaking alone, if you aren’t good enough at spamming search engines. Keep your users interests in mind. Don’t comply to search engine guidelines as set in stone, but to a reasonable level, for example when those force you to comply to Web standards that make more sense than the fancy idea you’ve developed on internationalization, based on detecting browser language settings or so.

search engine guidelines are bullshit WRT cloakingThis pamphlet is an opinion piece. The above said should be considered best practice, even by search engines. Of course it’s not, because search engines can and do fail, just like a webmaster who takes my statement “go cloak away if it makes sense” as technical advice and gets his search engine visibility tanked the hard way.

WTF is cloaking?

Cloaking, also known as IP delivery, means delivering content tailored for specific users who are identified primarily by their IP addresses, but also by user agent (browser, crawler, screen reader…) names, and whatnot. Here’s a simple demonstration of this technique. The content of the next paragraph differs depending on the user requesting this page. Googlebot, Googlers, as well as Matt Cutts at work, will read a personalized message:

Dear visitor, thanks for your visit from 38.107.191.88 (38.107.191.88).

You surely can imagine that cloaking opens a can of worms lots of opportunities to enhance a user’s surfing experience, besides “stalking” particular users like Google’s head of WebSpam.

Why do search engines dislike cloaking?

Apparently they don’t. They use IP delivery themselves. When you’re traveling in europe, you’ll get hints like “go to Google.fr” or “go to Google.at” all the time. That’s google.com checking where you are, trying to lure you into their regional services.

More seriously, there’s a so-called “dark side of cloaking”. Say you’re a seasoned Internet marketer, then you could show Googlebot an educational page with compelling content under an URI like “/games/poker” with an X-Robots-Tag HTTP header telling “noarchive”, whilst surfers (search engine users) supplying an HTTP_REFERER and not coming from employee.google.com get redirected to poker dot com (simplified example).

That’s hard to detect for Google’s WebSpam team. Because they don’t do evil themselves, they can’t officially operate sneaky bots that use for example AOL as their ISP to compare your spider fodder to pages/redirects served to actual users.

Bing sends out spam bots that request your pages “as a surfer” in order to discover deceptive cloaking. Of course those bots can be identified, so professional spammers serve them their spider fodder. Besides burning the bandwidth of non-cloaking sites, Bing doesn’t accomplish anything useful in terms of search quality.

Because search engines can’t detect cloaking properly, not to speak of a cloaking webmaster’s intentions, they’ve launched webmaster guidelines (FUD) that forbid cloaking at all. All Google/Bing reps tell you that cloaking is an evil black hat tactic that will get your site penalized or even banned. By the way, the same goes for perfectly legit “hidden content” that’s invisible on page load, but viewable after a mouse click on a “learn more” widget/link or so.

Bullshit.

If your competitor makes creative use of IP delivery to enhance their visitors’ surfing experience, you can file a spam report for cloaking and Google/Bing will ban the site eventually. Just because cloaking can be used with deceptive intent. And yes, it works this way. See below.

Actually, those spam reports trigger a review by a human, so maybe your competitor gets away with it. But search engines also use spam reports to develop spam filters that penalize crawled pages totally automatted. Such filters can fail, and –trust me– they do fail often. Once you must optimize your content delivery for particular users or user groups yourself, such a filter could tank your very own stuff by accident. So don’t snitch on your competitors, because tomorrow they’ll return the favor.

Enforcing a “do not cloak” policy is evil

At least Google’s WebSpam team comes with cojones. They’ve even banned their very own help pages for “cloaking“, although those didn’t serve porn to minors searching for SpongeBob images with safe-search=on.

That’s overdrawn, because the help files of any Google product aren’t usable without a search facility. When I click “help” in any Google service like AdWords, I get either blank pages, and/or links within the help system are broken because the destination pages were deindexed for cloaking. Plain evil, and counter productive.

Just because Google’s help software doesn’t show ads and related links to Googlebot, those pages aren’t guilty of deceptive cloaking. Ms Googlebot won’t pull the plastic, so it makes no sense to serve her advertisements. Related links are context sensitive just like ads, so it makes no sense to persist them in Google’s crawling cache, or even in Google’s search index. Also, as a user I really don’t care whether Google has crawled the same heading I see on a help page or not, as long as I get directed to relevant content, that is a paragraph or more that answers my question.

When a search engine doesn’t deliver the very best search results intentionally, just because those pages violate an outdated and utterly useless policy that rules fraudulent tactics in a shape lastly used in the last century and doesn’t take into account how the Internet works today, I’m pissed.

Maybe that’s not bad at all when applied to Google products? Bullshit, again. The same happens to any other website that doesn’t fit Google’s weird idea of “serving the same content to users and crawlers”. I mean, as long as Google’s crawlers come from US IPs only, how can a US based webmaster serve the same content in German language to a user coming from Austria and Googlebot, both requesting a URI like “/shipping-costs?lang=de” that has to be different for each user because shipping a parcel to Germany costs $30.00 and a parcel of the same weight shipped to Vienna costs $40.00? Don’t tell me bothering a user with shipping fees for all regions in CH/AT/DE all on one page is a good idea, when I can reduce the information overflow to a tailored info of just one shipping fee that my user expects to see, followed by a link to a page that lists shipping costs for all european countries, or all countries where at least some folks might speak/understand German.

Back to Google’s ban of its very own help pages that hid AdSense code from Googlebot. Of course Google wants to see what surfers see in order to deliver relevant search results, and that might include advertisements. However, surrounding ads don’t necessarily obfuscate the page’s content. Ads served instead of content do. So when Google wants to detect ad laden thin pages, they need to become smarter. Penalizing pages that don’t show ads to search engine crawlers is a bad idea for a search engine, because not showing ads to crawlers is a good idea, not only bandwidth-wise, for a webmaster.

Managing this dichotomy is the search engine’s job. They shouldn’t expect webmasters to help them solving their very own problems (maintaining search quality). In fact, bothering webmasters with policies solely put because search engine algos are fallible and incapable is plain evil. The same applies to instruments like rel-nofollow (launched to help Google devaluing spammy links but backfiring enormously) or Google’s war on paid links (as if not each and every link on the whole Internet is paid/bartered for, somehow).

What do you think, should search engines ditch their way too restrictive “don’t cloak” policies? Click to vote: Stop search engines that tyrannize webmasters!

 

Update 2010-07-06: Don’t miss out on Danny Sullivan’s “Google be fair!” appeal, posted today: Why Google Should Ban Its Own Help Pages — But Also Shouldn’t



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Get yourself a smart robots.txt

greedy and aggressive web robots steal your contentCrawlers and other Web robots are the plague of today’s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don’t. Behaving or not, most bots just steal your content. You don’t appreciate that, so block them.

This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines –that bring you nice traffic– while keeping the nasty (or useless, traffic-wise) bots out of the game.

The basic idea is that blocking all bots –with very few exceptions– makes more sense than maintaining kinda Web robots who’s who in your robots.txt file. You decide whether a bot, respectively the service it crawls for, does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold Disallow: /.

Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.

Even bots that respect the REP can harm your search engine rankings, or reveal sensitive information to your competitors. Every minute a new bots turns up. You can’t manage all of them, and you can’t trust any (behaving) bot. Or, as the master of bot control explains: “That’s the only thing I’m concerned with: what do I get in return. If it’s nothing, it’s blocked“.

Also, large robots.txt files handling tons of bots are fault prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If you on the other hand verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don’t enlighten your competitors.

If you’re a smart webmaster agreeing with this approach, here’s your ToDo-List:
• Grab the code
• Install
• Customize
• Test
• Implement.
On error read further.

The anatomy of a smart robots.txt

Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.

Install

First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:
<FilesMatch ^robots\.txt$>
SetHandler application/x-httpd-php
</FilesMatch>

As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.

Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!

For testing purposes you can use the logRequest() function. Probably it’s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don’t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.

Customize

Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write a isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, respectively /robots.txt once you’ve launched your smart robots.txt.

Let’s look at /smart_robots.txt. First of all, it sets the canonical server name, change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!) it includes /smart_robots_inc.php.

Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served`only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.

As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can login to GWC and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.

When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.

Next up are the search engine specific crawler directives. You put them as
if (isGooglebot()) {
$content .= "
User-agent: Googlebot
Disallow:

\n\n";
}

If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)

Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output
$content .= "User-agent: *
Disallow: /
\n\n";

Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.

Before $content is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte –for example leading whitespaces, a debug message, or a #comment– before the print $content; statement.

Blocking rogue bots is important. If you discover a rogue bot –for example a scraper that pretends to be Googlebot– during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!

Bear in mind that each and every piece of content served from your site should implement rogue bot detection, that’s doable even with non-HTML resources like images or PDFs.

Finally we deliver the user agent specific robots.txt and terminate the connection.

Now let’s look at /smart_robots_inc.php. Don’t fuck-up the variable definitions and routines that populate them or deal with the requestor’s IP addy.

Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).

/smart_robots_inc.php comes with functions that detect Googlebot, MSNbot, and Slurp.

Most search engines tell how you can verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example to add Yandex, test the host name for a leading “spider” and trailing “.yandex.ru” string and inbetween an integer, like in the isSlurp() function.

Test

Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits, you can use the “Fetch as Googlebot” thingy in your webmaster console.

Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “.search.live.com” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “inktomisearch.com” host name.

Don’t rely on my crawler detection routines. They’re dumped from memory in a hurry, I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.

Launch

Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.

The output of a smart robots.txt

When you download a smart robots.txt with your browser, wget, or any other tool that comes with user agent spoofing, you’ll see a 403 or something like:


HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain;charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: *
Disallow: /
(the contact form URI above doesn’t exist)

whilst a real search engine crawler like Googlebot gets slightly different contents:


HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: Googlebot
Allow: /
Disallow:

Sitemap: http://sebastians-pamphlets.com/sitemap.xml

User-agent: *
Disallow: /

Search engines hide important information from webmasters

Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked Google doesn’t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection webmasters do need such information.

I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the forseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even announce properly that they’ve changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.

Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.

 

Enjoy.

 

Changelog:

2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

SEO Bullshit: Mimicking a file system in URIs

file system like URIsWay back in the WWW’s early Jurassic, micro computer based Web development tools sneakily begun poisoning the formerly ideal world of the Internet. All of a sudden we saw ‘.htm’ URIs, because CP/M and later on PC-DOS file extensions were limited to 3 characters. Truncating the ‘language’ part of HTML was bad enough. Actually, fucking with well established naming conventions wasn’t just a malady, but a symptom of a worse world wide pandemic.

Unfortunately, in order to bring Web publishing to the mere mortals (folks who could afford a micro computer), software developers invented DOS-like restrictions the Web wasn’t designed for. Web design tools maintained files on DOS file systems. FTP clients managed to convert backslashes originating from DOS file systems to slashes on UNIX servers, and vice versa (long before NT 3.51 and IIS). Directory names / file names equalled URIs. Most Web sites were static.

None of those cheap but fancy PC based Web design tools came with a mapping of objects (locally stored as files back then) to URIs pointing to Web resources. Despite Tim Berners-Lee’s warnings (likeIt is the the duty of a Webmaster to allocate URIs which you will be able to stand by in 2 years, in 20 years, in 200 years. This needs thought, and organization, and commitment.“). The technology used to create a resource named its unique identifier (URI). That’s as absurd as wearing diapers a whole live long.

Newbie Web designers grew up with this flawed concept, and never bothered to research the Web’s fundamentals. In their limited view of the Web, a URI was a mirrored version of a file name and its location on their local machine, and everything served from /cgi-bin/ had to be blocked in robots.txt, because all dynamic stuff was evil.

Today, those former newbies consider themselves oldtimers. Actually, they’re still greenhorns, because they’ve never learned that URIs have nothing to do with files, directories, or a Web resources’s (current) underlying technology (as in .php3 for PHP version 3.x, .shtml for SSI, …).

Technology evolves, even changes, but (valuable) contents tend to stay. URIs should solely address a piece of content, they must not change when the technology used to serve those contents changes. That means strings like ‘.html’ or folder names must not be used in URIs.

Many of those notorious greenhorns offer their equally ignorant clients Web development and SEO services today. They might have managed to handle dynamic contents by now (thanks to osCommerce, WordPress and other CMSs), but they’re still stuck with ancient paradigms that were never meant to exist on the Internet.

They might have discovered that search engines are capable of crawling and indexing dynamic contents (URIs with query strings) nowadays, but they still treat them as dumb bots — as if Googlebot or Slurp weren’t more sophisticated than Altavista’s Scooter of 1998.

They might even develop trendy crap (version 2.0 with nifty rounded corners) today, but they still don’t get IT. Whatever IT is, it doesn’t deserve an URI like /category/vendor/product/color/size/crap.htm.

Why hierarchical URIs (expressing breadcrums or whatnot) are utter crap (SEO-wise as well as from a developer’s POV) is explained here:

SEO Toxin

 

SEO BullshitI’ve published my rant “Directory-Like URI Structures Are SEO Bullshit” on SEO Bullshit dot com for a reason.

You should keep an eye on this new blog. Subscribe to its RSS feed. Watch its Twitter account.

If it’s about SEO and it’s there, it’s most probably bullshit. If it’s bullshit, avoid it.

If you plan to spam the SEO blogosphere with your half-assed newbie thoughts (especially when you’re an unconvinceable ‘oldtimer’), consider obeying this rule of thumb:

The top minus one reason to publish SEO stupidity is: You’ll end up here.

Of course that doesn’t mean newbies shouldn’t speak out. I’m just sick of newbies who sell their half-assed brain farts as SEO advice to anyone. Noobs should read, ask, listen, learn, practice, evolve. Until they become pros. As a plain Web developer, I can tell from my own experience that listening to SEO professionals is worth every minute of your time.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

How brain-amputated developers created the social media plague

The bot playground commonly refered to as “social media” is responsible for shitloads of absurd cretinism.

Twitter Bot PlaygroundFor example Twitter, where gazillions of bots [type A] follow other equally superfluous but nevertheless very busy bots [type B] that automatically generate 27% valuable content (links to penis enlargement tools) and 73% not exactly exciting girly chatter (breeding demand for cheap viagra).

Bazillions of other bots [type C] retweet bot [type B] generated crap and create lists of bots [type A, B, C]. In rare cases when a non-bot tries to participate in Twitter, the uber-bot [type T] prevents the whole bot network from negative impacts by serving a 503 error to the homunculus’ browser.

This pamphlet is about the idiocy of a particular subclass of bots [type S] that sneakily work in the underground stealing money from content producers, and about their criminal (though brain-dead) creators. May they catch the swine flu, or at least pox or cholera, for the pest they’ve brought to us.

The Twitter pest that costs you hard earned money

WTF I’m ranting about? The technically savvy reader, familiar with my attitude, has already figured out that I’ve read way too many raw logs. For the sake of a common denominator, I encourage you to perform a tiny real-world experiment:

  • Publish a great and linkworthy piece of content.
  • Tweet its URI (not shortened - message incl. URI ≤ 139 characters!) with a compelling call for action.
  • Watch your server logs.
  • Puke. Vomit increases with every retweet.

So what happens on your server? A greedy horde of bots pounces on every tweet containing a link, requesting its content. That’s because on Twitter all URIs are suspected to be shortened (learn why Twitter makes you eat shit). This uncalled-for –IOW abusive– bot traffic burns your resources, and (with a cheap hosting plan) it can hinder your followers to read your awesome article and prevent them from clicking on your carefully selected ads.

Those crappy bots not only cost you money because they keep your server busy and increase your bandwidth bill, they actively decrease your advertising revenue because your visitors hit the back button when your page isn’t responsive due to the heavy bot traffic. Even if you’ve great hosting, you probably don’t want to burn money, not even pennies, right?

Bogus Twitter apps and their modus operandi

If only every Twitter&Crap-mashup would lookup each URI once, that wouldn’t be such a mess. Actually, some of these crappy bots request your stuff 10+ times per tweet, and again for each and every retweet. That means, as more popular your content becomes, as more bot traffic it attracts.

Most of these bots don’t obey robots.txt, that means you can’t even block them applying Web standards (learn how to block rogue bots). Topsy, for example, does respect the content producer — so morons using “Python-urllib/1.17″ or “AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)” could obey the Robots Exclusion Protocol (REP), too. Their developers are just too fucking lazy to understand such protocols that every respected service on the Web (search engines…) obeys.

Some of these bots even provide an HTTP_REFERER to lure you into viewing the website operated by their shithead of developer when you’re viewing your referrer stats. Others fake Web browsers in their user agent string, just in case you’re not smart enough to smell shit that really stinks (IOW browser-like requests that don’t fetch images, CSS files, and so on).

One of the worst offenders is outing itself as “ThingFetcher” in the user agent string. It’s hosted by Rackspace, which is a hosting service that obviously doesn’t care much about its reputation. Otherwise these guys would have reacted to my various complaints WRT “ThingFetcher”. By the way, Robert Scoble represents Rackspace, you could drop him a line if ThingFetcher annoys you, too.

ThingFetcher sometimes requests a (shortened) URI 30 times per second, from different IPs. It can get worse when a URI gets retweeted often. This malicious piece of code doesn’t obey robots.txt, and doesn’t cache results. Also, it’s too dumb to follow chained redirects, by the way. It doesn’t even publish its results anywhere, at least I couldn’t find the fancy URIs I’ve feeded it with in Google’s search index.

In ThingFetcher’s defense, its developer might say that it performs only HEAD requests. Well, it’s true that HEAD request provoke only an HTTP response header. But: the script invoked gets completely processed, just the output is trashed.

That means, the Web server has to deal with the same load as with a GET request, it just deletes the content portion (the compelety formatted HTML page) when responding, after counting its size to send the Content-Length response header. Do you really believe that I don’t care about machine time? For each of your utterly useless bogus requests I could have my server deliver ads to a human visitor, who pulls the plastic if I’m upselling the right way (I do, usually).

Unfortunately, ThingFetcher is not the only bot that does a lookup for each URI embedded in a tweet, per tweet processed. Probably the overall number of URIs that appear only once is bigger than the number of URIs that appear quite often while a retweet campaign lasts. That means that doing HTTP requests is cheaper for the bot’s owner, but on the other hand that’s way more expensive for the content producer, and the URI shortening services involved as well.

ThingFetcher update: The owners of ThingFetcher are now aware of the problem, and will try to fix it asap (more information). Now that I know who’s operating the Twitter app owning ThingFetcher, I take back the insults above I’ve removed some insults from above, because they’d no longer address an anonymous developer, but bright folks who’ve just failed once. Too sad that Brizzly didn’t reply earlier to my attempts to identify ThingFetcher’s owner.

As a content producer I don’t care about the costs of any Twitter application that processes Tweets to deliver anything to its users. I care about my costs, and I can perfecly live without such a crappy service. Liberally, I can allow one single access per (shortened) URI to figure out its final destination, but I can’t tolerate such thoughtless abuse of my resources.

Every Twitter related “service” that does multiple requests per (shortened) URI embedded in a tweet is guilty of theft and pilferage. Actually, that’s an understatement, because these raids cost publishers an enormous sum across the Web.

These fancy apps shall maintain a database table storing the destination of each redirect (chain) acessible by its short URI. Or leave the Web, respectively pay the publishers. And by the way, Twitter should finally end URI shortening. Not only it breaks the Internet, it’s way too expensive for all of us.

A few more bots that need a revamp, or at least minor tweaks

I’ve added this section to express that besides my prominent example above, there’s more than one Twitter related app running not exactly squeaky clean bots. That’s not a “worst offenders” list, it’s not complete (I don’t want to reprint Twitter’s yellow pages), and bots are listed in no particular order (compiled from requests following the link in a test tweet, evaluating only a snapshot of less than 5 minutes, backed by historized logs.)

Skip examples

Tweetmeme’s TweetmemeBot coming from eagle.favsys.net doesn’t fetch robots.txt. On their site they don’t explain why they don’t respect the robots exclusion protocol (REP). Apart from that it behaves.

OneRiot’s bot OneRiot/1.0 totally proves that this real time search engine has chosen a great name for itself. Performing 5+ GET as well as HEAD requests per link in a tweet (sometimes more) certainly counts as rioting. Requests for content come from different IPs, the host name pattern is flx1-ppp*.lvdi.net, e.g. flx1-ppp47.lvdi.net. From the same IPs comes another bot: Me.dium/1.0, me.dium.com redirects to oneriot.com. OneRiot doesn’t respect the REP.

Microsoft/Bing runs abusive bots following links in tweets, too. They fake browsers in the user agent, make use of IPs that don’t obviously point to Microsoft (no host name, e.g. 65.52.19.122, 70.37.70.228 …), send multiple GET requests per processed tweet, and don’t respect the REP. If you need more information, I’ve ranted about deceptive M$-bots before. Just a remark in case you’re going to block abusive MSN bot traffic:

MSN/Bing reps ask you not to block their spam bots when you’d like to stay included in their search index (that goes for real time search, too), but who really wants that? Their search index is tiny –compared to other search engines like Yahoo and Google–, their discovery crawling sucks –to get indexed you need to submit your URIs at their webmaster forum–, and in most niches you can count your yearly Bing SERP referrers using not even all fingers of your right hand. If your stats show more than that, check your raw logs. You’ll soon figure out that MSN/Bing spam bots fake SERP traffic in the HTTP_REFERER (guess where their “impressive” market share comes from).

FriendFeed’s bot FriendFeedBot/0.1 is well explained, and behaves. Its bot page even lists all its IPs, and provides you with an email addy for complaints (I never had a reason to use it). The FriendFeedBot made it on this list just because of its lack of REP support.

PostRank’s bot PostRank/2.0 comes from Amazon IPs. It doesn’t respect the REP, and does more than one request per URI found in one single tweet.

MarkMonitor operates a bot faking browser requests, coming from *.embarqhsd.net (va-71-53-201-211.dhcp.embarqhsd.net, va-67-233-115-66.dhcp.embarqhsd.net, …). Multiple requests per URI, no REP support.

Cuil’s bot provides an empty user agent name when following links in tweets, but fetches robots.txt like Cuil’s offical crawler Twiceler. I didn’t bother to test whether this Twitter bot can be blocked following Cuil’s instructions for webmasters or not. It got included in this list for the supressed user agent.

Twingly’s bot Twingly Recon coming from *.serverhotell.net doesn’t respect the REP, doesn’t name its owner, but does only few HEAD requests.

Many bots mimicking browsers come from Amazon, Rackspace, and other cloudy environments, so you can’t get hold of their owners without submitting a report-abuse form. You can identify such bots by sorting your access logs by IP addy. Those “browsers” which don’t request your images, CSS files, and so on, are most certainly bots. Of course, a human visitor having cached your images and CSS matches this pattern, too. So block only IPs that solely request your HTML output over a longer period of time (problematic with bots using DSL providers, AOL, …).

Blocking requests (with IPs belonging to consumer ISPs, or from Amazon and other dynamic hosting environments) with a user agent name like “LWP::Simple/5.808″, “PycURL/7.18.2″, “my6sense/1.0″, “Firefox” (just these 7 characters), “Java/1.6.0_16″ or “libwww-perl/5.816″ is sound advice. By the way, these requests sum up to an amount that would lead a “worst offenders” listing.

Then there are students doing research. I’m not sure I want to waste my resources on requests from Moscow’s “Institute for System Programming RAS”, which fakes unnecessary loads of human traffic (from efrate.ispras.ru, narva.ispras.ru, dvina.ispras.ru …), for example.

When you analyze bot traffic following a tweet with many retweets, you’ll gather a way longer list of misbehaving bots. That’s because you’ll catch more 3rd party Twitter UIs when many Twitter users view their timeline. Not all Twitter apps route their short URI evaluation through their servers, so you might miss out on abusive requests coming from real users via client sided scripts.

Developers might argue that such requests “on behalf of the user” are neither abusive, nor count as bot traffic. I assure you, that’s crap, regardless a particular Twitter app’s architecture, when you count more than one evaluation request per (shortened) URI. For example Googlebot acts on behalf of search engine users too, but it doesn’t overload your server. It fetches each URI embedded in tweets only once. And yes, it processes all tweets out there.

How to do it the right way

Here is what a site owner can expect from a Twitter app’s Web robot:

A meaningful user agent

A Web robot must provide a user agent name that fulfills at least these requirements:

  • A unique string that identifies the bot. The unique part of this string must not change when the version changes (”somebot/1.0″, “somebot/2.0″, …).
  • A URI pointing to a page that explains what the bot is all about, names the owner, and tells how it can be blocked in robots.txt (like this or that).
  • A hint on the rendering engine used, for example “Mozilla/5.0 (compatible; …”.

A method to verify the bot

All IP addresses used by a bot should resolve to server names having a unique pattern. For example Googlebot comes only from servers named "crawl" + "-" + replace($IP, ".", "-") + ".googlebot.com", e.g. “crawl-66-249-71-135.googlebot.com”. All major search engines follow this standard that enables crawler detection not solely relying on the easily spoofable user agent name.

Obeying the robots.txt standard

Webmasters must be able to steer a bot with crawler directives in robots.txt like “Disallow:”. A Web robot should fetch a site’s /robots.txt file before it launches a request for content, when it doesn’t have a cached version from the same day.

Obeying REP indexer directives

Indexer directives like “nofollow”, “noindex” et cetera must be obeyed. That goes for HEAD requests just chasing for a 301/302/307 redirect response code and a “location” header, too.

Indexer directives can be served in the HTTP response header with an X-Robots-Tag, and/or in META elements like the robots meta tag, as well as in LINK elements like rel=canonical and its corresponding headers.

Responsible behavior

As outlined above, requesting the same resources over and over doesn’t count as responsible behavior. Fetching or “HEAD’ing” a resource no more than once a day should suffice for every Twitter app’s needs.

Reprinting a page’s content, or just large quotes, doesn’t count as fair use. It’s Ok to grab the page title and a summary from a META element like “description” (or up to 250 characters from an article’s first paragraph) to craft links, for example - but not more! Also, showing images or embedding videos from the crawled page violates copyrights.

Conclusion, and call for action

If you suffer from rogue Twitter bot traffic, use the medium those bots live in to make their sins public knowledge. Identify the bogus bot’s owners and tweet the crap out of them. Lookup their hosting services, find the report-abuse form, and submit your complaints. Most of these apps make use of the Twitter-API, there are many spam report forms you can creatively use to ruin their reputation at Twitter. If you’ve an account at such a bogus Twitter app, then cancel it and encourage your friends to follow suit.

Don’t let the assclowns of the Twitter universe get away with theft!

I’d like to hear about particular offenders you’re dealing with, and your defense tactics as well, in the comments. Don’t be shy. Go rant away. Thanks in advance!



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Sanitize links in your content feeds

Don't bother your visitors with broken links in feedsHere’s a WordPress plug-in that sanizites relative links and on-the-page links in your content feeds: feedLinkSanitizer. Why do you need it?

Because you end up with invalid links like http://feeds.sebastians-pamphlets.com/SebastiansPamphlets#tos if you don’t use it. Once the post phases out of the main page, the link points to nowhere in feedreaders and reprints.

In feeds, absolute links are mandatory. Make sure that not a single on-the-page link or relative link slips out of your site.

Relative links

When you put all links to your own stuff as /perma-link/ instead of http://your-blog/permalink/ you can serve your blog’s content from a different server / base URI (dev, move, …) without editing all internal links.

The downside is, that for various very good reasons (scrapers, search engines, whatnot) thou must not have relative links in your HTML. You might disagree, but read on.

The simple solution is: store relative links in your WordPress database, but output absolute links. Follow the hint in feedLinkSanitizer.txt to activate link sanitizing in your HTML. By default it changes only feed contents.

The plug-in changes /perma-link/ to http://example.com/perma-link/ in your posts, using the blog URI provided in your WordPress settings. It takes the current server name if this value is missing.

Fragment links

You can link to any DOM-ID in an HTML page, for example <a href="#tos">Table of contents</a> where ‘tos’ is the DOM-ID of an HTML element like <h2 id="tos">Table of contents</h2>. These on-the-page links even come with some SEO value, just in case you don’t care much about usability.

The plug-in changes #tos to http://example.com/perma-link/#tos in your posts. If you’ve set $sanitizeAllLinks = TRUE; in the plugin-code, an on-the-page link clicked on the blog’s main page will open the post, positioning to the DOM-ID.

Download feedLinkSanitizer

I’m a launch-early kind of guy, so test it yourself. And: Use at your own risk. No warranty expressed or implied is provided.

If you use another CMS, download the plug-in anyway and steal adapt its code.

Credits for previous work go to Jon Thysell and Gerd Riesselmann.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

How to cleverly integrate your own URI shortener

This pamphlet is somewhat geeky. Don’t necessarily understand it as a part of my ongoing jihad holy war on URI shorteners.

Clever implementation of an URL shortenerAssuming you’re slightly familiar with my opinions, you already know that third party URI shorteners (aka URL shorteners) are downright evil. You don’t want to make use of unholy crap, so you need to roll your own. Here’s how you can (could) integrate a URI shortener into your site’s architecture.

Please note that my design suggestions ain’t black nor white. Your site’s architecture may require a different approach. Adapt my tips with care, or use my thoughts to rethink your architectural decisions, if they’re applicable.

At the first sight, searching for a free URI shortener script to implement it on a dedicated domain looks like a pretty simple solution. It’s not. At least not in most cases. Standalone URI shorteners work fine when you want to shorten mostly foreign URIs, but that’s a crappy approach when you want to submit your own stuff to social media. Why? Because you throw away the ability to totally control your traffic from social media, and search engine traffic generated by social media as well.

So if you’re not running cheap-student-loans-with-debt-consolidation-on-each-payday-is-a-must-have-for-sexual-heroes-desperate-for-a-viagra-overdose-and-extreme-penis-length-enhancement.info and your domain’s name without the “www” prefix plus a few characters gives URIs of 20 (30) characters or less, you don’t need a short domain name to host your shortened URIs.

As a side note, when you’re shortening your URIs for Twitter you should know that shortened URIs aren’t mandatory any more. If your message doesn’t exceed 139 characters, you don’t need to shorten embedded URIs.

By integrating a URI shortener into your site architecture you gain the abilitiy to perform way more than URI shortening. For example, you can transform your longish and ugly dynamic URIs into short (but keyword rich) URIs, and more.

In the following I’ll walk you step by step through (not really) everything an incoming HTTP request might face. Of course the sequence of steps is a generalization, so perhaps you’ll have to change it to fit your needs. For example when you operate a WordPress blog, you could code nearly everthing below in your 404 page (consider alternatives). Actually, handling short URIs in your error handler is a pretty good idea when you suffer from a mainstream CMS.

Table of contents

To provide enough context to get the advantages of a fully integrated URI shortener, vs. the stand-alone variant, I’ll bore you with a ton of dull and totally unrelated stuff:

Introduction

There’s a bazillion of methods to handle HTTP requests. For the sake of this pamphlet I assume we’re dealing with a well structured site, hosted on Apache with mod_rewrite and PHP available. That allows us to handle each and every HTTP request dynamically with a PHP script. To accomplish that, upload an .htaccess file to the document root directory:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^80$
RewriteRule . /requestHandler.php [L]

Please note that the code above kinda disables the Web server’s error handling. If
/requestHandler.php
exists in the root directory, all ErrorDocument directives (except some 5xx) et cetera will be ignored. You need to take care of errors yourself.

/requestHandler.php (Warning: untested and simplified code snippets below)
/* Initialization */
$serverName = strtolower($_SERVER["SERVER_NAME"]);
$canonicalServerName = "sebastians-pamphlets.com";
$scheme = "http://";
$rootUri = $scheme .$canonicalServerName; /* if used w/o path add a
slash */
$rootPath = $_SERVER["DOCUMENT_ROOT"];
$includePath = $rootPath ."/src"; /* Customize that, maybe you've to manipulate the file system path to your Web server's root */
$requestIp = $_SERVER["REMOTE_ADDR"];
$reverseIp = NULL;
$requestReferrer = $_SERVER["HTTP_REFERER"];
$requestUserAgent = $_SERVER["HTTP_USER_AGENT"];
$isRogueBot = FALSE;
$isCrawler = NULL;
$requestUri = $_SERVER["REQUEST_URI"];
$absoluteUri = $scheme .$canonicalServerName .$requestUri;
$uriParts = parse_url($absoluteUri);
$requestScript = $PHP_SELF;
$httpResponseCode = NULL;

Block rogue bots

You don’t want to waste resources by serving your valuable content to useless bots. Here are a few ideas how to block rogue (crappy, not behaving, …) Web robots. If you need a top-notch nasty-bot-handler please contact the authority in this field: IncrediBill.

While handling bots, you should detect search engine crawlers, too:

/* lookup your crawler IP database to populate $isCrawler; then, if the IP wasn't identified as search engine crawler: */
if ($isCrawler !== TRUE) {
$crawlerName = NULL;
$crawlerHost = NULL;
$crawlerServer = NULL;
if (stristr($requestUserAgent,"Baiduspider")) {$crawlerName = "Baiduspider"; $crawlerServer = ".crawl.baidu.com";}
...
if (stristr($requestUserAgent,"Googlebot")) {$crawlerName = "Googlebot"; $crawlerServer = ".googlebot.com"; }
if ($crawlerName != NULL) {
$reverseIp = @gethostbyaddr($requestIp);
if (!stristr($reverseIp,$crawlerServer)) {
$isCrawler = FALSE;
}
if ("$reverseIp" == "$requestIp") {
$isCrawler = FALSE;
}
if ($isCrawler !== FALSE;) {
$chkIpAddyRev = @gethostbyname($reverseIp);
if ("$chkIpAddyRev" == "$requestIp") {
$isCrawler = TRUE;
$crawlerHost = $reverseIp;
// store the newly discovered crawler IP
}
}
}
}

If Baidu doesn’t send you any traffic, it makes sense to block its crawler. This piece of crap doesn’t behave anyway.
if ($isCrawler &&
"$crawlerName" == "Baiduspider") {
$isRogueBot = TRUE;
}

Another SE candidate is Bing’s spam bot that tries to manipulate stats on search engine usage. If you don’t approve such scams, block incoming! from the IP address range 65.52.0.0 to 65.55.255.255 (131.107.0.0 to 131.107.255.255 …) when the referrer is a Bing SERP. With this method you occasionally might block searching Microsoft employees who aren’t aware of their company’s spammy activities, so make sure you serve them a friendly GFY page that explains the issue.

Other rogue bots identify themselves by IP addy, user agent, and/or referrer. For example some bots spam your referrer stats, just in case when viewing stats you’re in the mood to consume porn, consolidate your debt, or buy cheap viagra. Compile a list of NSAW keywords and run it against the HTTP_REFERER:
if (notSafeAtWork($requestReferrer)) {$isRogueBot = TRUE;}

If you operate a porn site you should refine this approach.

As for blocking requests by IP addy I’d recommend a spamIp database table to collect IP addresses belonging to rogue bots. Doing a @gethostbyaddr($requestIp) DNS lookup while processing HTTP requests is way too expensive (with regard to performance). Just read your raw logs and add IP addies of bogus requests to your black list.
if (isBlacklistedIp($requestIp)) {$isRogueBot = TRUE;}

You won’t believe how many rogue bots still out themselves by supplying you with a unique user agent string. Go search for [block user agent], then pick what fits your needs best from rougly two million search results. You should maintain a database table for ugly user agents, too. Or code
if (isBlacklistedUa($requestUserAgent) ||

stristr($requestUserAgent,”ThingFetcher”)) {$isRogueBot = TRUE;}

By the way, the owner of ThingFetcher really should stand up now. I’ve sent a complaint to Rackspace and I’ve blocked your misbehaving bot on various sites because it performs excessive loops requesting the same stuff over and over again, and doesn’t bother to check for robots.txt.

Finally, serve rogue bots what they deserve:
if ($isRogueBot === TRUE) {

header("HTTP/1.1 403 Go fuck yourself", TRUE, 403);
exit;
}

If you’re picky, you could make some fun out of these requests. For example, when the bot provides an HTTP_REFERER (the page you should click from your referrer stats), then just do a file_get_contents($requestReferrer); and serve the slutty bot its very own crap. Or just 301 redirect it to the referrer provided, to http://example.com/go-fuck-yourself, or something funny like a huge image gfy.jpeg.html on a freehost (not that such bots usually follow redirects). I’d go for the 403-GFY response.

Server name canonicalization

Although search engines have learned to deal with multiple URIs pointing to the same piece of content, sometimes their URI canonicalization routines do need your support. At least make sure you serve your content under one server name:
if (”$serverName” != “$canonicalServerName”) {
header(”HTTP/1.1 301 Please use the canonical URI”, TRUE, 301);
header(”Location: $absoluteUri”);
header(”X-Canonical-URI: $absoluteUri”); //
experimental
header("Link: <$absoluteUri>; rel=canonical"); // experimental
exit;
}

Subdomains are so 1999, also 2010 is the year of non-’.www’ URIs. Keep your server name clean, uncluttered, memorable, and remarkable. By the way, you can use, alter, rewrite … the code from this pamphlet as you like. However, you must not change the $canonicalServerName = "sebastians-pamphlets.com"; statement. I’ll appreciate the traffic. ;)

When the server name is Ok, you should add some basic URI canonicalization routines here. For example add trailing slashes –if necessary–, and remove clutter from query strings.

Sometimes even smart developers do evil things with your URIs. For example Yahoo truncates the trailing slash. And Google badly messes up your URIs for click tracking purposes. Here’s how you can ‘heal’ the latter issue on arrival (after all search engine crawlers have passed the cluttered URIs to their indexers :( ):
$testForUriClutter = $absoluteUri;
if (isset($_GET)) {
foreach ($_GET as $var => $crap) {
if ( stristr($var,”utm_”) ) {
$testForUriClutter = str_replace($testForUriClutter, “&$var=$crap”, “”);
$testForUriClutter = str_replace($testForUriClutter, “&amp;$var=$crap”, “”);

unset ($_GET[$var]);
}
}
$uriPartsSanitized = parse_url($testForUriClutter);
$qs = $uriPartsSanitized["query"];
$qs = str_replace($qs, "?", "");
if ("$qs" != $uriParts["query"]) {
$canonicalUri = $scheme .$canonicalServerName .$requestScript;
if (!empty($qs)) {
$canonicalUri .= "?" .$qs;
}
if (!empty($uriParts["fragment"])) {
$canonicalUri .= "#" .$uriParts["fragment"];
}
header("HTTP/1.1 301 URI messed up by Google", TRUE, 301);
header("Location: $canonicalUri");
exit;
}
}

By definition, heuristic checks barely scratch the surface. In many cases only the piece of code handling the content can catch malformed URIs that need canonicalization.

Also, there are many sources of malformed URIs. Sometimes a 3rd party screws a URI of yours (see below), but some are self-made.

Therefore I’d encapsulate URI canonicalization, logging pairs of bad/good URIs with referrer, script name, counter, and a lastUpdate-timestamp. Of course plain vanilla stuff like stripped www prefixes don’t need a log entry.


Before you’re going to serve your content, do a lookup in your shortUri table. If the requested URI is a shortened URI pointing to your own stuff, don’t perform a redirect but serve the content under the shortened URI.

Deliver static stuff (images …)

Usually your Web server checks whether a file exists or not, and sends the matching Content-type header when serving static files. Since we’ve bypassed this functionality, do it yourself:
if (empty($uriParts[”query”])) && empty($uriParts[”fragment”])) && file_exists(”$rootPath$requestUri”)) {
header(”Content-type: ” .getContentType(”$rootPath$requestUri”), TRUE);
readfile(”$rootPath$requestUri”);
exit;
}
/* getContentType($filename) returns a
MIME media type like 'image/jpeg', 'image/gif', 'image/png', 'application/pdf', 'text/plain' ... but never an empty string */

If your dynamic stuff mimicks static files for some reason, and those files do exist, make sure you don’t handle them here.

Some files should pretend to be static, for example /robots.txt. Making use of variables like $isCrawler, $crawlerName, etc., you can use your smart robots.txt to maintain your crawler-IP database and more.

Execute script (dynamic URI)

Say you’ve a WP blog in /blog/, then you can invoke WordPress with
if (substring($requestUri, 0, 6) == “/blog/”) {
require(”$rootPath/blog/index.php”);
exit;
}

(Perhaps the WP configuration needs a tweak to make this work.) There’s a downside, though. Passing control to WordPress disables the centralized error handling and everything else below.

Fortunately, when WordPress calls the 404 page (wp-content/themes/yourtheme/404.php), it hasn’t sent any output or headers yet. That means you can include the procedures discussed below in WP’s 404.php:
$httpResponseCode = “404″;
$errSrc = “WordPress”;
$errMsg = “The blog couldn’t make sense out of this request.”;
require(”$includePath/err.php”);
exit;

Like in my WordPress example, you’ll find a way to call your scripts so that they don’t need to bother with error handling themselves. Of course you need to modularize the request handler for this purpose.

Resolve shortened URI

If you’re shortening your very own URIs, then you should lookup the shortUri table for a matching $requestUri before you process static stuff and scripts. Extract the real URI belonging to your site and serve the content instead of performing a redirect.

Excursus: URI shortener components

Using the hints below you should be able to code your own URI shortener. You don’t need all the balls and whistles (like stats) overloading most scripts available on the Web.

  • A database table with at least these attributes:

    • shortUri.suriId, bigint, primary key, populated from a sequence (auto-increment)
    • shortUri.suriUri, text, indexed, stores the original URI
    • shortUri.suriShortcut, varchar, unique index, stores the shortcut (not the full short URI!)

    Storing page titles and content (snippets) makes sense, but isn’t mandatory. For outputs like “recently shortened URIs” you need a timestamp attribute.

  • A method to create a shortened URI.
    Make that an independent script callable from a Web form’s server procedure, via Ajax, SOAP, etc.

    Without a given shortcut, use the primary key to create one. base_convert(intval($suriId), 10, 36); converts an integer into a short string. If you can’t do that in a database insert/create trigger procedure, retrieve the primary key’s value with LAST_INSERT_ID() or so and perform an update.

    URI shortening is bad enough, hence it makes no sense to maintain more than one short URI per original URI. Your create short URI method should return a previously created shortcut then.

    If you’re storing titles and such stuff grabbed from the destination page, don’t fetch the destination page on create. Better do that when you actually need this information, or run a cron job for this purpose.

    With the shortcut returned build the short URI on-the-fly $shortUri = getBaseUri() ."/" .$suriShortcut; (so you can use your URI shortener across all your sites).

  • A method to retrieve the original URI.
    Remove the leading slash (and other ballast like a useless query string/fragment) from REQUEST_URI and pull the shortUri record identified by suriShortcut.

    Bear in mind that shortened URIs spread via social media do get abused. A shortcut like ‘xxyyzz’ can appear as ‘xxyyz..’, ‘xxy’, and so on. So if the path component of a REQUEST_URI somehow looks like a shortened URI, you should try a broader query. If it returns one single result, use it. Otherwise display an error page with suggestions.

  • A Web form to create and edit shortened URIs.
    Preferably protected in a site admin area. At least for your own URIs you should use somewhat meaningful shortcuts, so make suriShortcut an input field.
  • If you want to use your URI shortener with a Twitter client, then build an API.
  • If you need particular stats for your short URIs pointing to foreign sites that your analytics package can’t deliver, then store those click data separately.
    // end excursus

If REQUEST_URI contains a valid shortcut belonging to a foreign server, then do a 301 redirect.
$suriUri = resolveShortUri($requestUri);
if ($suriUri === FALSE) {
$httpResponseCode = “404″;
$errSrc = “sUri”;
$errMsg = “Invalid short URI. Shortcut resolves to more than one result.”;
require(”$includePath/err.php”);
exit;
}
if (!empty($suriUri))
if (!stristr($suriUri, $canonicalServerName)) {
header(”HTTP/1.1 301 Here you go”, TRUE, 301);
header(”Location: $suriUri”);
exit;
}
}

Otherwise ($suriUri is yours) deliver your content without redirecting.

Redirect to destination (invalid request)

From reading your raw logs (404 stats don’t cover 302-Found crap) you’ll learn that some of your resources get persistently requested with invalid URIs. This happens when someone links to you with a messed up URI. It doesn’t make sense to show visitors following such a link your 404 page.

Most screwed URIs are unique in a way that they still ‘address’ one particular resource on your server. You should maintain a mapping table for all identified screwed URIs, pointing to the canonical URI. When you can identify a resouce from a lookup in this mapping table, then do a 301 redirect to the canonical URI.

When you feature a “product of the week”, “hottest blog post”, “today’s joke” or so, then bookmarkers will love it when its URI doesn’t change. For such transient URIs do a 307 redirect to the currently featured page. Don’t fear non-existing ‘duplicate content penalties’. Search engines are smart enough to figure out your intention. Even if the transient URI outranks the original page for a while, you’ll still get the SERP traffic you deserve.

Guess destination (invalid request)

For many screwed URIs you can identify the canonical URI on-the-fly. REQUEST_URI and HTTP_REFERER provide lots of hints, for example keywords from SERPs or fragments of existing URIs.

Once you’ve identified the destination, do a 307 redirect and log both REQUEST_URI and guessed destination URI for a later review. Use these logs to update your screwed URIs mapping table (see above).

When you can’t identify the destination free of doubt, and the visitor comes from a search engine, extract the search query from the HTTP_REFERER and pass it to your site search facility (strip operators like site: and inurl:). Log these requests as invalid, too, and update your mapping table.

Serve a useful error page

Following the suggestions above, you got rid of most reasons to actually show the visitor an error page. However, make your 404 page useful. For example don’t bounce out your visitor with a prominent error message in 24pt or so. Of course you should mention that an error has occured, but your error page’s prominent message should consist of hints how the visitor can reach the estimated contents.

A central error page gets invoked from various scripts. Unfortunately, err.php can’t be sure that none of these scripts has outputted something to the user. With a previous output of just one single byte you can’t send an HTTP response header. Hence prefix the header() statement with a ‘@’ to supress PHP error messages, and catch and log errors.

Before you output your wonderful error page, send a 404 header:
if ($httpResponseCode == NULL) {
$httpResponseCode = “404″;
}
if (empty($httpResponseCode)) {
$httpResponseCode = “501″; // log for debugging
}
@header(”HTTP/1.1 $httpResponseCode Shit happens”, TRUE, intval($httpResponseCode));
logHeaderErr(error_get_last());

In rare cases you better send a 410-Gone header, for example when Matt’s team has discovered a shitload of questionable pages and you’ve filed a reconsideration request.

In general, do avoid 404/410 responses. Every URI indexed anywhere is an asset. Closely watch your 404 stats and try to map these requests to related content on your site.

Use possible input ($errSrc, $errMsg, …) from the caller to customize the error page. Without meaningful input, deliver a generic error page. A search for [* 404 page *] might inspire you (WordPress users click here).


All errors are mine. In other words, be careful when you grab my untested code examples. It’s all dumped from memory without further thoughts and didn’t face a syntax checker.

I consider this pamphlet kinda draft of a concept, not a design pattern or tutorial. It was fun to write, so go get the best out of it. I’d be happy to discuss your thoughts in the comments. Thanks for your time.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

URI canonicalization with an X-Canonical-URI HTTP header

X-Canonical-URI HTTO HeaderDear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I’m desperately in need of a good reason to praise your wisdom and whatnot. From this year’s x-mas wish list:

All search engines obey the X-Canonical-URI HTTP header

The rel=canonical link element is a great tool, at least if applied properly, but sometimes it’s a royal pain in the ass.

Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there’s no way to get a hand on the source code to change them.

Also, lots of resources can’t be stuffed with HTML’s link elements, for example dynamically created PDFs, plain text files, or images.

It’s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren’t a doable option, for example when the destination URI is #5 in a redirect chain that can’t get shortened because the redirects are performed by a 3rd party that doesn’t cooperate.

This one, on the other hand, is elegant and scalable:

if (messedUp($_SERVER["REQUEST_URI"])) {
header(”X-Canonical-URI: $canonicalUri”);
}

Or:
header(”Link: <http://example.com/canonical-uri/>; rel=canonical”);

Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.

By the way, having all URI canonicalization routines in one piece of code is way more transparent, and way better manageable, than a bazillion of isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.

red crab blackmailing search enginesDear search engines, if you make that happen, I promise that I don’t tweet your products with a “#crap” hashtag for the whole rest of this year. Deal?

And yes, I know I’m somewhat late, two days before x-mas, but you’ve got smart developers, haven’t you? So please, go get your ‘code monkeys’ to work and surprise me. Thanks.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Handle your (UGC) feeds with care!

When you run a website that deals with user generated content (UGC), this pamphlet is for you. #bad_news Otherwise you might enjoy it. #malicious_joy

Spam in your feeds

Jump station

Not that the recent –and ongoing– paradigm shift shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I’ll blather about pitfalls, leaving chances unmentioned.

Background

For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.

More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling http://twitter.com/your-user-name or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.

Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.

Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That’s way cheaper, and faster, than polling a gazillion of sources for updates 24/7/365.

Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.

Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).

Traditional discovery crawling, indexing, and ranking doesn’t exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion of computers to this task. Also, submission based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML-sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.

Nothing is like it used to be. Change happens.

Why does this paradigm shift puts your site at risk?

Spam puts your feeds at riskAs a matter of fact, when you publish user generated content, you will get spammed. Of course, that’s bad news of yesterday. Probably you’re confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot it’s spam that slipped through your filters.

Bad news is, your medieval palisade won’t protect you from a 21th century tank attack with air support. Why not? Because you’ve secured your HTML presentation layer, but not your feeds. There’s no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.

Feed items come with a high risk. Once they’re released, they’re immortal and multiply themselves like rabbits. That’s bad enough in case a pissed employee ‘accidently’ publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, and be it only for a few milliseconds.

If your content management system (CMS) creates a feed item on submission, search engines –in good company with legions of Web services– will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.

Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed reader flooded with PPC spam), and search engines as well (you send them weird spam signals that rise all sorts of red flags). Also, it’s not desirable to make social media services –that you rely on for marketing purposes– too suspicious (trigger happy anti-spam algos might lock away your site’s base URI in an escape-proof dungeon).

So what can you do to prevent your feeds from unwanted content?

Protect your feedsBefore I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might haven’t considered yet:

  • No nay never use integers as IDs before you’re dead sure that a piece of submitted content is floral white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren’t predictable and therefore not that vulnerable. Once a content submission is finally approved, you can donate it a nice –maybe even meaningful– URI.
  • No nay never use titles, subjects or so in URIs, not even converted text from submissions (e.g. ‘My PPC spam’ ==> ‘my_ppc_spam’). Why not? See above. And you don’t really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can’t fetch, or which they can’t refetch, at least for a while.
  • Before the final approval, serve submitted content with a “noindex,nofollow,noarchive,nosnippet” X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section. Don’t rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide that they should crawl the link’s destination anyway.
  • Consider serving social media bots requesting a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don’t obey REP directives, that means they fetch and process your stuff regardless whether you yell “noindex” at them or not.
  • For all burned (disapproved) URIs that were in use ensure that your server returns a 410-Gone HTTP status code, respectively perform a 301 redirect to a policy page or so to rescue link love that would get wasted otherwise.
  • Your Web forms for content submissions should be totally AJAX’ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk throughs, for example by asking for something unusual when a step gets completed in a too short period of time. With AJAX calls that’s painless for the legit user. Do not accept content submissions via standard GET or POST requests.
  • Serve link builders coming from SERPs for [URL|story|link submit|submission your site’s topic] etc. your policy page, not the actual Web form.
  • There’s more. With the above said, I’ve just begun to scrape the surface of a savvy spammer’s technical portfolio. There’s next to nothing a clever programmed bot can’t mimick. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.

Having said that, lets proceed to feed protection tactics. Actually, there’s just one principle set in stone:

Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!

The interesting question is: what the heck is a “final approval”? Well, that depends on your needs. Ideally, that’s you releasing each and every piece of submitted content. Since this approach doesn’t scale, think of ways to semi-automate the process. Don’t fully automate it, there’s no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.

Each and every content submission must survive a probation period, whereas it will not be included in your site’s feeds. Regardless who contributed it. Stick with the four-eye principle. Here are a few generic procedures you could adapt, respectively ideas which could inspire you:

  • Queue submissions. Possible queues are Blocked, Quarantaine, Suspect, Probation, and finally Released. Define simple rules and procedures that anyone involved can follow. SOPs lack work arounds and loopholes by design.
  • Stuck content submissions from new users who didn’t participate in other ways in quarantaine. Moderate this queue and only manually release into the probation queue what passes the moderator’s heuristics. Signup-submit-and-forget is a typical spammer behavior.
  • Maintain black lists of domain names, IPs, countries, user agent names, unwanted buzzwords and so on. Use filters to arrest submissions that contain keywords you wouldn’t expect to match your site’s theme in the Blocked or Quarantaine queue.
  • On submission fetch the link’s content and analyze it, don’t stick with heuristic checks of URIs, titles and descriptions. Don’t use methods like PHP’s file_get_contents that don’t return HTTP response codes. You need to know whether a requested URI is the first one of a redirect chain, for example. Double check with a second request from another IP, preferably owned by a widely used ISP, with a standard browser’s user agent string, that provides an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission’s suggested anchor text. If the returned content differs too much, set a red flag.
  • Maintain white lists, too. That’s a great way to reduce the amount of inavoidable false positives.
  • If you have editorial staff or moderators, they should get a Release to Feed  button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like “release to feed if mod-release = true and num-trusted-votes > 10″.
  • Categorize your user’s reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.
  • Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.
  • With all automatted releases, for example based on votings, oops, especially based on votings, implement at least one additional sanity check. For example discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votings, and so on.

Disclaimer: That’s just some food for thoughts. I want to make absolutely clear that I can’t provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Derek Powazek outed himself big-mouthed and ignorant, and why that’s a pity

Derek PowazekWith childish attacks on his colleagues, Derek Powazek didn’t do professional Web development –as an industry– a favor. As a matter of fact, Derek Powazek insulted savvy Web developers, Web designers, even search engine staff, as well as useability experts and search engine specialists, who team-up in countless projects helping small and large Web sites succeed.

I seriously can’t understand how Derek Powazek “has survived 13 years in the web biz” (source) without detailled knowledge of how things get done in Web projects. I mean, if a developer really has worked 13 years in the Web biz, he should know that the task of optimizing a Web site’s findability, crawlability, and accessibility for all user agents out there (SEO) is usually not performed by “spammers evildoers and opportunists”, but by highly professional experts who just master Web development better than the average designer, copy-writer, publisher, developer or marketing guy.

Boy, what an ego. Derek Powazek truly believes that if “[all SEO/SEM techniques are] not obvious to you, and you make websites, you need to get informed” (source). That translates to “if you aren’t 100% perfect in all aspects of Web development and Internet marketing, don’t bother making Web sites — go get a life”.

Derek PowazekWell, I consider very few folks capable of mastering everything in Web development and Internet marketing. Clearly, Derek Powazek is not a member of this elite. With one clueless, uninformed and way too offensive rant he has ruined his reputation in a single day. Shortly after his first thoughtless blog post libelling fellow developers and consultants, Google’s search result page for [Derek Powazek] is flooded with reasonable reactions revealing that Derek Powazek’s pathetic calls for ego food are factually wrong.

Of course calm and knowledgable experts in the field setting the records straight, like Danny Sullivan (search result #1 and #4 for [Derek Powazek] today) and Peter da Vanzo (SERP position #9), can outrank a widely unknown guy like Derek Powazek at all major search engines. Now, for the rest of his presence on this planet, Derek Powazek has to live with search results that tell the world what kind of an “expert” he really is (example ).

He should have read Susan Moskwa’s very informative article about reputation management on Google’s Blog a day earlier. Not that reputation management doesn’t count as an SEO skill … actually, that’s SEO basics (as well as URI canonicalization).

Dear Derek Powazek, guess what all the bright folks you’ve bashed so cleverly will do when you ask them to take down their responses to your uncalled-for dirty talk?

So what can we learn from this gratuitous debacle? Do not piss in someone’s roses when

  • you suffer from an oversized ego,
  • you’ve not the slightest clue what you’re talking about,
  • you can’t make a point with proven facts, so you’ve to use false pretences and clueless assumptions,
  • you tend to insult people when you’re out of valid arguments,
  • willy whacking is not for you, because your dick is, well, somewhat undersized.

Ok, it’s Friday evening, so I’m supposed to enjoy TGIF’s. Why the fuck am I wasting my valuable spare time writing this pamphlet? Here’s why:

Having worked in, led, and coached WebDev teams on crawlability and best practices with regard to search engine crawling and indexing for ages now, I was faced with brain amputated wannabe geniuses more than once. Such assclowns are able to shipwreck great projects. From my experience the one and only way to keep teams sane and productive is sacking troublemakers at the moment you realize they’re unconvinceable. This Powazek dude has perfectly proven that his ignorance is persistent, and that his anti-social attitude is irreversible. He’s the prime example of a guy I’d never hire (except if I’d work for my worst enemy). Go figure.


Update 2009-10-19: I consider this a lame excuse. Actually, it’s even more pathetic than the malicious slamming of many good folks in his previous posts. If Derek Powazek really didn’t know what “SEO” means in the first place, his brain farts attacking something he didn’t understand at the time of publishing his rants are indefensible, provided he was anything south of sane then. Danny Sullivan doesn’t agree, and he’s right when he says that every industry has some black sheep, but as much as I dislike comment spammers, I dislike bullshit and baseness.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Less is more. Google Chrome is my preferred browser. Here’s why:

Recently I’ve bitched a lot, especially tearing utterly useles microformats that the InterWeb really doesn’t need (rel-nofollow, common tag …). Naturally, those pamplets get noticed as Google search engine bashing. Wait. Of course not everything a search engine company launches is crap. Actually, I do love –and use daily– lots of awesome services provided by Google, Yahoo & Co.

Google Chrome BrowserWhilst some SE engineers –probably due to my endless rants– have unsubscribed from my various you-porn social media streams, others have noticed that there’s also laudatory stuff in a grumpy old fart’s Twitter output, and asked for input. Thank you! Dear Johannes Müller, bypassing your WebForm, here is my greedy Google Chrome wish list (you do know the goodies yourself, hence I skip the cute stuff I should praise).

I’ll focus on functionality that I like or miss as a plain user, but I can’t resist to mention a few geeky thingies upfront. As a developer I do love that Chrome doesn’t die on faulty scripts (or on .htpasswd protected pages during startup with session restore like the current FF … on evaling perfectly valid JavaScript code from a server’s response to an AJAX request that exceeds 50 or 65k like IE …). Also, the debugging facilities are awesome (although I still can’t throw away Firebug and a few more FireFox plug-ins). I very much appreciate Chrome’s partial HTML-5 support, but besides neat video controls I’d love to see it render plain HTML-5 stuff like CITE attributes in Q elements correctly (4.7.3. User agents should allow users to follow such citation links), even when DOCTYPE says HTML 4.x or XHTML. ;)

WebKit is great, but it comes with disadvantages. Try to put radio buttons in a SPAN or DIV element with CSS controlling horizontal/vertical appearance as well as special label formats –instead of a RADIO-GROUP– and you’re toast. FF can handle that. Or set the MULTIPLE attribute of a SELECT element to FALSE (instead of ommitting it for combo-boxes) and you’ll suffer from select lists that you just can’t handle as a user, because WebKit (as well as other layout engines!) doesn’t render the element as a drop down list any more. Of course that’s non-standard coding, but stuff like that isn’t really uncommon on the Web. Just because other layout engines handle crap like this equally wrong, that doesn’t mean that the WebKit version used by Google Chrome must come with the same maladies, right?

What totally annoys me is that on the WordPress /wp-admin/post.php page the plus icons of “Post Slug” or “Post Status” just don’t work with Chrome. That means I’ve to fire up FF only to type in a value in a form field that Google Chrome sneakily hides from me. Nasty. Really nasty. Don’t tell me that I’m using an outdated WordPress version. I do know that, but I won’t upgrade because WP 0.87 (beta) perfectly fits my needs.

Ok, what do I like as a user? Google Chrome is lean and very easy to use, it eats less memory than any other browser I allow on my machines, and it executes JavaScript as well as nifty rounded corners amazingly fast. Because –at least with the naked version– I can’t install a gazillion of add-ons, I usually see complete landing pages rendered — instead of just the H1, an advertisement, and the very first P element along with an 1/6 clip of an image or video, because all the FF toolbars occupy nearly 3/4 of the browser window’s height. Try FireFox with a few plug-ins vs. Chrome on a machine running 1024*768 (not that unusual when traveling) and you’re convinced in a fraction of a nanosecond.

Now that I’ve completely switched to Chrome, at least at home (at work I have to test my stuff with everything except IE because that’s a not supportable user agent), I preferably sooner than later do want the FireFox nuggets. Dear Google Chrome developers, please find a way to extract the most wanted stuff from FF plug-ins. You can implement those as right-click popup menus, as well as an one-line toolbar (not stealing too much screen real estate), or both, or otherwise. It’s not too hard to detect that a user has a delicious or stumble-upon account (you read the cookies anyway …). You easily could show icons for the core functionality of such services, along with context sensitive menus enabling the whole functionality of a particular service as provided with overcrowded toolbars in other browsers. Examples:

Delicious  An icon “Remember this” to submit a page to delicious is enough, when “my delicious” and so on is available via context menu.
StumbleUpon  The same goes for StumbleUpon. Two icons, thumbs-up and thumbs-down, would provide 99% of the functionality I need quite often. Ok, my thumbs-down votes are rare, so you can even dump the second one.
TinyUrl  How cool would it be to create a tiny URI for the current tab with just one click?
PrefBar checkboxes  Next up, please feel somewhat challenged by PrefBar, an instrument I really can’t miss on the long haul.PrefBar combo-boxes 
Switching user agent strings, faking referrers, checking out Web pages without cookies, JavaScript and so on is a must have. Ok, I admit that’s geek stuff, so take it as an example transferable to some girlish stuff I refuse to recognize in my monster’s Web browsers.
Twitter  Also, let’s not forget Twitter, blogstuff and whatnot.
Imagine your preferred services, iconized in a one-line toolbar configurable compiled from single items of various 3rd party toolbars available on the InterWeb (of course you should enable Google Toolbar icons too). How cool would that be, in comparisation to the bookmarklets I must live with now?

Google Chrome bookmarklets

 

Context-menu stuff like “image properties” et cetera –as well known from other browsers– would be very helpful too. “Inspect element” is really neat and informative (for geeks), but way too complicated for the average user.

Another issue is Chrome’s lack of “Babylon functionality”. I want to configure my native language as well as a preferred language (read that as “at least one“). Say I’ve set native language to de-DE and preferred language to en-US, then when hovering a word or phrase on any Web object, I want to see a tooltip displaying the english translation from whatever gibberish the Web page is written in (of course for english text I’d expect the german translation); and when I select a piece of text I want to read the german (english) translation on right-click:translate in a popup dialog that allows copying to the clipboard as well as changing languages. I know you’ve the technology at your hands.

Oh, and please disable the defaulted DNS caching, that’s a royal PITA when you mostly consume dynamic contents because lots of previously visited URIs get displayed as error messages. Also, “reload” should pull images again, replacing their cached copies; right-click:reload should reposition to the current viewpoint.

I’d like to have “project windows”, that is on-demand Chrome windows loaded with particular tabs with URIs I’ve previouisly saved from a window under a project name. Those shouldn’t come up when I’ve set “load previous session at start-up”, but only when I want to restore such a window.

After a quite longish test phase I’d say that Google Chrome’s advantages beat the lack of functionality with ease. Pretty often the snipping of a particular commonly supplied feature (like search boxes in toolbars) dramatically enhances Chrome’s usability. Chrome’s KISS approach kicks ass. And I see it evolve.

Now that you’ve read my appraisal and suggestions, please consider picking a few items from my t-shirt wish list. You know, I’ve promised to link out to everybody sending me a (geeky|pornographic|funny|) XX(X)L t-shirt that I really like. ;) Just in case you’re not the type of reader who buys the author of a pamphlet a t-shirt, please subscribe to my RSS feed. Thanks.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

  1 | 2 | 3 | 4 | 5  Next Page »