Archived posts from the 'E-Commerce' Category

About time: EU crumbles monster cookies from hell

Some extremely bright bravehearts at the European Union headquarters in Bruxelles finally took the initiative and launched a law to fight the Interweb's gremlins, which play down their sneaky and destructive behavior behind an innocent as well as inapplicable term: COOKIE.

Back in the good old days when every dog and its fleas used Internet Explorer to consume free porn, and to read unbalanced, left-leaning news published by dubious online tabloids based in communist strongholds, only the Vatican spread free cookies to its occasional visitors. Today, every site you visit makes use of toxic cookies, thanks to Google Web Analytics, Facebook, Amazon, eBay and countless smut peddlers.

Not that stone age browsers like IE6 could handle neat 3rd party cookies (that today's advertising networks use to shove targeted product news down your throat) without a little help from a 1×1 pixel iFrame, but that's a completely different story. The point I want to bring home is: cookies were never harmless at all!

Quite the opposite is true. As a matter of fact, Internet cookies pose as digestible candies, but once swallowed they turn into greedy germs that produce torturous flatulence, charge your credit card with overpriced Rolex® replicas and other stuff you really don't need, and spam all your email accounts until you actually need your daily dose of Viagra® to handle all the big boobs and stiff enlarged dicks delivered to your inbox.

Now that you're well informed on the increasing cookie pest, a.k.a. cookie pandemic, I'm dead sure you'll applaud the EU anti-cookie law that's going to be enforced by the end of May 2012, world-wide. Common sense and life experience tell us that local laws can indeed tame the Wild Wild West (WWW).

Well, at least in the UK, so far. That's quite astonishing by the way, because usually the UK vetoes or boycotts everything EU, until their lowbrow-thinking and underpaid lawyers discover that previous governments have already signed some long-forgotten contracts defining EU regulations as binding for tiny North Sea islands, even if they're located somewhere in the Atlantic Ocean and consider themselves huge.

Anyway, although literally nobody except a few Web-savvy UK webmasters (certainly not its creators, who can't find their asshole with both hands fumbling in the dark) knows what the fuck this outlandish law is all about, we need to comply. For the sake of our unborn children, civic duty, or whatever.

Of course you can't be bothered with researching all this complex stuff. Unfortunately, I can't link to authoritative sources, because not even the almighty Google told me how alien webmasters can implement a diffuse EU policy that hasn't made it into the code of law of any EU member state yet (except for the above-mentioned remote islands, though even those have no fucking clue with regard to reasonable procedures and such). That makes this red crab the authoritative source on everything 'EU cookie law'. Sigh.

So here comes the ultimate guide for this planet's webmasters who'd like to do business with EU countries (or suffer from an EU citizenship).

Step 1: Obfuscate your cookies

In order to make your most sneaky cookies undetectable, flood your visitor's computer with a shitload of randomly generated and totally meaningless cookies. Make sure that everything important for advertising, shopping cart, user experience and so on gets set first, because the 1024th and all following cookies face the risk of getting ignored by the user agent.

Do not use meaningful variable names for cookies, and do encode all values. That is, instead of setting added_not_exactly_willingly_purchased_items_to_shopping_cart[13] = golden_humvee_with_diamond_brake_pads just create an unobtrusive cookie like additional_discount_upc_666[13] = round(99.99, 0) + '%'.
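
If you need inspiration, here's a minimal PHP sketch of such a decoy cookie flood (the cookie names, the 50 decoys and the one-year expiry are, of course, made up):
<?php
// Set the cookie you actually care about first -- user agents cap the
// number of cookies per host, so the decoys may get dropped.
setcookie("additional_discount_upc_666", round(99.99, 0) . "%", time() + 31536000, "/");
// Now the random noise: meaningless names, meaningless values.
for ($i = 0; $i < 50; $i++) {
    $name  = "pref_" . substr(md5(mt_rand()), 0, 8);
    $value = md5(mt_rand());
    setcookie($name, $value, time() + 31536000, "/");
}
?>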

Step 2: Ask your visitors for permission to accept your cookies

When a new visitor hits your site, create a hidden popunder window with a Web form like this one:


  • Of course
  • Why not
  • Yes, and don't ask me again
  • Yup, get me to the free porn asap
  • I've read the TOS and I absolutely agree

Don’t forget to test the auto-submit functionality with all user agents (browsers) out there. Also, log the visitor’s IP addy, browser version and such stuff. Just in case you need to present it in a lawsuit later on.

Step 3: Be totally honest and explain every cookie to your visitors

Somewhere on a deeply buried TOS page linked from your privacy policy page that’s no-followed across your site with an anchor text formatted in 0.001pt, create an ugly table like this one:

My awesome Web site’s wonderful cookies:
  • _preg=true: This cookie makes you pregnant. Also, it creates an order for 100 diapers, XXS, assorted pink and blue, to be delivered in 9 months. Your PayPal account (taken from a befriended Yahoo cookie) gets charged today.
  • _vote_rig=conditional: If you've participated in a poll and your vote doesn't match my current mood, I'll email your mother-in-law that you're cheating on your spouse. Also, regardless of what awkward vote you've submitted, I'll change it in a way that's compatible with my opinion on the topic in question.
  • _auto_start=daily: Adds my product of the day page to your auto start group. Since I've collected your credit card details already, I'm nice enough to automate the purchase process in an invisible browser window that closes after I've charged your credit card. If you dare to not reboot your pathetic computer at least once a day, I'll force an hourly reboot in order to teach you how the cookie crumbles.
  • _joke=send: If you see this cookie, I found a .pst file on your computer. All your contacts will enjoy links to questionable (that is NotSafeAtWork) jokes delivered by email from your account, often.
  • _boobs=show: If you're a male adult, you've just subscribed to my 'weird boob job' paysite.
  • _dicks=show: That's the female version of the _boobs cookie. Also delivered to my gay readers, just the landing page differs a little bit.
  • _google=provided: You were thoughtless enough to surf my blog while logged into your Google account. You know, Google just stole my HTTP_REFERER data, so in revenge I took over your account in order to gather the personal and very private information the privacy nazis at Google don't deliver for free any more.
  • _twitter=approved: Just in case you check out your Twitter settings by accident, do not go to the 'Apps' page and do not revoke my permissions. The few DMs I've sent to all your followers only feed my little very hungry monsters, so please leave my tiny spam operation alone.
  • _fb=new: Heh. You zucker (pronounced sucker) lack a Facebook account. I've stepped in and assigned it to my various interests. Don't you dare to join Facebook manually, I do own your name!
  • _443=nope: Removes the obsolete 's' (SSL) from URIs in your browser's address bar. That's a prerequisite for my free services, like maintaining a backup of your Web mail as user generated content (UGC) in my x-rated movie blog's comment area. Don't whine, it's only visible to search engine crawlers, so your dirty little secrets are totally safe. Also, I don't publish emails containing Web site credentials, bank account details and such, because sharing those with my fellow blackhat webmasters would be plain silly.
  • eol=granted: Your right to exist has expired, coz your bank account's balance doesn't allow any further abuse. This status is also known as 'end of life'. Say thanks to the cookie community and get yourself a tombstone as long as you (respectively your clan, coz you went belly up by now) can afford it.

Because I’m somewhat lazy, the list above isn’t made up but an excerpt of my blog’s actual cookies.

As a side note, don't forget to collect local VAT (different percentages per EU country, depending on the type of goods you don't plan to deliver across the pond) from your EU customers, and do pay the taxman. If you have trouble finding the taxman in charge, ask your offshore bank for assistance.

Have fun maintaining a Web site that totally complies with international law. And thanks for your time (which you'd have been better off investing in a Web site that doesn't rely on cookies for a great user experience).

Summary: The stupid EU cookie law in 2.5 minutes:

If you still don’t grasp how an Internet cookie really tastes, here is the explanation for the geeky preschooler: RFC 2109.

By the way, this comprehensive tutorial might make you believe that only the UK has implemented the EU cookie law so far. Of course the Brits wouldn't have the balls to perform such a risky solo stunt without being accompanied by two tiny countries bordering the Baltic Sea: Denmark and Estonia (don't even try to find European mini-states and islands on your globe without a precision magnifier). As soon as the Internet comes to these piddly shorelines, I'll report on their progress (frankly, don't really expect an update anytime soon).




Act out your sophisticated affiliate link paranoia

My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google's request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:

Links purchased for advertising should be designated as such. This can be done in several ways, such as:
* Adding a rel=”nofollow” attribute to the <a> tag
* Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file

Also, Google doesn't define paid links that clearly, so try this paid link definition instead before you read on. Here is my linking guide for the paranoid affiliate marketer.

Google recommends hiding any content provided by affiliate programs from its crawlers. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant's data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program's sales tools page count as duplicate content (snippets) in the worst case.

Pasting code copied from a merchant's site into a page's or template's HTML is not exactly a smart way to place ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you're just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts.

There are tons of such scripts out there, so I don’t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so when you’ve a WordPress blog you can adapt them with ease.

Cover your ass with a linking policy

Googlers as well as hired guns do review Web sites for violations of Google’s guidelines, also competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) full disclosure of your linking attitude can help to pass a human review by search engine staff. By the way, having a policy for dofollowed blog comments is also a good idea.

Since crawler directives like link condoms are for search engines (only), and those pay attention to your source code as well as to hints addressing search engines like robots.txt, you should leave a note there too; look into the source of this page for an example.
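
The note could be something along these lines (wording and the /linking-policy URL are mine; adapt them to your own policy):
<!--
  Note to search engine staff reviewing this page:
  Paid links and affiliate links on this site carry machine-readable
  disclosures on crawler requests (rel="nofollow", robots.txt-blocked
  ad scripts). The human-readable version lives at /linking-policy.
-->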

Block crawlers from your propaganda scripts

Put all your stuff related to advertising (scripts, images, movies…) in a subdirectory and disallow search engine crawling in your /robots.txt file:
User-agent: *
Disallow: /propaganda/

Of course you'll use an innocuous name like "gnisitrevda" for this folder, which lacks a default document and can't be browsed because you've got an
Options -Indexes

statement in your .htaccess file. (Watch out, Google knows what “gnisitrevda” means, so be creative or cryptic.)

Crawlers sent out by major search engines do respect robots.txt, hence it's guaranteed that regular spiders don't fetch it. As long as you don't cheat too much, you're not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn't protect you from search engine staff surfing your site, but I won't tell you things you'd better hide from Matt's gang.

Detect search engine crawlers

Basically there are three common methods to detect requests by search engine crawlers.

  1. Testing the user agent name (HTTP_USER_AGENT) for strings like "Googlebot", "Slurp", "MSNbot" or so which identify crawlers. That's easy to spoof; for example, PrefBar for Firefox lets you choose from a list of user agents.
  2. Checking the user agent name, and only when it indicates a crawler, verifying the requestor’s IP address with a reverse lookup, respectively against a cache of verified crawler IP addresses and host names.
  3. Maintaining a list of all search engine crawler IP addresses known to man, checking the requestor’s IP (REMOTE_ADDR) against this list. (That alone isn’t bullet-proof, but I’m not going to write a tutorial on industrial-strength cloaking IP delivery, I leave that to the real experts.)

For our purposes we use methods 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won't activate your account without testing your links. When crawlers try to follow affiliate links on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have an impact on the merchants' search engine rankings, and that's evil in Google's eyes.

We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name if the user agent name identifies a major search engine's spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor's IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does primitive caching in a flat file, so that once a crawler has been verified on its very first content request, it can be detected from this cache, avoiding pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn't match a crawler name.

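A bare-bones sketch of how the two helpers could be implemented looks like this; the crawler/host list, the cache file path and the log destination are assumptions, not the exact code running on this blog:
<?php
// Returns TRUE if the user agent name looks like a major search engine
// crawler, and remembers the host suffix we'd expect its IP to resolve to.
function checkCrawlerUA() {
    global $expectedCrawlerHost;
    $ua = getenv("HTTP_USER_AGENT");
    $crawlers = array(
        "Googlebot" => ".googlebot.com",
        "Slurp"     => ".crawl.yahoo.net",
        "msnbot"    => ".search.msn.com"
    );
    foreach ($crawlers as $name => $hostSuffix) {
        if (stristr($ua, $name)) {
            $expectedCrawlerHost = $hostSuffix;
            return TRUE;
        }
    }
    return FALSE;
}
// Returns TRUE only if the user agent claims to be a crawler AND the
// requestor's IP verifies via reverse/forward DNS lookup. Verified IPs
// get cached in a flat file to avoid slow DNS lookups on every request.
function checkCrawlerIP($logString) {
    global $expectedCrawlerHost;
    if (!checkCrawlerUA()) return FALSE;
    $ip = $_SERVER["REMOTE_ADDR"];
    $cacheFile = "/tmp/verified_crawler_ips.txt"; // assumption: adjust path
    $cache = file_exists($cacheFile) ? file($cacheFile, FILE_IGNORE_NEW_LINES) : array();
    if (in_array($ip, $cache)) return TRUE;
    $host = gethostbyaddr($ip);
    $suffixOk = (substr($host, -strlen($expectedCrawlerHost)) == $expectedCrawlerHost);
    if ($suffixOk && gethostbyname($host) == $ip) {
        file_put_contents($cacheFile, "$ip\n", FILE_APPEND);
        error_log(date("c") . " verified crawler $ip $host $logString\n", 3, "/tmp/crawler.log");
        return TRUE;
    }
    return FALSE;
}
?>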

Grab and implement the PHP source, then you can code statements like
$isSpider = checkCrawlerUA();
...
$relAttribute = "";
if ($isSpider) {
    $relAttribute = " rel=\"nofollow\" ";
}
...
$affLink = "<a href=\"$affUrl\" $relAttribute>call for action</a>";

or
$isSpider = checkCrawlerIP($sponsorUrl);
...
if ($isSpider) {
    // don't redirect to the sponsor; return a 403 or 410 instead
}

More on that later.

Don’t deliver your advertising to search engine crawlers

It's possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense's script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can't guarantee indexing, because Google doesn't index shitloads of templated pages which hide duplicate content as well as ads from crawling without carrying a single piece of somewhat compelling content.

Here is how you could output a totally uncrawlable banner ad:
...
$isSpider = checkCrawlerIP($PHP_SELF);
...
print "<div class=\"css-class-sidebar robots-nocontent\">";
// output RSS buttons or so
if (!$isSpider) {
    print "<script type=\"text/javascript\" src=\"http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner\"></script>";
    ...
}
...
print "</div>\n";
...

Let's look at the code above. First we detect crawlers "without doubt" (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-'.crawl.yahoo.net' host but another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I've seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler UA).

Currently the robots-nocontent class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn't conflict with class names used in your CSS, because each X/HTML element can have an unlimited list of space-delimited class names. Like Google's section targeting, that's a crappy crawler directive, though. However, it doesn't hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), "buy now"/"view basket" links or references to TOS pages and the like, templated text like terms of delivery (but not the street address provided for local search) … and of course ads.

Ads aren't output when a crawler requests a page. Of course that's cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren't relevant for indexing, that's not deceitful cloaking in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster's arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers.

Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren’t penalized. Since Web robots can’t pull out the plastic, there’s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (Example from Google: “Allow search bots to crawl your sites without session IDs or arguments that track their path through the site” is a call for search engine friendly URL cloaking.)
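
Here's a crude sketch of that kind of search engine friendly URL cloaking, assuming a tracking parameter named sid and the checkCrawlerUA() helper sketched above; your parameter names will differ:
<?php
// If a crawler requests a URL carrying a session/tracking variable,
// 301 it to the canonical URL without that parameter.
if (checkCrawlerUA() && isset($_GET["sid"])) {
    $params = $_GET;
    unset($params["sid"]);
    $canonicalUrl = "http://sebastians-pamphlets.com" . strtok($_SERVER["REQUEST_URI"], "?");
    if (!empty($params)) {
        $canonicalUrl .= "?" . http_build_query($params);
    }
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $canonicalUrl");
    exit;
}
?>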

Is hiding ads from crawlers “safe with Google” or not?

Cloaking ads away is a double-edged sword from a search engine's perspective. Interpreted way too strictly, that's against the cloaking rule which states "don't show crawlers other content than humans", and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google's section targeting or Yahoo's robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don't count them as on-the-page contents.

Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don’t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify MFA pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings.

Since in theory the contents of JavaScript ads aren't considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can't be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can't argue seriously that hiding such tiny signals –which they use for the sole purpose of possible downranks– is against their guidelines.

For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn't mean they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.

Dear search engines, if it does not count, then you cannot demand to keep it crawlable. Well, a few super mega white hat trolls might disagree, and depending on the implementation on individual sites maybe hiding ads isn't totally riskless in any case, so decide for yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but I don't try to hide the fact that I run ads on this blog.

Usually I don't argue with fair vs. unfair, because we talk about war business here, which means that anything goes. However, Google does everything to talk the whole Internet into disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. ;) (Of course that's stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won't care about bad press. So please don't take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)

Outputting ads with JavaScript, preferably in iFrames

Delivering adverts with JavaScript does not mean that one can’t use server sided scripting to adjust them dynamically. With content management systems it’s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn’t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval’ing the PHP code).

Let's see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:

And here is the JS call of the PHP script:
<script type="text/javascript" src="http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner"></script>

The PHP script /propaganda/output.js.php evaluates the query string to pull the requested ad’s components. In case it’s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor’s location and whatnot, but that’s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad’s links. Once the HTML code is ready, it outputs a JavaScript statement:
document.write('<a href="http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&adServed=banner" title="DOWNLOAD THE BOOK ON SEO!"><img src="http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif" width="468" height="60" border="0" alt="The only current book on SEO" title="The only current book on SEO" /></a>');
which the browser executes within the script tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag.
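
For illustration, a stripped-down output.js.php could look like the sketch below; the real script pulls ad components and fallbacks from a database, while here the lookup is hard-coded and checkCrawlerUA() is the helper sketched earlier:
<?php
// /propaganda/output.js.php -- sketch, not the actual script.
// require_once "crawlercheck.php"; // the checkCrawlerUA() sketch from above
header("Content-Type: text/javascript");
$adName   = preg_replace("/[^a-z0-9_-]/i", "", isset($_GET["adName"]) ? $_GET["adName"] : "");
$adServed = preg_replace("/[^a-z0-9_-]/i", "", isset($_GET["adServed"]) ? $_GET["adServed"] : "");
// In real life: look up the ad (and an alternative if it's expired) by
// $adName/$adServed. Hard-coded here for brevity.
$adHref = "http://sebastians-pamphlets.com/propaganda/router.php?adName=$adName&adServed=$adServed";
$adImg  = "http://sebastians-pamphlets.com/propaganda/$adName/468-60.gif";
$adAlt  = "The only current book on SEO";
// Condomize the link on crawler requests only.
$relAttribute = checkCrawlerUA() ? " rel=\"nofollow\"" : "";
// Double quotes inside the HTML, single quotes as document.write() delimiters.
$html = "<a href=\"$adHref\"$relAttribute><img src=\"$adImg\" width=\"468\" height=\"60\" border=\"0\" alt=\"$adAlt\" /></a>";
print "document.write('" . str_replace("'", "\\'", $html) . "');\n";
?>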

Matt Cutts said that JavaScript links don’t prevent Googlebot from crawling, but that those links don’t count for rankings (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can’t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it’s fetchable by crawlers, so I wouldn’t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That’s why I use rel-nofollow –on crawler requests– even in JS ads.

Change your user agent name to Googlebot or so, install Matt’s show nofollow hack or something similar, and you’ll see that the affiliate-URL gets nofollow’ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can’t offend my readers with nofollow’ed links (although I don’t care about crawler spoofing, actually that’s a good procedure to let advertisers check out my linking attitude).

We'll look at the affiliate URL from the code above later on; first let's discuss other ways to make ads more search engine friendly. Search engines don't count pages displayed in iFrames as on-page contents, especially not when the iFrame's content is hosted on another domain. Here is an example straight from the horse's mouth:
<iframe name="google_ads_frame" src="http://pagead2.googlesyndication.com/pagead/ads?very-long-and-ugly-query-string" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" frameborder="0" height="90" scrolling="no" width="728"></iframe>
Between the opening and closing iframe tags (respectively in a noframes tag for old-school framesets) we could put a static ad for surfers using browsers which don't support frames/iFrames.

If for some reasons you don’t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to anybody, and just browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael’s Full Text/HTML Obfuscator Tool (hat tip to Donna).

Always redirect to affiliate URLs

There’s absolutely no point in using ugly affiliate URLs on your pages. Actually, that’s the last thing you want to do for various reasons.

  • For example, affiliate URLs as well as source codes can change, and you don’t want to edit tons of pages if that happens.
  • When an affiliate program doesn’t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you’d replace outdated ads completely with one mouse click or so.
  • Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own –often confusing– layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don’t track your outgoing traffic, some sponsors will cheat and you can’t prove your gut feelings.
  • Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don’t smell like affiliate URLs.

When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really manage your affiliate links. There are many more good reasons for utilizing ad servers, for example hiding your ads from smart search engines which might otherwise think that your advertising is overwhelming.

Affiliate links provide great footprints. Unique URL parts, respectively query string variable names, gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The values identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index respectively delivering minus points to the query engine. Also, that doesn't mean that affiliate URLs are the one and only trackable footprint Google relies on. But that's one trackable footprint you can avoid to some degree.

If the redirect-script's location is on the same server (in fact it's not, thanks to symlinks) and not named "adserver" or so, chances are that a heuristic check won't identify the link's intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren't crawlable with parameters, or similar "legit" stuff. However, you can't fool smart algos forever, but if you've got a good reason to hide ads, every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine's point of view, and users might agree on this).

Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because it doesn't benefit you (usually). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant's search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect if, by accident, they've found an affiliate link without a link condom.

The behavior of an adserver URL masking an affiliate link

Let's look at the redirect-script's URL from my code example above:
/propaganda/router.php?adName=seobook&adServed=banner
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert.

Now that we've covered the statistical requirements, router.php calls the checkCrawlerIP() function, setting $isSpider to TRUE only when both the user agent as well as the host name of the requestor's IP address identify a search engine crawler, and a forward DNS lookup of that host name resolves back to the requestor's IP addy.
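
The top of router.php then boils down to something like this sketch; the log format, the log file path and the use of the referrer as a stand-in for the carrying page's title and URL are simplifications of mine:
<?php
// /propaganda/router.php -- sketch of the logging and dispatch logic.
// require_once "crawlercheck.php"; // the checkCrawlerIP() sketch from above
$adName   = preg_replace("/[^a-z0-9_-]/i", "", isset($_GET["adName"]) ? $_GET["adName"] : "");
$adServed = preg_replace("/[^a-z0-9_-]/i", "", isset($_GET["adServed"]) ? $_GET["adServed"] : "");
$referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
// Log the click: timestamp, ad name, ad type, page carrying the advert.
$logLine = date("c") . "\t$adName\t$adServed\t$referrer\n";
@file_put_contents("/tmp/adclicks.log", $logLine, FILE_APPEND);
// TRUE only if UA *and* reverse/forward DNS lookups identify a crawler.
$isSpider = checkCrawlerIP($_SERVER["REQUEST_URI"]);
// ... 403 for crawlers, 307/302 redirect for everybody else (see below).
?>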

If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor’s landing page:
$sponsorUrl = "http://www.seobook.com/262.html";
$requestProtocol = $_SERVER["SERVER_PROTOCOL"];
$protocolArr = explode("/", $requestProtocol);
$protocolName = trim($protocolArr[0]);
$protocolVersion = trim($protocolArr[1]);
if (stristr($protocolName, "HTTP")
    && strtolower($protocolVersion) > "1.0") {
    $httpStatusCode = 307;
}
else {
    $httpStatusCode = 302;
}
$httpStatusLine = "$requestProtocol $httpStatusCode Temporary Redirect";
@header($httpStatusLine, TRUE, $httpStatusCode);
@header("Location: $sponsorUrl");
exit;

A 307 redirect avoids caching issues, because 307 redirects must not be cached by the user agent. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. User agents can cache the locations provided by 302 redirects, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can’t use the 301 response code, because 301 redirects are always cachable. (More information on HTTP redirects.)

If the requestor is a major search engine’s crawler, we perform the most brutal bounce back known to man:
if ($isSpider) {
    @header("HTTP/1.1 403 Sorry Crawlers Not Allowed", TRUE, 403);
    @header("X-Robots-Tag: nofollow,noindex,noarchive");
    exit;
}

The 403 response code translates to "kiss my ass and get the fuck outta here". The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn't provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine "forget this URL, don't request it again". Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code save you from URL-only listings on the SERPs (once the URL has been crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.

If you don't use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they would otherwise count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don't need to live in fear of penalties.

Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.

Don’t deliver different (editorial) contents to users and crawlers, but also don’t serve ads to crawlers. They just don’t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards able to understand Ajax, respectively authorized to fill out and submit Web forms.

Your ads look plain ugly with dotted borders in firebrick, hence don’t apply rel=”nofollow” to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.

No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers.

Do not allow search engine crawlers to follow your affiliate links, paid links, nor other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google’s job is not exactly funny.

I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you're not paranoid, that does not mean Google will not chase you down. You really don't need to be a so-called black hat to displease Google. Not knowing, respectively not understanding, Google's 12 commandments doesn't prevent you from being spanked for sins you've never heard of. If you're keen on Google's nicely targeted traffic, better play by Google's rules, leastwise on crawler requests.

Feel free to contribute your tips and tricks in the comments.




Google says you must manage your affiliate links in order to get indexed

I've worked hard to overtake the SERP positions of a couple of merchants allowing me to link to them with an affiliate ID, and now the almighty Google tells the sponsors they must screw me with internal 301 redirects to rescue their rankings. Bugger. Since I read the shocking news on Google's official Webmaster blog this morning, I've worked on a counter strategy, with success. Affiliate programs will not screw me, not even with Google's help. They'll be hoist by their own petard. I'll strike back with nofollow and I'll take no prisoners.

Seriously, the story reads a little different and is not breaking news at all. Maile Ohye from Google just endorsed best practices I’ve recommended for ages. Here is my recap.

The problem

Actually, there are problems on both sides of an affiliate link. The affiliate needs to hide these links from Google to avoid a so called “thin affiliate site penalty”, and the affiliate program suffers from duplicate content issues, link juice dilution, and often even URL hijacking by affiliate links.

Diligent affiliates gathering tons of PageRank on their pages can “unintentionally” overtake URLs on the SERPs by fooling the canonicalization algos. When Google discovers lots of links from strong pages on different hosts pointing to http://sponsor.com/?affid=me and this page adds ?affid=me to its internal links, my URL on the sponsor’s site can “outrank” the official home page, or landing page, http://sponsor.com/. When I choose the right anchor text, Google will feed my affiliate page with free traffic, whilst the affiliate program’s very own pages don’t exist on the SERPs.

Managing incoming affiliate links (merchants)

The best procedure is capturing all incoming traffic before a single byte of content is sent to the user agent, extracting the affiliate ID from the URL, storing it in a cookie, then 301-redirecting the user agent to the canonical version of the landing page, that is a page without affiliate or user specific parameters in the URL. That goes for all user agents (humans accepting the cookie and Web robots which don’t accept cookies and start a new session with every request).

Users not accepting cookies are redirected to a version of the landing page blocked by robots.txt, the affiliate ID sticks with the URLs in this case. Search engine crawlers, identified by their user agent name or whatever, are treated as users and shall never see (internal) links to URLs with tracking parameters in the query string.

This 301 redirect passes all the link juice, that is PageRank & Co. as well as anchor text, to the canonical URL. Search engines can no longer index page versions owned by affiliates. (This procedure doesn’t prevent you from 302 hijacking where your content gets indexed under the affiliate’s URL.)
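
A minimal sketch of that capture-and-redirect step, assuming the affiliate ID arrives as ?affid= and leaving out the branch for visitors who refuse cookies:
<?php
// Run before any content is sent: capture the affiliate ID, store it in
// a cookie, and 301 every user agent to the canonical landing page URL.
if (isset($_GET["affid"])) {
    $affId = preg_replace("/[^a-z0-9_-]/i", "", $_GET["affid"]);
    setcookie("affid", $affId, time() + 60 * 60 * 24 * 30, "/"); // 30 days; adjust to your program's terms
    $canonicalUrl = "http://sponsor.com" . strtok($_SERVER["REQUEST_URI"], "?");
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $canonicalUrl");
    exit;
}
?>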

Putting safe affiliate links (online marketers)

Honestly, there’s no such thing as a safe affiliate link, at least not safe with regard to picky search engines. Masking complex URLs with redirect services like tinyurl.com or so doesn’t help, because the crawlers get the real URL from the redirect header and will leave a note in the record of the original link on the page carrying the affiliate link. Anyways, the tiny URL will fool most visitors, and if you own the redirect service it makes managing affiliate links easier.

Of course you can cloak the hell out of your thin affiliate pages by showing the engines links to authority pages whilst humans get the ads, but then better forget the Google traffic (I know, I know … cloaking still works if you can handle it properly, but not everybody can handle the risks so better leave that to the experts).

There’s only one official approach to make a page plastered with affiliate links safe with search engines: replace it with a content rich page, of course Google wants unique and compelling content and checks its uniqueness, then sensibly work in the commercial links. Best link within the content to the merchants, apply rel-nofollow to all affiliate links, and avoid banner farms in the sidebars and above the fold.

Update: I’ve sanitized the title, “Google says you must screw your affiliates in order to get indexed” was not one of my best title baits.




Another way to implement a site search facility

Providing kick-ass navigation and product search is the key to success for e-commerce sites. Conversion rates highly depend on user friendly UIs which enable the shopper to find the desired product with a context sensitive search in combination with a few drill-down clicks on navigational links. Unfortunately, the built-in search as well as navigation and site structure of most shopping carts simply sucks. Every online store is different, hence findability must be customizable and very flexible.

I've seen online shops crawling their product pages with a 3rd party search engine script because the shopping cart's search functionality was totally and utterly useless. Others put fantastic efforts into self-made search facilities which perfectly implement real life relations beyond the limitations of the e-commerce software's data model, but need code tweaks for each and every featured product, special, virtual shop assembling a particular niche from several product lines, or whatever. Bugger.

Today I stumbled upon a very interesting approach which could become the holy grail for store owners suffering from crappy software. Progress invited me to discuss a product they’ve bought recently –EasyAsk– from a search geek’s perspective. Long story short, I was impressed. Without digging deep into the technology or reviewing implementations for weaknesses I think the idea behind that tool is promising.

Unfortunately, the EasyAsk Web site doesn’t provide solid technical and architectural information (I admit that I may have missed the tidbits within the promotional chatter), hence I try to explain it from what I’ve gathered today. Progress EasyAsk is a natural language interface connecting users to data sources. Users are shoppers, and staff. Data sources are (relational) databases, or data access layers (that is a logical tier providing a standardized interface to different data pools like all sorts of databases, (Web) services, an enterprise service bus, flat files, XML documents and whatever).

The shopper can submit natural language queries like “yellow XS tops under 30 bucks”. The SRP is a page listing tops and similar garments under 30.00$, size XS, illustrated with thumbnails of pics of yellow tops and bustiers, linked to the product pages. If yellow tops in XS are sold out, EasyAsk recommends beige tops instead of delivering a sorry-page. Now when a search query is submitted from a page listing suits, a search for “black leather belts” lists black leather belts for men. If the result set is too large and exceeds the limitations of one page, EasyAsk delivers drill-down lists of tags, categories and synonyms until the result set is viewable on one page. The context (category/tag tree) changes with each click and can be visualized for example as bread crumb nav link.

Technically speaking, EasyAsk does not deal with the content presentation layer itself. It returns XML which can be used to create a completely new page with a POST/GET request, or it gets invoked as an AJAX request whose response just alters DOM objects to visualize the search results (way faster but not exactly search engine friendly - that's not a big deal because SERPs shouldn't be crawlable at all). Performance is not an issue from what I've seen. EasyAsk caches everything so that the server doesn't need to bother the hard disk. All points of failure (WRT performance issues) belong to the implementation, thus developing a well-thought-out software architecture is a must-have.

Well, that's neat, but where's the USP? EasyAsk comes with a natural language (search) driven admin interface too. That means that product managers can define and retrieve everything (attributes, synonyms, relations, specials, price ranges, groupings …) using natural language. "Gimme gross sales of leather belts for men II/2007 compared to 2006" delivers a statistic, and "top is a synonym for bustier and the other way round" creates a relation. The admin interface runs in the Web browser, definitions can be submitted via forms, and all admin functions come with previews. Really neat. That reduces the workload of the IT dept. WRT ad-hoc queries as well as lots of structural change requests, and saves maintenance costs (Web design / Web development).

I've spotted a few weak points, though. For example, in the current version the user has to type in SKUs because there's no selection box. Or meta data are stored in flat files, but that's going to change too. There's no real word stemming; EasyAsk handles singular/plural correctly and interprets "bigger" as "big" or "xx-large" politically correctly as "plus", but typos must be collected from the "searches without results" report and defined as synonyms. The visualization of concurrent or sequentially applied business rules is just rudimentary on preview pages in the admin interface, so currently it's hard to track down why particular products get downranked respectively highlighted when more than one rule applies. Progress told me that they'll make use of 3rd party tools as well as in-house solutions to solve these issues in the near future - the integration of EasyAsk into the Progress landscape has just begun.

The definitions of business language / expected terms used by consumers as well as business rules are painless. EasyAsk has built-in mappings like color codes to common color names and vice versa, understands terms like "best selling" and "overstock", and these definitions are easy to extend to match actual data structures and niche specific everyday language.

Setting up the product needs consultancy (as a consultant I love that!). To get EasyAsk running it must understand the structure of the customer's data sources, respectively the methods provided to fetch data from various structured as well as unstructured sources. Once that's configured, EasyAsk pulls (database) updates on schedule (daily, hourly, minutely or whatever). It caches all information needed to fulfill search requests, but goes back to the data source to fetch real time data when the search query requires knowledge of not (yet) cached details. In the beginning such events must be dealt with, but after a (short) while EasyAsk should run smoothly without requiring much technical intervention (as a consultant I hate that, but the client's IT department will love it).

Full disclosure: Progress didn't pay me for this post. For attending the workshop I got two books ("Enterprise Service Bus" by David A. Chappell and "Getting Started with the SID" by John P. Reilly) and a free meal; travel expenses were not refunded. I did not test the software discussed myself (yet), so perhaps my statements (conclusions) are not accurate.




Follow-up: Erol’s patch fixing Google troubles

Erol developers are testing their first Google patch for sites hosted on UNIX boxes. You can preview it here: x55.html. When you request the page with a search engine crawler identifier as user-agent name, the JavaScript code redirecting to the frameset erol.html#55×0&& gets replaced with an HTML comment explaining why human visitors are treated differently from search engine spiders. The anatomy of this patch is described here, your feedback is welcome.

Erol told me they will be running tests on this site over the coming weeks, as they always do before going live with an update. So stay tuned for the release. When things run smoothly on UNIX hosts, a patch for Windows environments shall follow. On IIS the implementation is a bit trickier, because it needs changes to the server configuration. I'll keep you updated.




Erol to ship a Patch Fixing Google Troubles

Background: read these four posts on Google penalizing respectively deindexing e-commerce sites. Long story short: Recently Google’s enhanced algos began to deindex e-commerce sites powered by Erol’s shopping cart software. The shopping cart maintains a static HTML file which redirects user agents executing JavaScript to another URL. This happens with each and every page, so it’s quite understandable that Ms. Googlebot was not amused. I got involved as a few worried store owners asked for help in Google’s Webmaster Forum. After lots of threads and posts on the subject Erol’s managing director got in touch with me and we agreed to team up to find a solution to help the store owners suffering from a huge traffic loss. Here’s my report of the first technical round.

Understanding how Erol 4.x (and all prior versions) works:

The software generates an HTML page offline, which functions as an XML-like content source (called "x-page", I use that term because all Erol customers are familiar with it). The "x-page" gets uploaded to the server and is crawlable, but not really viewable. Requested by a robot, it responds with 200-Ok. Requested by a human, it does a JavaScript redirect to a complex frameset, which loads the "x-page" and visualizes its contents. The frameset responds to browsers if directly called, but returns a 404-NotFound error to robots. Example:

“x-page”: x999.html
Frameset: erol.html#999×0&&

To view the source of the “x-page” disable JavaScript before you click the link.

Understanding how search engines handle Erol’s pages:

There are two major weak points with regard to crawling and indexing. The crawlable page redirects, and the destination does not exist if requested by a crawler. This leads to these scenarios:

  1. A search engine ignoring JavaScript on crawled pages fetches the “x-page” and indexes it. That’s the default behavior of yesterdays crawlers, and still works this way at several search engines.
  2. A search engine not executing JavaScript on crawled pages fetches the “x-page”, analyzes the client sided script, and discovers the redirect (please note that a search engine crawler may change its behavior, so this can happen all of a sudden to properly indexed pages!). Possible consequences:
    • It tries to fetch the destination, gets the 404 response multiple times, and deindexes the "x-page" eventually. That would mean that depending on the crawling frequency and depth per domain the pages disappear quite fast or rather slowly until the last page is phased out. Google would keep a copy in the supplemental index for a while, but this listing cannot return to the main index.
    • It's trained to consider the unconditional JavaScript redirect "sneaky" and flags the URL accordingly. This can result in temporary as well as permanent deindexing.
  3. A search engine executing JavaScript on crawled pages fetches the "x-page", performs the redirect (thus ignores the contents of the "x-page"), and renders the frameset for indexing. Chances are it gives up on the complexity of the nested frames, indexes the noframes tag of the frameset and perhaps a few snippets from subframes, considers the whole conglomerate thin, hence assigns the lowest possible priority for the query engine and moves on.

Unfortunately the search engine delivering the most traffic began to improve its crawling and indexing, hence many sites formerly receiving a fair amount of Google traffic began to suffer from scenario 2 — deindexing.

Outlining a possible work around to get the deleted pages back in the search index:

In six months or so Erol will ship version 5 of its shopping cart, and this software dumps frames, JavaScript redirects and ugly stuff like that in favor of clean XHTML and CSS. By the way, Erol has asked me for my input on their new version, so you can bet it will be search engine friendly. So what can we do in the meantime to help legions of store owners running version 4 and below?

We've got the static "x-page" which should not get indexed because it redirects, and which cannot be changed to serve the contents itself. The frameset cannot be indexed because it doesn't exist for robots, and even if a crawler could eat it, we don't consider it easy-to-digest spider fodder.

Let’s look at Google’s guidelines, which are the strictest around, thus applicable for other engines as well:

  1. Don’t […] present different content to search engines than you display to users, which is commonly referred to as “cloaking.”
  2. Don’t employ cloaking or sneaky redirects.

If we find a way to suppress the JavaScript code on the "x-page" when a crawler requests it, the now more sophisticated crawlers will handle the "x-page" like their predecessors, that is they would fetch the "x-pages" and hand them over to the indexer without vicious remarks. Serving identical content under different URLs to users and crawlers does not contradict the first prescript. And we'd comply with the second rule, because loading a frameset for human visitors but not for crawlers is definitely not sneaky.

Ok, now how to tell the static page that it has to behave dynamically, that is, output different contents server sided depending on the user agent's name? Well, Erol's desktop software which generates the HTML can easily insert PHP tags too. The browser would not render those on a local machine, but who cares, as long as it works after the upload to the server. Here's the procedure for Apache servers:

In the root’s .htaccess file we enable PHP parsing of .html files:
AddType application/x-httpd-php .html

Next we create a PHP include file xinc.php which prevents crawlers from reading the offending JavaScript code:
<?php
$crawlerUAs = array("Googlebot", "Slurp", "MSNbot", "teoma", "Scooter", "Mercator", "FAST");
$isSpider = FALSE;
$userAgent = getenv("HTTP_USER_AGENT");
foreach ($crawlerUAs as $crawlerUA) {
    if (stristr($userAgent, $crawlerUA)) $isSpider = TRUE;
}
if (!$isSpider) {
    print "<script type=\"text/javascript\"> [a whole bunch of JS code] </script>\n";
}
if ($isSpider) {
    print "<!-- Dear search engine staff: we've suppressed the JavaScript code redirecting browsers to \"erol.html\", that's a frameset serving this page's contents more pleasant for human eyes. -->\n";
}
?>

Erol's HTML generator now puts <?php @include("xinc.php"); ?> instead of a whole bunch of JavaScript code.

The implementation for other environments is quite similar. If PHP is not available we can do it with SSI and PERL. On Windows we can tell IIS to process all .html extensions as ASP (App Mappings) and use an ASP include. That would give three versions of that patch which should help 99% of all Erol customers until they can upgrade to version 5.

This solution comes with two disadvantages. First, the cached page copies, clickable from the SERPs and toolbars, would render pretty ugly because they lack the JavaScript code. Second, perhaps automated tools searching for deceitful cloaking might red-flag the URLs for a human review. Hopefully the search engine executioner reading the comment in the source code will be fine with it and give it a go. If not, there’s still the reinclusion request. I think store owners can live with that when they get their Google traffic back.

Rolling out the patch:

Erol thinks the above makes sense and there is a chance of implementing it soon. While the developers are at work, please provide feedback if you think we didn't interpret Google's Webmaster Guidelines strictly enough. Keep in mind that this is an interim solution and that the new version will handle things in a more standardized way. Thanks.

Paid-Links-Disclosure: I do this pro bono job for the sake of the suffering store owners. Hence the links pointing to Erol and Erol’s customers are not nofollow’ed. Not that I’d nofollow them otherwise ;)




Follow-up on "Google penalizes Erol stores"

Background: these three posts on Google penalizing e-commerce sites.

Erol has contacted me and we will discuss the technical issues within the next days, or maybe weeks or so. I understand this as a positive signal, especially because previously my impression was that Erol was not willing to listen to constructive criticism, regardless of Google's shot across the bow (more on that later). We agreed that before we come to the real (SEO) issues, it's a good idea to render a few points made in my previous posts more precisely. In the following I quote parts of Erol's emails with permission:

Your blog has made for interesting reading but the first point I would like to raise with you is about the tone of your comments, not necessarily the comments themselves.

Question of personal style, point taken.

Your article entitled ‘Why eCommerce Systems Suck‘, dated March 12th, includes specific reference to EROL and your opinion of its SEO capability. Under such a generic title for an article, readers should expect to read about other shopping cart systems and any opinion you may care to share about them. In particular, the points you raise about other elements of SEO in the same article, (’Google doesn’t crawl search results’, navigation being ‘POST created results not crawlable’) are cited as examples of ways other shopping carts work badly in reference to SEO - importantly, this is NOT the way EROL stores work. Yet, because you do not include any other cart references by name or exclude EROL from these specific points, the whole article reads as if it is entirely aimed at EROL software and none others.

Indeed, that’s not fair. Navigation based solely on uncrawlable search results without crawler shortcuts, or sheer POST results, are definitely not issues I’ve stumbled upon while investigating penalized Erol-driven online stores. Google’s problem with Erol-driven stores is client-sided cloaking without malicious intent. I’ve updated the post to make that clear.

Your comment in another article, ‘Beware of the narrow-minded coders‘ dated 26 March where you state: “I’ve used the case [EROL] as an example of a nice shopping cart coming with destructive SEO.” So by this I understand that your opinion is EROL is actually ‘a nice shopping cart’ but it’s SEO capabilities could be better. Yet your articles read through as EROL is generally bad all round. Your original article should surely be titled “Why eCommerce Systems Suck at SEO” and take a more rounded approach to shopping cart SEO capabilities, not merely “Why eCommerce Systems Suck”? This may seem a trivial point to you, but how it reflects overall on our product and clouds it’s capability to perform its main function (provide an online ecommerce solution) is really what concerns me.

Indeed, I meant that Erol is a nice shopping cart lacking SEO capabilities as long as the major SEO issues don’t get addressed asap. And I mean in the current version, which clearly violates Google’s quality guidelines. From what I’ve read in the meantime, the next version, to be released in 6 months or so, should eliminate the two major flaws with regard to search engine compatibility. I’ve changed the post’s title; the suggestion makes sense to me too.

I do not enjoy the Google.co.uk traffic from search terms like “Erol sucks” or “Erol is crap” because that’s simply not true. As I said before, I think that Erol is a well-rounded piece of software nicely supporting the business processes it’s designed for, and the many store owners using Erol I’ve communicated with recently all tell me that too.

I noted with interest that your original article ‘Why eCommerce Systems Suck’ was dated 12th March. Coincidentally, this was the date Google began to re-index EROL stores following the Google update, so I presume that your article was originally written following the threads on the Google webmaster forums etc. prior to the 12th March where you had, no doubt, been answering questions for some of our customers about their de-listing during the update. You appear to add extra updates and information in your blogs but, disappointingly, you have not seen fit to include the fact that EROL stores are being re-listed in any update to your blog so, once again, the article reads as though all EROL stores have been de-listed completely, never to be seen again.

With all respect, nope. Google did not reindex Erol-driven pages; Google had just lifted a “yellow card” penalty for a few sites. That is not a carte blanche but, on the contrary, Google’s last warning before the site in question gets the “red card”, that is, a full ban lasting at least a couple of months or even longer. As said before, it means absolutely nothing when Google crawls penalized sites or when a couple of pages reappear on the SERPs. Here is the official statement: “Google might also choose to give a site a ‘yellow card’ so that the site can not be found in the index for a short time. However, if a webmaster ignores this signal, then a ‘red card’ with a longer-lasting effect might follow.”
(Yellow / red cards: soccer terminology, yellow is a warning and red the sending-off.)

I found your comments about our business preferring “a few fast bucks”, suggesting we are driven by “greed” and calling our customers “victims” particularly distasteful. Especially the latter, because you infer that we have deliberately set out to create software that is not capable of performing its function and/or not capable of being listed in the search engines and that we have deliberately done this in pursuit of monetary gain at the expense of reputation and our customers. These remarks I really do find offensive and politely ask that they be removed or changed. In your article “Google deindexing Erol driven ecommerce sites” on March 23rd, you actually state that “the standard Erol content presentation is just amateurish, not caused by deceitful intent”. So which is it to be - are we deceitful, greedy, victimising capitalists, or just amateurish and without deceitful intent? I support your rights to your opinions on the technical proficiency of our product for SEO, but I certainly do not support your rights to your opinions of our company and its ethics which border on slander and, at the very least, are completely unprofessional from someone who is positioning themselves as just that - an SEO professional.

To summarise, your points of view are not the problem, but the tone and language with which they are presented and I sincerely hope you will see fit to moderate these entries.

C’mon, now you’re getting polemic ;) In this post I’ve admitted to being polemic to bring my point home, and in the very first post on the topic I clearly stated that my intention was not to slander Erol. However, since you’ve agreed to an open discussion of the SEO flaws, I think it’s no longer suitable to call your customers victims, so I’ve changed that. Also, in my previous post I’ll insert a link near “greed” and “fast bucks” pointing to this paragraph to make it absolutely clear that I did not mean what you insinuate when I wrote:

Ignorance is no excuse […] Well, it seems to me that Erol prefers a few fast bucks over satisfied customers, thus I fear they will not tell their customers the truth. Actually, they simply don’t get it. However, I don’t care whether their intention to prevaricate is greed or ignorance, I really don’t know, but all the store operators suffering from Google’s penalties deserve the information.

Actually, I still stand by my provocative comments, because at that time they perfectly described the impression you’d created with your public actions, or rather your lack of appropriate ones:

  1. Critical customers asking in your support forums whether the loss of Google traffic might be caused by the way your software handles HTML output were put down and censored.
  2. Your public answers to worried customers were plain wrong, SEO-wise. Instead of “we take your hints seriously and will examine whether JavaScript redirects may cause Google penalties or not” you said that search engines index cloaking pages just fine, that Googlebot crawling penalized sites is a good sign, and that all the mess is some kind of Google hiccup. By that point the truth had been out long enough, so your most probably unintended disinformation worried a number of your customers and gave folks like me the impression that you’re not willing to undertake the necessary steps.
  3. Offering SEO services yourself, as well as forum talk praising Erol’s SEO experts, doesn’t put you in a “we just make great shopping cart software and are not responsible for search engine weaknesses” position. Frankly, that’s hardly conceivable as responsible management of customer expectations. It’s great that your next version will dump frames and JavaScript redirects, but that’s a bit too late in the eyes of your customers, and way too late from an SEO perspective, because Google never permitted the use of JavaScript redirects, and all the disadvantages of frames have been public knowledge since the glory days of Altavista, Excite and Infoseek, long before Google took over search.

To set the record straight: I don’t think, and never thought, that you’ve greedily or deliberately put your customers at risk in pursuit of monetary gain. You’ve just ignored Google’s guidelines and best practices of Web development for too long, but, as the sub-title of my previous post hints, ignorance is no excuse.

Now that we’ve handled the public relations stuff, I’ll look into the remaining information Erol sent over, hoping that I’ll be able to provide some reasonable input in the best interest of Erol’s customers.





Google deindexing Erol driven ecommerce sites

Follow-up post - see why e-commerce software sucks.

Erol is shopping cart software invented by DreamTeam, a UK-based Web design firm. One of its core features is the on-the-fly conversion of crawlable HTML pages into fancy JS-driven pages. Looks great in a JavaScript-enabled browser, and ugly without the client-sided formatting.

Erol, which offers not exactly cheap SEO services itself, claims that it is perfectly OK to show Googlebot a content page without gimmicks, whilst human users get redirected to another URL.

Erol victims suffer from deindexing of all Erol-driven pages; Google just keeps the pages in the index which do not contain Erol’s JS code. Considering how many online shops in the UK make use of Erol software, this massive traffic drop may have a visible impact on the gross national product ;) … Ok, sorry, joking with so many businesses at risk does not amuse the Queen.

Dear “SEO experts” at Erol, could you please read Google’s quality guidelines:

· Don’t […] present different content to search engines than you display to users, which is commonly referred to as “cloaking.”
· Don’t employ cloaking or sneaky redirects.
· If a site doesn’t meet our quality guidelines, it may be blocked from the index.

Google did your customers a favour by not banning their whole sites, probably because the standard Erol content presentation technique is (SEO-wise) just amateurish, not caused by deceitful intent. So please stop whining

We are currently still investigating the recent changes Google have made which have caused some drop-off in results for some EROL stores. It is as a result of the changes by Google, rather than a change we have made in the EROL code that some sites have dropped. We are investigating all possible reasons for the changes affecting some EROL stores and we will, of course, feedback any definitive answers and solutions as soon as possible.

and listen to your customers stating

Hey Erol Support
Maybe you should investigate doorway pages with sneaky redirects? I’ve heard that they might cause “issues” such as full bans.

Tell your victims, er, customers the truth, they deserve it.

Telling your customers that Googlebot crawling their redirecting pages will soon result in reindexing those pages is plain wrong, by the way. Just because the crawler fetches a questionable page doesn’t mean that the indexing process reinstates its accessibility for the query engine. Googlebot is just checking whether the sneaky JavaScript code was removed or not.

Go back to the whiteboard. See a professional SEO. Apply common sense. Develop a clean user interface that pleases human users and search engine robots alike. Without frames, sneaky (respectively superfluous) JavaScript redirects, and amateurish BS like that. In the meantime provide help and workarounds (for example a tutorial like “How to build an Erol shopping site without the page loading messages that result in search engine penalties”), otherwise you don’t need the revamp because your customer base will shrink to zilch.

Update: It seems that there’s a patch available. In Erol’s support forum, member Craig Bradshaw posts: “Erols new patch and instructions clearly tell customers not to use the page loading messages as these are no longer used by the software.”


Related links:
Matt Cutts August 19, 2005: “If you make lots of pages, don’t put JavaScript redirects on all of them … of course we’re working on better algorithmic solutions as well. In fact, I’ll issue a small weather report: I would not recommend using sneaky JavaScript redirects. Your domains might get rained on in the near future.”
Matt Cutts December 11, 2005: “A sneaky redirect is typically used to show one page to a search engine, but as soon as a user lands on the page, they get a JavaScript or other technique which redirects them to a completely different page.”
Matt Cutts September 18, 2005: “If […] you employ […] things outside Google’s guidelines, and your site has taken a precipitous drop recently, you may have a spam penalty. A reinclusion request asks Google to remove any potential spam penalty. … Are there […] pages that do a JavaScript or some other redirect to a different page? … Whatever you find that you think may have been against Google’s guidelines, correct or remove those pages. … I’d recommend giving a short explanation of what happened from your perspective: what actions may have led to any penalties and any corrective action that you’ve taken to prevent any spam in the future.”
Matt Cutts July 31, 2006: “I’m talking about JavaScript redirects used in a way to show users and search engines different content. You could also cloak and then use (meta refresh, 301/302) to be sneaky.”
Matt Cutts December 27, 2006 and December 28, 2006: “We have written about sneaky redirects in our webmaster guidelines for years. The specific part is ‘Don’t employ cloaking or sneaky redirects.’ We make our webmaster guidelines available in over 10 different languages … Ultimately, you are responsible for your own site. If a piece of shopping cart code put loads of white text on a white background, you are still responsible for your site. In fact, we’ve taken action on cases like that in the past. … If for example I did a search […] and saw a bunch of pages […], and when I clicked on one, I immediately got whisked away to a completely different url, that would be setting off alarm bells ringing in my head. … And personally, I’d be talking to the webshop that set that up (to see why on earth someone would put up pages like that) more than talking to the search engine.”

Matt Cutts heads Google’s Web spam team and has discussed these issues in many places since the stone age. Look at the dates above: penalties for cloaking / JS redirects are not a new thing. The answer to “It is as a result of the changes by Google, rather than a change we have made in the EROL code that some sites have dropped.” (Erol statement) is: just because you’ve gotten away with it for so long doesn’t mean that JS redirects are fine with Google. The cause of the mess is not a recent change of code; it’s the architecture itself which is considered a “cloaking / sneaky redirect” by Google. Google has recently improved its automated detection of client-sided redirects, not its guidelines. Considering that both Erol-created pages (the crawlable static page and the contents served by the URL invoked by the JS redirect) present similar contents, Google will have sympathy for all reinclusion requests, provided that the sites in question were made squeaky-clean beforehand.




Why eCommerce Systems Suck at SEO

Listening to whiners and disappointed site owners across the boards, I guess in a few weeks we’ll be discussing Google’s brand-new e-commerce penalties in flavors of -30, -900 and -supphell. NOT! A recent algo tweak may have figured out how to identify more crap, but I doubt Google has launched an anti-eCommerce campaign.

One doesn’t need an award-winning mid-range e-commerce shopping cart like Erol to earn the Google death penalty. Thanks to this award-winning software sold as “search engine friendly” on the home page, respectively its crappy architecture (sneaky JS redirects as per Google’s Webmaster Guidelines), many innocent shopping sites from Erol’s client list have vanished, or will be deindexed soon. Unbelievable when you read more about their so-called SEO Services. Oh well, so much for an actual example. The following comments do not address Erol shopping carts, but e-commerce systems in general.

My usual question when asked to optimize eCommerce sites is “are you willing to dump everything except the core shopping cart module?”. Unfortunately, that’s the best as well as the cheapest solution in most cases. The technical crux with eCommerce software is that it’s developed by programmers, not Web developers, and software shops don’t bother asking for SEO advice. The result is often fancy crap.

Another common problem is that the UI is optimized for shoppers (a subclass of ‘surfers’, the latter being decently emulated by search engine crawlers). Navigation is mostly shortcut- and search-driven (POST-created results are not crawlable) and relies on variables stored in cookies and wherever (invisible to spiders). All the navigational goodies which make the surfing experience are implemented with client-sided technologies, or, if put server-sided, served by ugly URLs with nasty session-IDs (ignored by crawlers, or at least heavily downranked for various reasons). What’s left for the engines? Deep hierarchical structures of thin pages plastered with duplicated text and buy-now links. That’s not the sort of spider food Ms. Googlebot and her colleagues love to eat.
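For contrast, here’s a minimal PHP sketch (not taken from any particular cart) of the crawler-friendly alternative: the session ID stays in a cookie instead of the URL, and navigation is rendered as plain GET links a spider can follow. The category slugs and the URL scheme are made up for illustration.

<?php
// keep the session ID in a cookie only, never appended to URLs
ini_set('session.use_trans_sid', 0);
session_name('shopsid');
session_start();

// hypothetical category data; a real store would pull this from its database
$categories = array('garden-tools', 'power-drills', 'work-gloves');

// plain GET links with static-looking URLs instead of POST forms or JS menus
foreach ($categories as $slug) {
    $label = ucwords(str_replace('-', ' ', $slug));
    print '<a href="/category/' . $slug . '.html">' . $label . '</a>' . "\n";
}
// a search box can stay for human shoppers, but it shouldn't be the only navigation
?>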

Guess why Google doesn’t crawl search results. Because search results are inedible spider fodder not worth indexing. The same goes for badly linked conglomerates of thin product pages. Think of a different approach. Instead of trying to shove thin product pages into search indexes, write informative pages on product lines/groups/… and link to the product pages within the text. When these well-linked info pages provide enough product details, they’ll rank for product-related search queries. And you’ll generate linkworthy content. Don’t forget to disallow /shop, /search and /products in your robots.txt.
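A robots.txt for that could look like the sketch below; the paths are just the generic ones from the paragraph above and have to match the shop’s actual URL structure:

User-agent: *
Disallow: /shop
Disallow: /search
Disallow: /products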

Disclaimer: I’ve checked essentialaids.com; Erol’s software does JavaScript redirects obfuscating the linked URLs to deliver the content client-sided. I’ve followed this case over a few days, watching Google deindex the whole site page by page. This kind of redirect is considered “sneaky” by Google, and Google’s spam filters detect it automatically. Although there is no bad intent, Google bans all sites using this technique. Since this is a key feature of the software, how can they advertise it as “search engine friendly”? From their testimonials (most are affiliates) I’ve looked at irishmusicmail.com and found that Google has indexed only 250 pages from well over 800; it looks like the Erol shopping system was removed. The other non-affiliated testimonial is from heroesforkids.co.uk, a badly framed site which is also not viewable without JavaScript. Due to SE-unfriendliness Google has indexed only 50 out of 190 pages (deindexing the site a few days later). Another reference, brambleandwillow.com, didn’t load at all; Google has no references, but I found Erol-styled URLs in Yahoo’s index. Next, pensdirect.co.uk suffers from the same flawed architecture as heroesforkids.co.uk, although the pages-to-indexed-URLs ratio is slightly better (15 of 40+). From a quick look at the Erol JS source, all pages will get removed from Google’s search index. I didn’t write that to slander Erol and its inventor Dreamteam UK, however much these guys would deserve it. It’s just a warning that good-looking software which might perfectly support all related business processes can be extremely destructive from an SEO perspective.

Update: It’s probably possible to make Erol-driven shops compliant with Google’s quality guidelines by creating the pages without a software feature called “page loading messages”. More information is provided in several posts in Erol’s support forums.



