Search Quality

Archived posts from the 'Search Quality' Category

Microsoft funding bankrupt Live Search experiment with porn spam

Posted on 16 November, 2007

If only this headline were linkbait … of course it’s not sarcastic.

Rumors are out that Microsoft will launch a porn affiliate program soon. The top secret code name for this project is “pornbucks”, but analysts say that it will be launched as “M$ SMUT CASH” next year or so.

Since Microsoft just can’t ship anything on time, and the usual delays aren’t communicated internally, their search dept. began to promote it to Webmasters this summer.

Surprisingly, Webmasters across the globe weren’t that excited to find promotional messages from Live Search in their log files, so a somewhat confused MSN dude posted a lame excuse to a large Webmaster forum.

Meanwhile we found out that Microsoft Live Search does not only target the adult entertainment industry; they’re testing the waters with other money terms like travel or pharmaceutical products too.

Anytime soon the Live Search menu bar will be updated to something like this:

Here is the sad -but true- story of a search engine’s downfall.

A few months ago Microsoft Live Search discovered that x-rated referrer spam is a must-have technique in a sneaky smut peddler’s marketing toolbox.

Since August 2007 a bogus Web robot follows Microsoft’s search engine crawler “MSNbot” to spam the referrer logs of all Web sites out there with URLs pointing to MSN search result pages featuring porn.

Read your referrer logs and you’ll find spam from Microsoft too, but perhaps they peeve you with viagra spam, offer you unwanted but cheap payday loans, or try to enlarge your penis. Of course they know every trick in the book on spam, so check for harmless catchwords too. Here is an example URL: http://search.live.com/results.aspx?q= spammy-keyword &mrt=en-us&FORM=LIVSOP

Microsoft’s spam bot not only leaves bogus URLs in log files, hoping that Webmasters will click them on their referrer stats pages and maybe sign up for something like “M$ Porn Bucks” or so. It even downloads and renders adverts powered by their rival Google, lowering their CTR; obviously to make programs like AdSense less attractive in comparison with Microsoft’s own ads (sorry, no link love from here).

Let’s look at Microsoft’s misleading statement:

The traffic you are seeing is part of a quality check we run on selected pages. While we work on addressing your concerns, we would request that you do not actively block the IP addresses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index.

That’s not traffic, that’s bot activity: These hits come within seconds of being indexed by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it’s genuine) and within a few seconds, the very same page is requested with a live.com search result URL as referer by the MSN spam bot faking a human visitor.
If that’s really a quality check to detect cloaking, that’s more than just lame. The IP addresses don’t change, the bogus bot uses a static user agent name, and there are other footprints which allow every cloaking script out there to serve this sneaky bot the exact same spider fodder that MSNbot got seconds before. This flawed technique might catch poor man’s cloaking every once in a while, but it can’t fool savvy search marketers.
The FUD “could prevent your site from being included in the Live Search index” is laughable, because in most niches MSN search traffic is nonexistent.
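If you want to check your own logs for the pattern described above, something along these lines might do. It is only a sketch: the entry layout (time, ua, url, referrer), the function name findShadowHits() and the ten second window are my assumptions, and you would need to parse your server's log format into that shape first.

```php
<?php
// Hypothetical entry shape, one array per logged request:
// ['time' => unix timestamp, 'ua' => string, 'url' => string, 'referrer' => string]

/**
 * Returns the entries that carry a search.live.com result page as referrer
 * and follow an MSNBot fetch of the same URL within $windowSeconds.
 */
function findShadowHits(array $entries, int $windowSeconds = 10): array
{
    // Process in chronological order so each crawl precedes its shadow hit.
    usort($entries, function ($a, $b) {
        return $a['time'] <=> $b['time'];
    });

    $lastCrawl = [];   // URL => time of the most recent MSNBot request
    $suspects  = [];

    foreach ($entries as $entry) {
        if (stripos($entry['ua'], 'msnbot') !== false) {
            $lastCrawl[$entry['url']] = $entry['time'];
            continue;
        }
        $fromLiveSearch = stripos($entry['referrer'], 'search.live.com/results.aspx') !== false;
        $crawledAt      = $lastCrawl[$entry['url']] ?? null;
        if ($fromLiveSearch && $crawledAt !== null
            && $entry['time'] - $crawledAt <= $windowSeconds) {
            $suspects[] = $entry;
        }
    }
    return $suspects;
}
```

A hit in the result list is a hint, not proof: a human who happens to click a Live Search result right after a crawl would be flagged too, so look at the IP addresses and user agent names of the suspects before you draw conclusions.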

All major search engines, including MSN, promise that they obey the robots exclusion standard. Obeying robots.txt is the holy grail of search engine crawling. A search engine that ignores robots.txt and other standardized crawler directives cannot be trusted. The crappy MSN bot doesn’t even bother to read robots.txt, so there’s no chance to block it with standardized methods. Only IP blocking can keep it out, but then it still seems to download ads from Google’s AdSense servers by executing the JavaScript code that the MSN crawler gathered before (not obeying Google’s AdSense robots.txt either).

This unethical spam bot downloading all images, external CSS and JS files, and whatnot also burns bandwidth. That’s plain theft.

Since this method cannot detect (most) cloaking, and the so called “search quality control bot” doesn’t stop visiting sites which obviously do not cloak, it is a sneaky marketing tool. Whether or not Microsoft Live Search tries to promote cyberspace porn and on-line viagra shops plays no role. Even spamming with safe-at-work keywords is evil. Do these assclowns really believe that such unethical activities will increase the usage of their tiny and pretty unpopular search engine? Of course they do, otherwise they would have shut down the spam bot months ago.

Dear reader, please tell me: what do you think of a search engine that steals (bandwidth and AdSense revenue), lies, spams away, and is not clever enough to stop their criminal activities when they’re caught?

Recently a Live Search rep whined in an interview because so many robots.txt files out there block their crawler:

One thing that we noticed for example while mining our logs is that there are still a fair number of sites that specifically only allow Googlebot and do not allow MSNBot.

There’s a suitable answer, though. Update your robots.txt:

    User-agent: MSNbot
    Disallow: /


9 comments Sebastian | Internet Marketing, MSN, Search Quality, Spam, Crawler Directives, Crap

Act out your sophisticated affiliate link paranoia

Posted on 13 November, 2007

My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google’s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:

Links purchased for advertising should be designated as such. This can be done in several ways, such as:
* Adding a rel=”nofollow” attribute to the <a> tag
* Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file

Also, Google doesn’t define paid links that clearly, so try this paid link definition instead before you read on. Here is my linking guide for the paranoid affiliate marketer.

Google recommends hiding any content provided by affiliate programs from their crawlers. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant’s data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program’s sales tools page count as duplicate content (snippet) in its worst occurrence.

Pasting code copied from a merchant’s site into a page’s or template’s HTML is not exactly a smart way to put ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you’re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts.

There are tons of such scripts out there, so I don’t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so when you’ve a WordPress blog you can adapt them with ease.

Cover your ass with a linking policy

Googlers as well as hired guns do review Web sites for violations of Google’s guidelines. Competitors might be in the mood to turn you in with a spam report or paid links report, too. A (prominently linked) full disclosure of your linking attitude can help to pass a human review by search engine staff. By the way, having a policy for dofollowed blog comments is also a good idea.

Since crawler directives like link condoms are for search engines (only), and those pay attention to your source code and hints addressing search engines like robots.txt, you should leave a note there too. Look into the source of this page for an example.

Block crawlers from your propaganda scripts

Put all your stuff related to advertising (scripts, images, movies…) in a subdirectory and disallow search engine crawling in your /robots.txt file:

    User-agent: *
    Disallow: /propaganda/

Of course you’ll use an innocuous name like “gnisitrevda” for this folder, which lacks a default document and can’t get browsed because you’ve an

    Options -Indexes

statement in your .htaccess file. (Watch out, Google knows what “gnisitrevda” means, so be creative or cryptic.)

Crawlers sent out by major search engines do respect robots.txt, hence it’s guaranteed that regular spiders don’t fetch anything from this folder. As long as you don’t cheat too much, you’re not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn’t protect you from surfing search engine staff, but I don’t tell you things you’d better hide from Matt’s gang.

Detect search engine crawlers

Basically there are three common methods to detect requests by search engine crawlers.

1) Testing the user agent name (HTTP_USER_AGENT) for strings like “Googlebot”, “Slurp”, “MSNbot” or so which identify crawlers. That’s easy to spoof, for example PrefBar for Firefox lets you choose from a list of user agents.
2) Checking the user agent name, and only when it indicates a crawler, verifying the requestor’s IP address with a reverse lookup, or against a cache of verified crawler IP addresses and host names.
3) Maintaining a list of all search engine crawler IP addresses known to man, checking the requestor’s IP (REMOTE_ADDR) against this list. (That alone isn’t bullet-proof, but I’m not going to write a tutorial on industrial-strength ~~cloaking~~ IP delivery, I leave that to the real experts.)
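The code in this post covers the user agent test and the DNS verification only. For completeness, here is a minimal sketch of the list-based check, assuming the list is kept as IPv4 CIDR ranges and PHP runs as a 64-bit build. The function names and the two ranges are placeholders of mine (the ranges come from the address blocks reserved for documentation), not real crawler addresses.

```php
<?php
/**
 * Checks whether an IPv4 address falls inside a CIDR range such as "192.0.2.0/24".
 * Assumes a 64-bit PHP build, where ip2long() returns non-negative integers.
 */
function ipInCidr(string $ip, string $cidr): bool
{
    $parts   = explode('/', $cidr, 2);
    $network = ip2long($parts[0]);
    $bits    = isset($parts[1]) ? (int) $parts[1] : 32;
    $address = ip2long($ip);

    if ($network === false || $address === false || $bits < 0 || $bits > 32) {
        return false;
    }
    // Build the netmask; a /0 range matches every address.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($address & $mask) === ($network & $mask);
}

/**
 * Returns true when $ip lies in at least one of the given CIDR ranges.
 */
function isKnownCrawlerIp(string $ip, array $ranges): bool
{
    foreach ($ranges as $cidr) {
        if (ipInCidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// Placeholder ranges for illustration only, NOT real crawler ranges.
$crawlerRanges = ['192.0.2.0/24', '198.51.100.16/28'];
```

You would call isKnownCrawlerIp($_SERVER["REMOTE_ADDR"], $crawlerRanges) early in the request. The hard part is not the check but keeping the list complete and current, which is why the list alone isn’t bullet-proof.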

For our purposes we use methods 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won’t activate your account without testing your links. When crawlers try to follow affiliate links on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have an impact on the merchants’ search engine rankings, and that’s evil in Google’s eyes.

We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name, if the user agent name identifies a major search engine’s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor’s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does a primitive caching in a flat file, so that once a crawler was verified on its very first content request, it can be detected from this cache to avoid pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn’t match a crawler name.

    // file system path to crawler IP log, scripts etc.,
    // without trailing slash:
    $includePath = $_SERVER["DOCUMENT_ROOT"] . "/propaganda";
    // edit "propaganda" and CHMOD 777 the directory !
    // file names:
    $crawlerIps = $includePath ."/crawler-ip-addresses.txt";
    // misc. stuff:
    $timestamp = date('Y-m-d H:i:s');
    $ipAddy = $_SERVER["REMOTE_ADDR"];
    // referrer and user agent are optional request headers
    $referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
    $userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
    $requestUri = $_SERVER["REQUEST_URI"];
    $queryString = $_SERVER["QUERY_STRING"];
    $isCrawler = FALSE;
    $crawlerServer = "";
    $delimiter = "|";
    $idString = "";
    if (empty($includePath)) {
        $includePath = $_SERVER["DOCUMENT_ROOT"] . "/propaganda"; // CHMOD 777
    }

    // Write a file to disk
    if (!function_exists("writeLocalFile")) {
        function writeLocalFile ($file, $content) {
            if (!is_writable($file)) {
                $lOk = @chmod ( $file, 0777 );
            }
            // file_put_contents() not avail in PHP 4.3x
            $fp = @fopen("$file","w+");
            if ($fp) {
                $lOk = @fwrite($fp, $content, strlen($content));
                @fclose($fp);
                // make sure file may get overwritten or removed later on
                $lOk = @chmod ( $file, 0777 );
                return TRUE;
            } // endif $fp
            return FALSE;
        } // end function writeLocalFile
    }

    // Test the user agent name; sets the expected crawler host name
    if (!function_exists("checkCrawlerUA")) {
        function checkCrawlerUA () {
            GLOBAL $userAgent;
            GLOBAL $crawlerServer;
            $crawlerServer = "";
            $crawlers = array("Googlebot","Mediapartners","Slurp","MSNbot","Ask","Teoma");
            foreach ($crawlers as $crawler) {
                if (stristr($userAgent,$crawler)) {
                    if (stristr($crawler,"Googlebot") || stristr($crawler,"Mediapartners")) {
                        $crawlerServer = ".googlebot.com";
                    } // Google
                    if (stristr($crawler,"Slurp")) {
                        $crawlerServer = ".crawl.yahoo.net";
                    } // Yahoo
                    if (stristr($crawler,"MSNbot")) {
                        $crawlerServer = ".search.live.com";
                    } // MSN/Live
                    if (stristr($crawler,"Ask") || stristr($crawler,"Teoma")) {
                        $crawlerServer = ".ask.com";
                    } // Ask
                }
            } // foreach crawlers
            if (!empty($crawlerServer)) return TRUE;
            return FALSE;
        } // end function checkCrawlerUA
    }

    // Verify the requestor's IP address via DNS, with a flat file cache
    if (!function_exists("checkCrawlerIP")) {
        function checkCrawlerIP ($idString) {
            GLOBAL $ipAddy;
            GLOBAL $crawlerIps;
            GLOBAL $delimiter;
            GLOBAL $timestamp;
            GLOBAL $userAgent;
            GLOBAL $crawlerServer;
            $isCrawler = checkCrawlerUA();
            if ($isCrawler === FALSE) return FALSE;
            if (empty($crawlerServer)) return FALSE;
            //
            // DEBUG: $crawlerServer = ".national-net.com";
            // Use your ISPs host name for testing with a spoofed user agent name
            //
            // IP already verified on an earlier request?
            $crawlerIpsContent = @file_get_contents($crawlerIps);
            if (!empty($crawlerIpsContent)) {
                if (stristr($crawlerIpsContent, "\n$ipAddy$delimiter")) {
                    return TRUE;
                }
            }
            // reverse lookup: IP address to host name
            $crawlerHost = @gethostbyaddr($ipAddy);
            // the host name must END with the expected crawler domain; a plain
            // substring test would accept "crawl.googlebot.com.example.net" too
            $hostTail = substr($crawlerHost, -strlen($crawlerServer));
            if (strtolower($hostTail) != strtolower($crawlerServer)) {
                return FALSE;
            }
            if ("$crawlerHost" == "$ipAddy") {
                return FALSE;
            }
            // forward lookup: the host name must resolve to the same IP address
            $ipAddyRev = @gethostbyname($crawlerHost);
            if ("$ipAddyRev" != "$ipAddy") {
                return FALSE;
            }
            // remember the verified crawler IP
            $crawlerIpsContent .= "\n" .$ipAddy .$delimiter .$timestamp .$delimiter .$crawlerHost .$delimiter .$idString .$delimiter .$userAgent .$delimiter;
            $lOk = writeLocalFile ($crawlerIps, $crawlerIpsContent);
            return TRUE;
        } // end function checkCrawlerIP
    }

Grab and implement the PHP source, then you can code statements like

    $isSpider = checkCrawlerUA ();
    ...
    if ($isSpider) {
        $relAttribute = " rel=\"nofollow\" ";
    }
    ...
    $affLink = "<a href=\"$affUrl\" $relAttribute>call for action</a>";

or

    $isSpider = checkCrawlerIP ($sponsorUrl);
    ...
    if ($isSpider) {
        // don't redirect to the sponsor, return a 403 or 410 instead
    }

More on that later.

Don’t deliver your advertising to search engine crawlers

It’s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense’s script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can’t guarantee indexing, because Google doesn’t index shitloads of templated pages which hide duplicate content as well as ads from crawling, without carrying a single piece of somewhat compelling content.

Here is how you could output a totally uncrawlable banner ad:

    ...
    $isSpider = checkCrawlerIP ($_SERVER["PHP_SELF"]);
    ...
    print "<div class=\"css-class-sidebar robots-nocontent\">";
    // output RSS buttons or so
    if (!$isSpider) {
        print "<script type=\"text/javascript\" src=\"http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner\"></script>";
        ...
    }
    ...
    print "</div>\n";
    ...

Let’s look at the code above. First we detect crawlers “without doubt” (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-’.crawl.yahoo.net’ host but another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I’ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler-UA).

Currently the robots-nocontent class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn’t conflict with class names used with your CSS, because each X/HTML element can have an unlimited list of space delimited class names. Like Google’s section targeting that’s a crappy crawler directive, though. However, it doesn’t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), “buy now”/“view basket” links or references to TOS pages and alike, templated text like terms of delivery (but not the street address provided for local search) … and of course ads.

Ads aren’t outputted when a crawler requests a page. Of course that’s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren’t relevant for indexing that’s not deceitful cloaking in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster’s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers.

Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren’t penalized. Since Web robots can’t pull out the plastic, there’s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (Example from Google: “Allow search bots to crawl your sites without session IDs or arguments that track their path through the site” is a call for search engine friendly URL cloaking.)
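The URL cloaking that Google’s quote calls for could look roughly like the sketch below: strip session and tracking variables from the requested URL, and when the requestor is a crawler and the cleaned URL differs, redirect to the clean one. The function name stripTrackingParams() and the parameter names in the usage note are assumptions for illustration; use whatever your site actually appends.

```php
<?php
/**
 * Removes the named query string parameters from a URL and returns the
 * cleaned URL. Parameter names are matched case-insensitively.
 * Note: user:password parts of a URL are dropped, and parse_str() converts
 * dots and spaces in parameter names to underscores.
 */
function stripTrackingParams(string $url, array $paramNames): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // nothing to strip
    }

    parse_str($parts['query'], $query);
    $blocked = array_map('strtolower', $paramNames);
    foreach (array_keys($query) as $name) {
        if (in_array(strtolower((string) $name), $blocked, true)) {
            unset($query[$name]);
        }
    }

    // Reassemble the URL from its remaining components.
    $clean = '';
    if (isset($parts['scheme'])) { $clean .= $parts['scheme'] . '://'; }
    if (isset($parts['host']))   { $clean .= $parts['host']; }
    if (isset($parts['port']))   { $clean .= ':' . $parts['port']; }
    $clean .= $parts['path'] ?? '';
    if ($query) { $clean .= '?' . http_build_query($query); }
    if (isset($parts['fragment'])) { $clean .= '#' . $parts['fragment']; }

    return $clean;
}
```

In a page script you would compare stripTrackingParams($_SERVER["REQUEST_URI"], array("PHPSESSID", "ref")) with the requested URI and, for verified crawlers only, answer with a permanent redirect to the cleaned URL when they differ. Users keep their session IDs, crawlers get one URL per page.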

Is hiding ads from crawlers “safe with Google” or not?

Cloaking ads away is a double edged sword from a search engine’s perspective. Way too strictly interpreted that’s against the cloaking rule which states “don’t show crawlers other content than humans”, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google’s section targeting or Yahoo’s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don’t count them as on-the-page contents.

Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don’t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify MFA pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings.

Since in theory the contents of JavaScript ads aren’t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can’t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can’t argue seriously that hiding such tiny signals -which they use for the sole purpose of possible downranks- is against their guidelines.

For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn’t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.

Dear search engines, if it does not count, then you cannot demand to keep it crawlable. Well, a few super mega white hat trolls might disagree, and depending on the implementation on individual sites maybe hiding ads isn’t totally riskless in any case, so decide yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don’t try to hide the fact that I run ads on this blog.

Usually I don’t argue with fair vs. unfair, because we talk about ~~war~~ business here, which means that anything goes. However, Google does everything to talk the whole Internet into ~~obfuscating~~ disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. (Of course that’s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won’t care about bad press. So please don’t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)

Outputting ads with JavaScript, preferably in iFrames

Delivering adverts with JavaScript does not mean that one can’t use server sided scripting to adjust them dynamically. With content management systems it’s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn’t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval’ing the PHP code).

Let’s see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:

And here is the JS call of the PHP script:

    <script type="text/javascript" src="http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner"></script>

The PHP script /propaganda/output.js.php evaluates the query string to pull the requested ad’s components. In case it’s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor’s location and whatnot, but that’s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad’s links. Once the HTML code is ready, it outputs a JavaScript statement:

    document.write('<a href="http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&adServed=banner" title="DOWNLOAD THE BOOK ON SEO!"><img src="http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif" width="468" height="60" border="0" alt="The only current book on SEO" title="The only current book on SEO" /></a>');

which the browser executes within the script tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag.
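The last step of such a script can be sketched as follows. This is not the actual output.js.php, just a minimal illustration; renderAdScript() and its parameters are names I made up. Instead of replacing quotes by hand it lets json_encode() produce the JavaScript string literal, which is a different technique than the one described above but sidesteps the quoting problem entirely.

```php
<?php
/**
 * Builds the JavaScript statement that writes a linked banner into the page.
 * When $isCrawler is true, the link carries rel="nofollow".
 */
function renderAdScript(string $href, string $imgSrc, string $altText, bool $isCrawler): string
{
    $rel  = $isCrawler ? ' rel="nofollow"' : '';
    $html = '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '"' . $rel . '>'
          . '<img src="' . htmlspecialchars($imgSrc, ENT_QUOTES) . '"'
          . ' width="468" height="60" border="0"'
          . ' alt="' . htmlspecialchars($altText, ENT_QUOTES) . '" /></a>';

    // json_encode() yields a valid JavaScript string literal, quotes included,
    // so single or double quotes in the markup cannot break the statement.
    return 'document.write(' . json_encode($html) . ');';
}
```

The calling script would send a Content-Type header for JavaScript, decide on $isCrawler with checkCrawlerUA(), and print the returned statement. By default json_encode() also escapes forward slashes, so a closing tag inside the markup can’t end the surrounding script element prematurely.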

Matt Cutts said that JavaScript links don’t prevent Googlebot from crawling, but that those links don’t count for rankings (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can’t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it’s fetchable by crawlers, so I wouldn’t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That’s why I use rel-nofollow -on crawler requests- even in JS ads.

Change your user agent name to Googlebot or so, install Matt’s show nofollow hack or something similar, and you’ll see that the affiliate-URL gets nofollow’ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can’t offend my readers with nofollow’ed links (although I don’t care about crawler spoofing, actually that’s a good procedure to let advertisers check out my linking attitude).

We look at the affiliate URL from the code above later on. First, let’s discuss other ways to make ads more search engine friendly. Search engines don’t count pages displayed in iFrames as on-page contents, especially not when the iFrame’s content is hosted on another domain. Here is an example straight from the horse’s mouth:

    <iframe name="google_ads_frame" src="http://pagead2.googlesyndication.com/pagead/ads?very-long-and-ugly-query-string" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" frameborder="0" height="90" scrolling="no" width="728"></iframe>

In a noframes tag we could put a static ad for surfers using browsers which don’t support frames/iFrames.

If for some reason you don’t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to anybody, and just browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael’s Full Text/HTML Obfuscator Tool (hat tip to Donna).
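To illustrate the idea (this is not how Michael’s tool works, just about the simplest encoding I can think of), the following sketch percent-encodes every byte of an ad’s markup and lets the browser decode it. The function name obfuscateForJs() is made up, and the markup is assumed to be UTF-8.

```php
<?php
/**
 * Percent-encodes every byte of $html and wraps the result in a JavaScript
 * statement that decodes and writes it when the browser runs the script.
 */
function obfuscateForJs(string $html): string
{
    $encoded = '';
    $length  = strlen($html);
    for ($i = 0; $i < $length; $i++) {
        // each byte becomes %XX, so no readable markup or URL remains
        $encoded .= '%' . strtoupper(bin2hex($html[$i]));
    }
    return 'document.write(decodeURIComponent("' . $encoded . '"));';
}
```

Bear in mind that this is obfuscation, not encryption. Anything that executes JavaScript, or anybody who bothers to decode the string, sees the ad. It only keeps robots away that read the source without rendering it.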

Always redirect to affiliate URLs

There’s absolutely no point in using ugly affiliate URLs on your pages. Actually, that’s the last thing you want to do for various reasons.

For example, affiliate URLs as well as source codes can change, and you don’t want to edit tons of pages if that happens.
When an affiliate program doesn’t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you’d replace outdated ads completely with one mouse click or so.
Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own -often confusing- layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don’t track your outgoing traffic, some sponsors will cheat and you can’t prove your gut feelings.
Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don’t smell like affiliate URLs.
…

When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really manage your affiliate links. There are many more good reasons for utilizing ad-servers, for example smart search engines which might think that your advertising is overwhelming.

Affiliate links provide great footprints. Unique URL parts or query string variable names gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The values identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index or delivering minus points to the query engine. Also, that doesn’t mean that affiliate URLs are the one and only trackable footprint Google relies on. But that’s one trackable footprint you can avoid to some degree.
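To find such footprints on your own pages before a search engine does, you could scan the HTML you output for links carrying typical affiliate variables. The sketch below is an audit helper under my own assumptions: findAffiliateLinks() is a made-up name, the regular expression only catches quoted href attributes, and the parameter names in the usage note are examples, not a list any search engine is known to use.

```php
<?php
/**
 * Returns all href values in $html whose query string contains one of the
 * given parameter names (matched case-insensitively).
 */
function findAffiliateLinks(string $html, array $paramNames): array
{
    $found = [];
    // Simple pattern for quoted href attributes in anchor tags.
    if (!preg_match_all('/<a\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']/i', $html, $matches)) {
        return $found;
    }

    $wanted = array_map('strtolower', $paramNames);
    foreach ($matches[1] as $href) {
        // Decode &amp; etc. before splitting the query string.
        $query = parse_url(html_entity_decode($href), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string or malformed URL
        }
        parse_str($query, $params);
        foreach (array_keys($params) as $name) {
            if (in_array(strtolower((string) $name), $wanted, true)) {
                $found[] = $href;
                break;
            }
        }
    }
    return $found;
}
```

Run it over a rendered page with something like array("affid", "aff_id") and every link it returns is a candidate for masking with a redirect script. Affiliate IDs that sit in the URL path instead of the query string slip through, so treat an empty result with caution.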

If the redirect-script’s location is on the same server (in fact it’s not thanks to symlinks) and not named “adserver” or so, chances are that a heuristic check won’t identify the link’s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren’t crawlable with parameters, or similar “legit” stuff. However, you can’t fool smart algos forever, but if you’ve a good reason to hide ads every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine’s point of view, and users might agree on this).

Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because it doesn’t benefit you (usually). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant’s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect, if they by accident found an affiliate link without link condom.

The behavior of an adserver URL masking an affiliate link

Let’s look at the redirect-script’s URL from my code example above:
/propaganda/router.php?adName=seobook&adServed=banner
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert.
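How the click gets stored is up to you. As a minimal sketch, assuming a flat file with one delimited line per click and the referrer as the URL of the page carrying the advert (the page title isn’t part of the request, so it would have to be looked up elsewhere), the line could be built like this. buildClickLogLine() and the field order are my assumptions.

```php
<?php
/**
 * Builds one delimited log line for an ad click from the request's query
 * variables, the referring page and a timestamp.
 */
function buildClickLogLine(array $query, string $referrer, string $timestamp, string $delimiter = '|'): string
{
    $fields = [
        $timestamp,
        $query['adName']   ?? '',   // which affiliate link was clicked
        $query['adServed'] ?? '',   // which sort/type/variation of the ad
        $referrer,                  // URL of the page carrying the advert
    ];

    // Strip the delimiter and line breaks from each field so a crafted
    // request cannot forge extra columns or rows.
    $clean = array_map(function ($value) use ($delimiter) {
        $text = is_scalar($value) ? (string) $value : '';
        return str_replace([$delimiter, "\r", "\n"], ' ', $text);
    }, $fields);

    return implode($delimiter, $clean) . "\n";
}
```

In router.php you would pass $_GET, the referrer and date('Y-m-d H:i:s'), then append the result to your log, for example with file_put_contents() and the FILE_APPEND | LOCK_EX flags on PHP 5.1 or later.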

Now that we’ve covered the statistical requirements, router.php calls the checkCrawlerIP() function, which sets $isSpider to TRUE only when both the user agent and the host name of the requestor’s IP address identify a search engine crawler, and a forward DNS lookup of that host name returns the requestor’s IP addy.

If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor’s landing page:

    $sponsorUrl = "http://www.seobook.com/262.html";
    $requestProtocol = $_SERVER["SERVER_PROTOCOL"];
    $protocolArr = explode("/",$requestProtocol);
    $protocolName = trim($protocolArr[0]);
    $protocolVersion = trim($protocolArr[1]);
    if (stristr($protocolName,"HTTP") && strtolower($protocolVersion) > "1.0" ) {
        $httpStatusCode = 307;
    }
    else {
        $httpStatusCode = 302;
    }
    $httpStatusLine = "$requestProtocol $httpStatusCode Temporary Redirect";
    @header($httpStatusLine, TRUE, $httpStatusCode);
    @header("Location: $sponsorUrl");
    exit;

A 307 redirect avoids caching issues, because user agents must not cache a 307 response unless a Cache-Control or Expires header explicitly allows it. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. Some user agents cache the locations provided by 302 redirects, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can’t use the 301 response code, because 301 redirects are cacheable by default. (More information on HTTP redirects.)
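If you don’t want to rely on the status code alone, you can tell user agents explicitly not to cache the redirect, which also covers the 302 fallback. The following sketch assembles the header lines for both cases; buildRedirectHeaders() is a made-up helper, and sending Cache-Control plus Pragma is my suggestion, not part of the router.php shown above.

```php
<?php
/**
 * Returns the header lines for a temporary redirect that should not be
 * cached: 307 for HTTP/1.1 and later, 302 for anything older or unknown.
 */
function buildRedirectHeaders(string $serverProtocol, string $targetUrl): array
{
    $parts   = explode('/', $serverProtocol, 2);
    $name    = strtoupper(trim($parts[0]));
    $version = isset($parts[1]) ? trim($parts[1]) : '1.0';

    // Treat anything that doesn't look like HTTP/x.y as HTTP/1.0.
    if ($name !== 'HTTP' || !preg_match('/^\d+(\.\d+)?$/', $version)) {
        $name    = 'HTTP';
        $version = '1.0';
    }

    $isHttp11 = version_compare($version, '1.1', '>=');
    $status   = $isHttp11 ? 307 : 302;
    $reason   = $isHttp11 ? 'Temporary Redirect' : 'Found';

    return [
        "$name/$version $status $reason",
        'Cache-Control: no-cache, no-store, must-revalidate', // HTTP/1.1 clients
        'Pragma: no-cache',                                   // HTTP/1.0 clients
        "Location: $targetUrl",
    ];
}
```

Each returned line goes through header() in order, for example foreach (buildRedirectHeaders($_SERVER["SERVER_PROTOCOL"], $sponsorUrl) as $line) { header($line); } followed by exit. Whether every old browser honors these headers on a redirect is another question, so keep the 307 where you can.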

If the requestor is a major search engine’s crawler, we perform the most brutal bounce back known to man:

if ($isSpider) {
    @header("HTTP/1.1 403 Sorry Crawlers Not Allowed", TRUE, 403);
    @header("X-Robots-Tag: nofollow,noindex,noarchive");
    exit;
}

The 403 response code translates to “kiss my ass and get the fuck outta here”. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn’t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine “forget this URL, don’t request it again”. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response and the 410-Gone return code prevent URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.

If you don’t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they otherwise would count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don’t need to live in fear of penalties.

Recap

Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.

Don’t deliver different (editorial) contents to users and crawlers, but also don’t serve ads to crawlers. They just don’t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards that understand Ajax and are authorized to fill out and submit Web forms.

Your ads look plain ugly with dotted borders in firebrick, hence don’t apply rel=”nofollow” to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.

No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers.

Do not allow search engine crawlers to follow your affiliate links, paid links, nor other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google’s job is not exactly funny.

I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you’re not paranoid, that does not mean Google will not chase you down. You really don’t need to be a so-called black hat to displease Google. Not knowing or not understanding Google’s 12 commandments doesn’t prevent you from being spanked for sins you’ve never heard of. If you’re keen on Google’s nicely targeted traffic, better play by Google’s rules, leastwise on crawler requests.

Feel free to contribute your tips and tricks in the comments.

11 comments Sebastian | Search Quality, Risky Linkage, Web development, X-Robots-Tag, Redirects, Paid Links, Crawler Directives, SEO, Google, robots.txt, E-Commerce, Cloaking, Nofollow

Internet marketing is one big popularity contest, and that’s not a good thing

Posted on 9 November, 2007

This is a guest post by Tanner Christensen.

What are you doing to make Internet marketing a better industry to be a part of? As it sits now: Internet marketing is one big popularity contest, and that’s not a good thing. Internet marketers are making it nearly impossible for the average person to find valuable content.

The real online content providers - the websites who deserve all of your attention - are becoming harder and harder to discover because of Internet marketers like us. Though Internet marketers - both you and I - can’t really be blamed, our job is all about getting attention. The more attention we get for our website(s), the more popular our website(s) become, the more money we can make.

But because of the recent surge of interest in Internet marketing and search engine optimization, websites that focus on providing content - rather than getting attention - are being ignored. And because these content-focused websites are being cast into the shadows of attention-focused websites, they too are jumping on the Internet marketing popularity contest bandwagon.

Even though every webmaster and his or her mother is jumping on the bandwagon, it’s not accurate to say that Internet marketers are making all less-important, less-helpful, and less-useful websites more popular than really helpful websites, but there is definitely the possibility of real news and information being masked by attention-seeking content.

So what do we do? What do Internet marketers and search engine optimizers do to make sure that the Internet popularity contest doesn’t become a contest of lies and attention-seeking tactics; but rather a contest of quality, helpful, interesting, important, groundbreaking content?

The first step is to become a part of the online community. I’m not talking about the Internet marketing community - it’s biased in a lot of ways. I’m talking about the real online communities. Doing so will help create a universal feeling of online morals; or what’s good information and what is bad information.

And discovering where the real helpful and important websites are online will help Internet marketers such as ourselves learn where the websites we work with really should be ranked.

Sure, there are still those people who don’t care about quality of content and only care about the almighty dollar sign. But poor content will eventually catch up with them, when websites that really deserve attention in the online popularity contest are lost in the fold and the dollar sign loses its value.

Tanner is a Web specialist and designer who writes helpful, inspiring, and creative internet-related articles. A while ago I contributed an article to his blog Internet Hunger: The anatomy of a debunking post. I think “can aggressive SMO tactics push crap on the long haul” would be an interesting, and related discussion. I mean, search engines evolve too, not only in Web search, so kinda fair rankings of well linked crap as well as good stuff not on the SM radar might be possible to some extent.

9 comments Sebastian | Internet Marketing, Folks, Search Quality

A pragmatic defence against Google’s anti paid links campaign

Posted on 26 October, 2007

Google’s recent shot across the bows of a gazillion sites handling paid links, advertising, or internal cross links not compliant to Google’s imagination of a natural link is a call for action. Google’s message is clear: “condomize your commercial links or suffer” (from deducted toolbar PageRank, links without the ability to pass real PageRank and relevancy signals, or perhaps even penalties).

Of course that’s somewhat evil, because applying nofollow values to all sorts of links is not exactly a natural thing to do; visitors don’t care about invisible link attributes and sometimes they’re even pissed when they get redirected to a URL not displayed in their status bar. Also, this requirement forces Webmasters to invest enormous efforts in code maintenance for the sole purpose of satisfying search engines. The argument “if Google doesn’t like these links, then they can discount them in their system, without bothering us” has its merits, but unfortunately that’s not the way Google’s cookie crumbles for various reasons. Hence let’s develop a pragmatic procedure to handle those links.

The problem

Google thinks that uncondomized paid links as well as commercial links to sponsors or affiliated entities aren’t natural, because the terms “sponsor|pay for review|advertising|my other site|sign-up|…” and “editorial vote” are not compatible in the sense of Google’s guidelines. This view at the Web’s linkage is pretty black vs. white.

Either you link out because a sponsor bought ads, or you don’t sell ads and link out for free because you honestly think your visitors will like a page. Links to sponsors without condom are black, links to sites you like and which you don’t label “sponsor” are white.

There’s nothing in between; gray areas like links to hand picked sponsors on a page with a gazillion links count as black. Google doesn’t care whether or not your clean links actually pass a reasonable amount of PageRank to link destinations which buy ad space too. The mere possibility that those links could influence search results is enough to qualify you as sort of a link seller.

The same goes for paid reviews on blogs and whatnot, see for example Andy’s problem with his honest reviews which Google classifies as paid links, and of course all sorts of traffic deals, affiliate links, banner ads and stuff like that.

You don’t even need to label a clean link as advert or sponsored. If the link destination matches a domain in Google’s database of on-line advertisers, link buyers, e-commerce sites / merchants etcetera, or Google figures out that you link too much to affiliated sites or other sites you own or control, then your toolbar PageRank is toast and most probably your outgoing links will be penalized. Possibly these penalties have impact on your internal links too, which results in less PageRank landing on subsidiary pages. Less PageRank gathered by your landing pages means less crawling, less ranking, fewer SERP referrers, less revenue.

The solution

You’re absolutely right when you say that such search engine nitpicking should not force you to throw nofollow crap on your links like confetti. From your and my point of view condomizing links is wrong, but sometimes it’s better to pragmatically comply with such policies in order to stay in the game.

Although uncrawlable redirect scripts have advantages in some cases, the simplest procedure to condomize a link is the rel-nofollow microformat. Here is an example of a googlified affiliate link:
<a href="http://sponsor.com/?affID=1" rel="nofollow">Sponsor</a>

Why serve your visitors search engine crawler directives?

Complying with Google’s laws does not mean that you must deliver crawler directives like rel="nofollow" to your visitors. Since Google is concerned about search engine rankings influenced by uncondomized links with commercial intent, serving crawler directives to crawlers and clean links to users is perfectly in line with Google’s goals. Actually, initiatives like the X-Robots-Tag make clear that hiding crawler directives from users is fine with Google. To underline that, here is a quote from Matt Cutts:

[…] If you want to sell a link, you should at least provide machine-readable disclosure for paid links by making your link in a way that doesn’t affect search engines. […]

The other best practice I’d advise is to provide human readable disclosure that a link/review/article is paid. You could put a badge on your site to disclose that some links, posts, or reviews are paid, but including the disclosure on a per-post level would better. Even something as simple as “This is a paid review” fulfills the human-readable aspect of disclosing a paid article. […]

Google’s quality guidelines are more concerned with the machine-readable aspect of disclosing paid links/posts […]

To make sure that you’re in good shape, go with both human-readable disclosure and machine-readable disclosure, using any of the methods [uncrawlable redirects, rel-nofollow] I mentioned above.
[emphasis mine]

Since Google devalues paid links anyway, search engine friendly cloaking of rel-nofollow for Googlebot is a non-issue with advertisers, as long as this fact is disclosed. I bet most link buyers look at the magic green pixels anyway, but that’s their problem.

How to cloak rel-nofollow for search engine crawlers

I’ll discuss a PHP/Apache example, but this method is easily adaptable to other server-side scripting languages like ASP or so. If you’ve a static site and PHP is available on your (*ix) host, you need to tell Apache that you’re using PHP in .html (.htm) files. Put this statement in your root’s .htaccess file:
AddType application/x-httpd-php .html .htm

Next create a plain text file, insert the code below, and upload it as “funct_nofollow.php” or so to your server’s root directory (or a subdirectory, but then you need to change some code below).

<?php
// Returns a REL attribute for an A element, or an empty string.
// Crawler directives are added only when the requestor looks like a crawler.
function makeRelAttribute ($linkClass) {
    // optional 2nd input parameter: $relValue
    $relValue = "";
    if (func_num_args() >= 2) {
        $relValue = func_get_arg(1) ." ";
    }

    // Did the visitor come from a search result page? (the referrer may be missing)
    $referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
    $refUrl = parse_url($referrer);
    $refHost = (is_array($refUrl) && isset($refUrl["host"])) ? $refUrl["host"] : "";
    $isSerpReferrer = FALSE;
    if (stristr($refHost, "google.") || stristr($refHost, "yahoo."))
        $isSerpReferrer = TRUE;

    // Does the user agent name a search engine crawler?
    $userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
    $isCrawler = FALSE;
    if (stristr($userAgent, "Googlebot") || stristr($userAgent, "Slurp"))
        $isCrawler = TRUE;

    if ($isCrawler /*|| $isSerpReferrer*/ ) {
        if ("$linkClass" == "ad") $relValue .= "advertising nofollow";
        if ("$linkClass" == "paid") $relValue .= "sponsored nofollow";
        if ("$linkClass" == "own") $relValue .= "affiliated nofollow";
        if ("$linkClass" == "vote") $relValue .= "editorial dofollow";
    }

    $relValue = trim($relValue);
    if (empty($relValue)) return "";
    return " rel=\"" .$relValue ."\" ";
} // end function makeRelAttribute
?>

Next put the code below in a PHP file you’ve included in all scripts, for example header.php. If you’ve static pages, then insert the code at the very top.
<?php @include($_SERVER["DOCUMENT_ROOT"] ."/funct_nofollow.php"); ?>
Do not paste the function makeRelAttribute itself! If you spread code this way you’ve to edit tons of files when you need to change the functionality later on.

Now you can use the function makeRelAttribute($linkClass,$relValue) within the scripts or HTML pages. The function has an input parameter $linkClass and knows the (self-explanatory) values “ad”, “paid”, “own” and “vote”. The second (optional) input parameter is a value for the A element’s REL attribute itself. If you provide it, the crawler directives get appended to it, or, if makeRelAttribute doesn’t detect a spider, it creates a REL attribute with just this value. Examples below. You can add more user agents, or serve rel-nofollow to visitors coming from SERPs by enabling the || $isSerpReferrer condition (remove the /* and */ comment markers around it).

When you code a hyperlink, just add the function to the A tag. Here is a PHP example:
print "<a href=\"http://google.com/\"" .makeRelAttribute("ad") .">Google</a>";
will output
<a href="http://google.com/" rel="advertising nofollow" >Google</a>
when the user agent is Googlebot, and
<a href="http://google.com/">Google</a>
to a browser.

If you can’t write nice PHP code, for example because you’ve to follow crappy guidelines and worst practices with a WordPress blog, then you can mix HTML and PHP tags:
<a href="http://search.yahoo.com/"<?php print makeRelAttribute("paid"); ?>>Yahoo</a>

Please note that this method is not safe with search engines or unfriendly competitors when you want to cloak for other purposes. Also, the link condoms are served to crawlers only, that means search engine staff reviewing your site with a non-crawler user agent name won’t spot the nofollow’ed links unless they check the engine’s cached page copy. An HTML comment in HEAD like “This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.” as well as a similar comment line in robots.txt would certainly help to pass reviews by humans.

A Google-friendly way to handle paid links, affiliate links, and cross linking

Load this page with different user agents and referrers. You can do this for example with a Firefox extension like PrefBar. For testing purposes you can use these user agent names:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
and these SERP referrer URLs:
http://google.com/search?q=viagra
http://search.yahoo.com/search?p=viagra&ei=utf-8&iscqry=&fr=sfp
Just enter these values in PrefBar’s user agent and referrer spoofing options (click “Customize” on the toolbar, select “User Agent” / “Referrerspoof”, click “Edit”, add a new item, label it, then insert the strings above). Here is the code above in action:

Referrer URL:	http://sebastians-pamphlets.com/category/search-quality/
User Agent Name:	Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
Ad makeRelAttribute("ad"):	Google
Paid makeRelAttribute("paid"):	Yahoo
Own makeRelAttribute("own"):	Sebastian’s Pamphlets
Vote makeRelAttribute("vote"):	The Link Condom
External makeRelAttribute("", "external"):	W3C `rel="external"`
Without parameters makeRelAttribute(""):	Sphinn

When you change your browser’s user agent to a crawler name, or fake a SERP referrer, the REL value will appear in the right column.

When you’ve developed a better solution, or when you’ve a nofollow-cloaking tutorial for other programming languages or platforms, please let me know in the comments. Thanks in advance!

20 comments Sebastian | Paid Links, Risky Linkage, Search Quality, Crawler Directives, Cloaking, Google, SEO, Nofollow

Google says you must manage your affiliate links in order to get indexed

Posted on 12 September, 2007

I’ve worked hard to overtake the SERP positions of a couple of merchants allowing me to link to them with an affiliate ID, and now the almighty Google tells the sponsors they must screw me with internal 301 redirects to rescue their rankings. Bugger. Since I read the shocking news on Google’s official Webmaster blog this morning I worked on a counter strategy, with success. Affiliate programs will not screw me, not even with Google’s help. They’ll be hoist by their own petard. I’ll strike back with nofollow and I’ll take no prisoners.

Seriously, the story reads a little different and is not breaking news at all. Maile Ohye from Google just endorsed best practices I’ve recommended for ages. Here is my recap.

The problem

Actually, there are problems on both sides of an affiliate link. The affiliate needs to hide these links from Google to avoid a so called “thin affiliate site penalty”, and the affiliate program suffers from duplicate content issues, link juice dilution, and often even URL hijacking by affiliate links.

Diligent affiliates gathering tons of PageRank on their pages can “unintentionally” overtake URLs on the SERPs by fooling the canonicalization algos. When Google discovers lots of links from strong pages on different hosts pointing to http://sponsor.com/?affid=me and this page adds ?affid=me to its internal links, my URL on the sponsor’s site can “outrank” the official home page, or landing page, http://sponsor.com/. When I choose the right anchor text, Google will feed my affiliate page with free traffic, whilst the affiliate program’s very own pages don’t exist on the SERPs.

Managing incoming affiliate links (merchants)

The best procedure is capturing all incoming traffic before a single byte of content is sent to the user agent, extracting the affiliate ID from the URL, storing it in a cookie, then 301-redirecting the user agent to the canonical version of the landing page, that is a page without affiliate or user specific parameters in the URL. That goes for all user agents (humans accepting the cookie and Web robots which don’t accept cookies and start a new session with every request).

Users not accepting cookies are redirected to a version of the landing page blocked by robots.txt, the affiliate ID sticks with the URLs in this case. Search engine crawlers, identified by their user agent name or whatever, are treated as users and shall never see (internal) links to URLs with tracking parameters in the query string.

This 301 redirect passes all the link juice, that is PageRank & Co. as well as anchor text, to the canonical URL. Search engines can no longer index page versions owned by affiliates. (This procedure doesn’t prevent you from 302 hijacking where your content gets indexed under the affiliate’s URL.)
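The merchant-side procedure described above could look roughly like this. It’s a minimal sketch under assumptions, not a finished implementation: the function names, the cookie name “affiliate”, the 30 day lifetime, and the parameter name “affid” (borrowed from the example URL above) are made up for illustration, and the special handling of users who reject cookies is left out.

```php
<?php
// Hypothetical helper: split a request URI into the affiliate ID (or NULL)
// and the canonical URI without the tracking parameter.
function splitAffiliateUrl($requestUri, $paramName = "affid") {
    $parts = parse_url($requestUri);
    if (!is_array($parts)) $parts = array();
    $path = isset($parts["path"]) ? $parts["path"] : "/";

    $query = array();
    if (isset($parts["query"])) parse_str($parts["query"], $query);

    // Pull the affiliate parameter out, regardless of its spelling (affid, affID, ...).
    $affiliateId = NULL;
    foreach ($query as $key => $value) {
        if (strtolower((string) $key) === strtolower($paramName) && is_string($value)) {
            $affiliateId = $value;
            unset($query[$key]);
        }
    }

    $canonical = $path;
    if (count($query) > 0) $canonical .= "?" . http_build_query($query);
    return array($affiliateId, $canonical);
}

// Hypothetical entry point: must run before a single byte of content is sent.
function redirectAffiliateRequest($host = "http://sponsor.com") {
    list($affiliateId, $canonical) = splitAffiliateUrl($_SERVER["REQUEST_URI"]);
    if ($affiliateId === NULL) return;  // nothing to clean up
    setcookie("affiliate", $affiliateId, time() + 30 * 86400, "/");
    header("Location: " . $host . $canonical, TRUE, 301);
    exit;
}
```

Calling redirectAffiliateRequest() at the very top of the landing page script sends humans and crawlers alike to the clean URL, so only the canonical version can gather links.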

Putting safe affiliate links (online marketers)

Honestly, there’s no such thing as a safe affiliate link, at least not safe with regard to picky search engines. Masking complex URLs with redirect services like tinyurl.com or so doesn’t help, because the crawlers get the real URL from the redirect header and will leave a note in the record of the original link on the page carrying the affiliate link. Anyways, the tiny URL will fool most visitors, and if you own the redirect service it makes managing affiliate links easier.

Of course you can cloak the hell out of your thin affiliate pages by showing the engines links to authority pages whilst humans get the ads, but then better forget the Google traffic (I know, I know … cloaking still works if you can handle it properly, but not everybody can handle the risks so better leave that to the experts).

There’s only one official approach to make a page plastered with affiliate links safe with search engines: replace it with a content rich page (of course Google wants unique and compelling content and checks its uniqueness), then sensibly work in the commercial links. Ideally, link to the merchants from within the content, apply rel-nofollow to all affiliate links, and avoid banner farms in the sidebars and above the fold.

Update: I’ve sanitized the title, “Google says you must screw your affiliates in order to get indexed” was not one of my best title baits.

23 comments Sebastian | Search Quality, Duplicate Content, Redirects, Internet Marketing, Risky Linkage, Paid Links, SEO, E-Commerce, Cloaking, Google

NOPREVIEW - The missing X-Robots-Tag

Posted on 7 August, 2007

Google provides previews of non-HTML resources listed on their SERPs:
These “view as text” and “view as HTML” links are pretty useful when you for example want to scan a PDF document before you clutter your machine’s RAM with 30 megs of useless digital rights management (aka Adobe Reader). You can view contents even when the corresponding application is not installed, Google’s transformed previews should not stuff your maiden box with unwanted malware, etcetera. However, under some circumstances it would make sound sense to have a NOPREVIEW X-Robots-Tag, but unfortunately Google hasn’t introduced it yet.

Google is rightfully proud of their capability to transform various file formats to readable HTML or plain text: Adobe Portable Document Format (pdf), Adobe PostScript (ps), Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku), Lotus WordPro (lwp), MacWrite (mw), Microsoft Excel (xls), Microsoft PowerPoint (ppt), Microsoft Word (doc), Microsoft Works (wks, wps, wdb), Microsoft Write (wri), Rich Text Format (rtf), Shockwave Flash (swf), of course Text (ans, txt) plus a couple of “unrecognized” file types like XML. New formats are added from time to time.

According to Adam Lasnik currently there is no way for Webmasters to tell Google not to include the “View as HTML” option. You can try to fool Google’s converters by messing up the non-HTML resource in a way that a sane parser can’t interpret it. Actually, when you search a few minutes you’ll find e.g. PDF files without the preview links on Google’s SERPs. I wouldn’t consider this attempt a bullet-proof or future-proof tactic though, because Google is pretty intent on improving their conversion/interpretation process.

I like the previews not only because sometimes they allow me to read documents behind a login screen. That’s a loophole Google should close as soon as possible. When for example PDF documents or Excel sheets are crawlable but not viewable for searchers (at least not with the second click) that’s plain annoying both for the site as well as for the search engine user.

With HTML documents the Webmaster can apply a NOARCHIVE crawler directive to prevent non paying visitors from lurking via Google’s cached page copies. Thanks to the newish REP header tags one can do that with non-HTML resources too, but neither NOARCHIVE nor NOSNIPPET etches away the “view as HTML” link.
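For non-HTML resources delivered by a script, sending those REP directives could look like the sketch below. The function names and the file handling are assumptions for illustration. The directives are the existing NOARCHIVE and NOSNIPPET values, which, as said, won’t remove the preview link.

```php
<?php
// Hypothetical helper: build the X-Robots-Tag header line from a list of directives.
function robotsHeaderLine(array $directives) {
    return "X-Robots-Tag: " . implode(", ", $directives);
}

// Hypothetical delivery script for a PDF: headers first, then the file itself.
function servePdf($file) {
    header("Content-Type: application/pdf");
    header(robotsHeaderLine(array("noarchive", "nosnippet")));
    readfile($file);
}
```

If such a thing as NOPREVIEW existed, it would just be one more value in that array.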

<speculation>Is the lack of a NOPREVIEW crawler directive just an oversight, or is it stuck in the pipeline because Google is working on supplemental components and concepts? Google’s yet inconsistent handling of subscription content comes to mind as an ideal playground for such a robots directive in combination with a policy change.</speculation>

Anyways, there is a need for a NOPREVIEW robots tag, so why not implement it now? Thanks in advance.

4 comments Sebastian | X-Robots-Tag, Search Quality, Robots Meta Tags, Crawler Directives, Google

SEOs home alone - Google’s nightmare

Posted on 1 August, 2007

Being a single parent of three monsters at the moment brings me newish insights. I now deeply understand the pain of father Google dealing with us, and what doing the chores all day long means to Matt’s gang in building 43, Dublin, and wherever. What a nightmare of a household.

If you don’t suffer from an offspring plague you won’t believe what sneaky and highly intelligent monsters having too much time on their tiny greedy hands will do to gain control over their environment. Outsmarting daddy is not a hobby, it’s their mission, and everything in perfect order is attackable. Each of them tries to get as much attention as possible, and if nothing helps, negative attention is fine too. There’s no such thing as bad traffic, err … mindfulness.

Every rule is breakable, and there’s no way to argue seriously with a cute 5 yo gal burying her 3 yo brother in the mud whilst honestly telling me that she has nothing to do with the dirty laundry because she never would touch anything hanging on the clothesline. Then my little son speaks out telling me that’s all her fault, so she promises to do it never, never, never again in her whole life and even afterwards. In such a situation I don’t have that many options: I archive my son’s paid links report, accept her reconsideration request but throttle her rankings for a while, recrawl and remove the unpurified stuff from the … Oups … I clear the scene with a pat on her muddy fingers, forgive all blackhatted kids involved in the scandal and just do the laundry again, writing a note to myself to improve the laundry algo in a way that muddy monsters can’t touch laundered bed sheets again.

Anything not on the explicit don’ts list goes, so while I’m still stuffing the washer with muddy bed sheets I hear a weird row in the living room. Running upstairs I spot my 10 yo son and his friend playing soccer with a ball I had to fish out of a heap of broken crockery and uprooted indoor plants to confiscate it just two hours ago. Yelling that’s against our well known rules and why the heck is that […] ball in the game again I get stopped immediately by the boys. First, they just played soccer and the recent catastrophe was the result of a strictly forbidden basketball joust. I’ve to admit that I said they must not play basketball in the house. Second, it’s my fault when I don’t hide the key to the closet where I locked the confiscated ball away. Ok, enough is enough. I banned my son’s friend and grounded my son for a week, took away the ball, and ran to the backyard to rescue two bitterly crying muddy dwarfs from the shed’s roof. Later on, while two little monsters play games in the bath tub which I really don’t want to watch too closely currently, I read a thread titled “Daddy is soooo unfair” in the house arrest forum where my son and his buddy tell the world that they didn’t do anything wrong, just sheer whitehatted stuff, but I stole their toy and banned them from the playground. Sigh.

I’m exhausted. I’m supposed to deliver a script to merge a few feeds giving fresh contents, a crawlability review, and whatnot tonight, but I just wonder what else will happen when I leave the monsters alone in their beds after supper and story hour, provided I get them into their beds without a medium-size flame war. Now I understand why another daddy supplemented the family with a mom.

11 comments Sebastian | Search Quality, Fun

Unavailable_After is totally and utterly useless

Posted on 30 July, 2007

I’ve a lot of respect for Dan Crow, but I’m struggling with my understanding, or possible support, of the unavailable_after tag. I don’t want to put my reputation for bashing such initiatives from search engines at risk, so sit back and grab your popcorn, here comes the roasting:

As a Webmaster, I did not find a single scenario where I could or even would use it. That’s because I’m a greedy traffic whore. A bazillion other Webmasters are greedy too. So how the heck is Google going to sell the newish tag to the greedy masses?

Ok, from a search engine’s perspective unavailable_after makes sound sense. Outdated pages bind resources, annoy searchers, and in a row of useless crap the next bad thing after an outdated page is intentional Webspam.

So convincing the great unwashed to put that thingy on their pages inviting friends and family to granny’s birthday party on 25-Aug-2007 15:00:00 EST would improve search quality. Not that family blog owners care about new meta tags, RFC 850-ish date formats, or search engine algos rarely understanding that the announced party is history on Aug/26/2007. Besides there may be painful aftermaths worth submitting a desperate call for aspirins the day after in the comments, what would be news of the day after expiration. Kinda dilemma, isn’t it?
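For reference, here is roughly what putting that thingy on a page would involve. This is a sketch under assumptions: the post only mentions the date, so the meta element’s exact name and content syntax are written the way I recall Google’s announcement describing it, and the choice of GMT instead of EST is mine. The function name is made up.

```php
<?php
// Hypothetical helper: build an unavailable_after meta element from a Unix timestamp.
// Assumption: the tag is addressed to Googlebot and takes a "dd-Mon-yyyy hh:mm:ss ZONE" date.
function unavailableAfterTag($timestamp) {
    $date = gmdate("d-M-Y H:i:s", $timestamp) . " GMT";
    return '<meta name="googlebot" content="unavailable_after: ' . $date . '">';
}

// Granny's party: 25-Aug-2007 15:00:00 EST equals 20:00:00 GMT.
print unavailableAfterTag(gmmktime(20, 0, 0, 8, 25, 2007));
```

A CMS could emit this in the HEAD of any page that has an expiration date, which is about the only way family blogs would ever get it.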

Seriously, unless CMS vendors support the new tag, tiny sites and clique blogs aren’t Google’s target audience. This initiative addresses large sites which are responsible for a huge amount of outdated contents in Google’s search index.

So what is the large site Webmaster’s advantage of using the unavailable_after tag? A loss of search engine traffic. A loss of link juice gained by the expired page. And so on. Losses of any kind are not that helpful when it comes to an overdue raise nor in salary negotiations. Hence the Webmaster asks for the sack when s/he implements Google’s traffic terminator.

Who cares about Google’s search quality problems when it leads to traffic losses? Nobody. Caring Webmasters do the right thing anyway. And they don’t need no more useless meta tags like unavailable_after. “We don’t need no stinking metas” from “Another Brick in the Wall Part Web 2.0″ expresses my thoughts perfectly.

So what separates the caring Webmaster from the ‘ruthless traffic junkie’ whom Google wants to implement the unavailable_after tag? The traffic junkie lets his stuff expire without telling Google about its state, is happy that frustrated searchers click the URL from the SERPs even years after the event, and enjoys the earnings from tons of ads placed above the content minutes after the party was over. Dear Google, you can’t convince this guy.

[It seems this is a post about repetitive “so whats”. And I came to the point before the 4th paragraph … wow, that’s new … and I’ve put a message in the title which is not even meant as link bait. Keep on reading.]

So what does the caring Webmaster do without the newish unavailable_after tag? Business as usual. Examples:

Say I run a news site where the free content goes to the subscription area after a while. I’d closely watch which search terms generate traffic, write a search engine optimized summary containing those keywords, put that on the sales pitch, and move the original article to the archives accessible to subscribers only. It’s not my fault that the engines think they point to the original article after the move. When they recrawl and reindex the page my traffic will increase because my summary fits their needs even better.

Say I run an auction site. Unfortunately particular auctions expire, but I’m sure that the offered products will return to my site. Hence I don’t close the page; instead I search my database for similar offerings and promote them under an H3 heading like <h3>[product] (stuffed keywords) is hot</h3>, then <p>buy [product] here:</p>, followed by a list of identical products for sale or similar auctions.
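A rough sketch of the auction example, assuming an in-memory list instead of a real database. The `Auction` record, its fields, and the `similar_offers_html` helper are hypothetical names invented for illustration, and matching on an identical product name is a deliberately naive stand-in for whatever similarity search a real site would run.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Auction:
    # Hypothetical record layout, for illustration only.
    title: str
    product: str
    url: str
    active: bool


def similar_offers_html(expired: Auction, listings: List[Auction], limit: int = 5) -> str:
    """Render a promo block for an expired auction page.

    Picks active listings for the same product (excluding the expired
    page itself) and returns an H3 heading, a paragraph and a link list.
    Returns an empty string when nothing similar is on offer.
    """
    matches = [a for a in listings
               if a.active and a.product == expired.product and a.url != expired.url]
    matches = matches[:limit]
    if not matches:
        return ""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(a.url), escape(a.title))
        for a in matches)
    product = escape(expired.product)
    return "<h3>{} is hot</h3>\n<p>Buy {} here:</p>\n<ul>\n{}\n</ul>".format(
        product, product, items)
```

The expired URL stays alive and keeps answering the query it ranks for, only with current offers instead of a dead auction.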

Say I run a poll expiring in two weeks. With Google’s newish near real time indexing that’s enough time to collect keywords from my stats, so the textual summary under the poll’s results will attract the engines as well as visitors when the poll is closed. Also, many visitors will follow the links to related or new polls.

From Google’s POV there’s nothing wrong with my examples, because the visitor gets what s/he was searching for, and I didn’t cheat. Now tell me, why should I give up these valuable sources of nicely targeted search engine traffic just to make Google happy? Rather I’d make my employer happy. Dear Google, you didn’t convince me.

Update: Tanner Christensen posted a remarkable comment at Sphinn:

I’m sure there is some really great potential for the tag. It’s just none of us have a need for it right now.

Take, for example, when you buy your car without a cup holder. You didn’t think you would use it. But then, one day, you find yourself driving home with three cups of fruit punch and no cup holders. Doh!

I say we wait it out for a while before we really jump on any conclusions about the tag.

John Andrews was the first to report an evil use of unavailable_after.

Also, Dan Crow from Google announced a pretty neat thing in the same post: With the X-Robots-Tag you can now apply crawler directives valid in robots meta tags to non-HTML documents like PDF files or images.
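A minimal sketch of the idea, assuming the response headers are kept in a plain dictionary. The helper name `with_robots_header` is made up for illustration; only the `X-Robots-Tag` header name and the `noindex` and `nofollow` directives are the real thing.

```python
from typing import Dict, Iterable


def with_robots_header(headers: Dict[str, str], directives: Iterable[str]) -> Dict[str, str]:
    """Return a copy of the response headers with an X-Robots-Tag added.

    The directives are the same values one would put into a robots
    meta tag, joined into one comma-separated header value.
    """
    result = dict(headers)
    result["X-Robots-Tag"] = ", ".join(directives)
    return result


if __name__ == "__main__":
    # Headers for a PDF that should stay out of the index.
    pdf_headers = with_robots_header(
        {"Content-Type": "application/pdf"}, ["noindex", "nofollow"])
    for name, value in pdf_headers.items():
        print("{}: {}".format(name, value))
```

Since a PDF or an image has no HEAD section to carry a meta tag, the HTTP header is the only place where such a directive can travel with the file.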

6 comments Sebastian | Search Quality, Robots Meta Tags, Crawler Directives, SEO, Google

Buying cheap viagra algorithmically

Posted on 18 July, 2007

Since Google can’t manage to clean up [Buy cheap viagra] let’s do it ourselves. Go seek a somewhat trusted search blog mentioning “buy cheap viagra” somewhere in the archives and link to the post with a slightly diversified anchor text like “how to buy cheap viagra online”. Matt deserves a #1 spot by the way so spread many links …

Then, when Matt is annoyed enough and Google has kicked out the unrelated stuff from this search, hopefully my viagra spam will rank as deserved again.

Update a few hours later: Matt ranks #1 for [buy cheap viagra algorithmically]:

His ranking for [buy cheap viagra] fell about 10 positions to #17 but for [buy cheap viagra online] he’s still on the first SERP, now at position #10 (#3 yesterday). Interesting. It seems that Google’s newish turbo-blog-indexing influences the rankings of pages linked from blog posts rather quickly, but the effect is not exactly long-lasting.

Related posts:
Negative SEO At Work: Buying Cheap Viagra From Google’s Very Own Matt Cutts - Unless You Prefer Reddit? Or Topix? by Fantomaster
Trust + keywords + link = Good ranking (or: How Matt Cutts got ranked for “Buy Cheap Viagra”) by Wiep

8 comments Sebastian | Webspam, Search Quality, Fun, Crap, Anchor Text, Google

Why eBay and Wikipedia rule Google’s SERPs

Posted on 5 July, 2007

It’s hard to find an obscure search query like [artificial link] which doesn’t deliver eBay spam or a Wikipedia stub within the first few results at Google. Although both Wikipedia and eBay are large sites, the Web is huge, so two sites that different shouldn’t dominate the SERPs for that many topics. Hence it’s safe to say that many nicely ranked search results at Googledia, pulled from eBaydia, are plainly artificially positioned non-results.

Curious why my beloved search engine fails so badly, I borrowed a Google-savvy spy from GHN and sent him to Mountain View to uncover the eBaydia ranking secrets. He came back with lots of pay-dirt scraped from DVDs in the safe of building 43. Before I sold Google’s ranking algo to Ask (the price Yahoo! and MSN offered was laughable), I figured out why Googledia prefers eBaydia from comments in the source code. Here is the unbelievable story of a miserable failure:

When Yahoo! launched Mindset, Larry Page and Sergey Brin threw chairs out of anger because Google wasn’t able to accomplish such a simple task. The engineers, eager to fulfill their founders’ wishes asap, tried to integrate mindset-functionality without changing Google’s fascinatingly simple search interface (that means without a shopping/research slider). Personalized search still lived in the labs, but provided a somewhat suitable API (mega beta): scanSearchersBrainForContext([search query]). Not knowing that this function of personalized search polls a nano-bugging-device (pre alpha) which Google had neither released nor implemented into any searcher’s brain at this time, they made use of that piece of experimental code to evaluate the search query’s context. Since the method always returned “false” and they had to deliver results quickly, they made up some return values to test their algo tweaks:

/* debug - praying S&L don't throw more chairs */
if (scanSearchersBrainForContext($searchQuery) === false) {
    $contextShopping = "%ebay%";
    $contextResearch = "%wikipedia%";
    $context = both($contextShopping, $contextResearch);
} else {
    /* [pretty complex algo] */
}

This worked fine and found its way into the ranking algo under time pressure. The result is that with each and every search query where a page from eBay and/or Wikipedia is in the raw result set, those get a ranking boost. Sergey was happy because eBay is generally listed on page #1, and Larry likes the Wikipedia results on the first SERP. Tell me, why the heck should the engineers comment out these made-up return values? No engineer on this planet likes flying chairs, especially not in his office.

PS: Some SEOs push Wikipedia stubs too.

7 comments Sebastian | Internet Marketing, Search Quality, Fun, Crap, Google
