Nofollow

Archived posts from the 'Nofollow' Category

How to disagree on Twitter, machine-readable

Posted on 7 January, 2010

[Image: URI link condom for social media]

With standard hyperlinks you can add a rel="crap nofollow" attribute to your A elements. But how do you tell search engine crawlers and other Web robots that you disagree with a link’s content when you post the URI on Twitter or elsewhere?

You cannot rely on the HTML presentation layer of social media sites. Despite the fact that most of them add a condom to all UGC links, crawlers do follow those links. Nowadays crawlers grab tweets and their embedded links long before they bother to fetch the HTML pages. They fatten their indexers with contents scraped from feeds. That means indexers don’t (really) take the implicit disagreement into account.

As long as you operate your own URI shortener, there’s a solution.

Condomize URIs, not A elements

Here’s how to nofollow a plain link drop, where you’ve no control over link attributes like rel-nofollow:

Prerequisite: understanding the anatomy of a URI shortener.
Add an attribute like shortUri.suriNofollowed, boolean, default=false, to your shortened URIs database table. In the Web form where you create and edit short URIs, add a corresponding checkbox and update your affected scripts.
Make sure your search engine crawler detection is up-to-date.
Change the piece of code that redirects to the original URI:

if ($isCrawler && $suriNofollowed) {
    // Crawlers requesting a condomized short URI get a 403 instead of the redirect.
    header("HTTP/1.1 403 Forbidden redirect target", TRUE, 403);
    print "<html><head><title>This link is condomized!</title></head><body><p>Search engines are not allowed to follow this link: <code>" . htmlspecialchars($suriUri) . "</code></p></body></html>";
} else {
    // Everybody else gets redirected to the original URI.
    header("HTTP/1.1 301 Here you go", TRUE, 301);
    header("Location: $suriUri");
}
exit;
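The redirect decision from the steps above can also be sketched in JavaScript as a pure function, which makes it easy to test. This is only a sketch under assumptions: `looksLikeCrawler` is a hypothetical stand-in for real crawler detection (a bare user agent match is not reliable on its own), and the record shape `{ uri, nofollowed }` is made up to mirror the suriNofollowed attribute.

```javascript
// Hypothetical stand-in for real crawler detection. A user agent match alone
// is easy to spoof; production code would also verify the requesting IP.
function looksLikeCrawler(userAgent) {
  return /googlebot|bingbot|slurp/i.test(userAgent || "");
}

// Escape text before embedding it in the HTML error page.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Decide what the shortener answers for one short URI record.
// `record` is assumed to look like { uri: string, nofollowed: boolean }.
function resolveShortUri(record, userAgent) {
  if (record.nofollowed && looksLikeCrawler(userAgent)) {
    return {
      status: 403,
      headers: { "Content-Type": "text/html" },
      body:
        "<html><head><title>This link is condomized!</title></head><body>" +
        "<p>Search engines are not allowed to follow this link: <code>" +
        escapeHtml(record.uri) +
        "</code></p></body></html>",
    };
  }
  // Everybody else gets the permanent redirect to the original URI.
  return { status: 301, headers: { Location: record.uri }, body: "" };
}
```

Keeping the decision separate from the HTTP plumbing means the same function can sit behind whatever server sided scripting language or framework the shortener runs on.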

Here’s an example: This shortened URI takes you to a Bing SEO tip. Search engine crawlers get bagged in a 403 link condom.

Since you can’t test it yourself (user agent spoofing doesn’t work), here’s a header reported by Googlebot (requesting the condomized URI above) today:

HTTP/1.1 403 Forbidden
Date: Thu, 07 Jan 2010 10:19:16 GMT
...
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

The error page just says:

Title + H1: Link is nofollow'ed
P: Sorry, this shortened URI must not get followed by search engines.

If you can’t roll your own, feel free to make use of my URI Condomizer. Have fun condomizing crappy links on Twitter.

If you check “Nofollow” your URI gets condomized. That means, search engines can’t request it from the shortened URI, but users and other Web robots get redirected.

6 comments Sebastian | URI shortening, Social Web, Twitter, Nofollow

How to handle a machine-readable pandemic that search engines cannot control

Posted on 19 June, 2009

[Image: R.I.P. rel-nofollow]

If you’re familiar with my various rants on the ever morphing ~~rel-nofollow microformat~~ infectious link disease, don’t read further. This post is not polemic, ironic, insulting, or otherwise meant to entertain you. I’m just raving about a way to delay the downfall of the InterWeb.

Let’s recap: The World Wide Web is based on hyperlinks. Hyperlinks are supposed to lead humans to interesting stuff they want to consume. This simple and therefore brilliant concept worked great for years. The Internet grew up, bubbled a bit, but eventually it gained world domination. Internet traffic was counted, sold, bartered, purchased, and even exchanged for free in units called “hits”. (A “hit” means one human surfer landing on a sales pitch. That is a popup hell designed in a way that somebody involved just has to make a sale).

Then in the past century two smart guys discovered that links scraped from Web pages can be misused to provide humans with very accurate search results. They even created a new currency on the Web, and quickly assigned their price tags to Web pages. Naturally, folks began to trade green pixels instead of traffic. After a short while the Internet voluntarily transferred its world domination to the company founded by those two smart guys from Stanford.

Of course the huge amount of green pixel trades made the search results based on link popularity somewhat useless, because the webmasters gathering the most incoming links got the top 10 positions on the search result pages (SERPs). Search engines claimed that a few webmasters cheated on their way to the first SERPs, although lawyers say there’s no evidence of any illegal activities related to search engine optimization (SEO).

However, after suffering from heavy attacks from a whiny blogger, the Web’s dominating search engine got somewhat upset and required that all webmasters have to assign a machine-readable tag (link condom) to links sneakily inserted into their Web pages by other webmasters. “Sneakily inserted links” meant references to authors as well as links embedded in content supplied by users. All blogging platforms, CMS vendors and alike implemented the link condom, eliminating presumably 5.00% of the Web’s linkage at this time.

A couple of months later the world dominating search engine demanded that webmasters have to condomize their banner ads, intercompany linkage and other commercial links, as well as all hyperlinked references that do not count as pure academic citation (aka editorial links). The whole InterWeb complied, since this company controlled nearly all the free traffic available from Web search, as well as the Web’s purchasable traffic streams.

Roughly 3.00% of the Web’s links were condomized when the search giant spotted that their users (searchers) missed out on lots and lots of valuable content covered by link condoms. Ooops. Kinda dilemma. Taking back the link condom requirements was no option, because this would have flooded the search index with billions of unwanted links empowering commercial content to rank above boring academic stuff.

So the handling of link condoms in the search engine’s crawling engine as well as in its ranking algorithm was changed silently. Without telling anybody outside their campus, some condomized links gained power, whilst others were kept impotent. In fact they’ve developed a method to judge each and every link on the whole Web without a little help from their ~~friends~~ link condoms. In other words, the link condom became obsolete.

Of course that’s what they should have done in the first place, without asking the world’s webmasters for gazillions of free-of-charge man years producing shitloads of useless code bloat. Unfortunately, they didn’t have the balls to stand up and admit “sorry folks, we’ve failed miserably, link condoms are history”. Therefore the Web community still has to bother with an obsolete microformat. And if they -the link condoms- are not dead, then they live today. In your markup. Hurting your rankings.

If you, dear reader, are a Googler, then please don’t feel too annoyed. You may have thought that you didn’t do evil, but the above said reflects what webmasters outside the ‘Plex got from your actions. Don’t ignore it, please think about it from our point of view. Thanks.

Still here and attentive? Great. Now let’s talk about scenarios in WebDev where you still can’t avoid rel-nofollow. If there are any. We’ll see.

PageRank™ sculpting

Dude, PageRank™ sculpting with rel-nofollow doesn’t work for the average webmaster. It might even fail when applied as a highly sophisticated SEO tactic. So don’t even think about it. Simply remove the rel=nofollow from links to your TOS, imprint, and contact page. Cloak away your links to signup pages, login pages, shopping carts and stuff like that.

Link monkey business

I leave this paragraph empty, because when you know what you do, you don’t need advice.

Affiliate links

There’s no point in serving A elements to Googlebot at all. If you haven’t cloaked your aff links yet, go see an SEO doctor.

Advanced SEO purposes

See above.

So what’s left? User generated content. Let’s concentrate our extremely superfluous condomizing efforts on the one and only occasion that might justify applying rel-nofollow to a hyperlink on request of a major search engine, if there’s any good reason to paint shit brown at all.

Blogging

If you link out in a blog post, then you vouch for the link’s destination. In case you disagree with the link destination’s content, just put the link as

<strong class="blue_underlined" title="http://myworstenemy.org/" onclick="window.location=this.title;">My Worst Enemy</strong>

or so. The surfer can click the link and lands at the expected URI, but search engines don’t pass reputation. Also, they don’t evaporate link juice, because they don’t interpret the markup as a hyperlink.

Blog comments

My rule of thumb is: Moderate, DoFollow quality, DoDelete crap. Install a conditional do-follow plug-in, set everything on moderation, use captchas or something similar, then let the comment’s link juice flow. You can maintain a white list that allows instant appearance of comments from your buddies.

Forums, guestbooks and unmoderated stuff like that

Separate all Web site areas that handle user generated content. Serve “index,nofollow” meta tags or X-Robots-Tag headers for all those pages, and link them from a site map or so. If you gather index-worthy content from users, then feed crawlers the content in a parallel -crawlable- structure, without submit buttons, perhaps with links from trusted users, and redirect human visitors to the interactive pages. Vice versa, redirect crawlers requesting live pages to the spider fodder. All those redirects go with a 301 HTTP response code.
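The two-way redirect described above can be sketched as a small routing function. Treat it as a sketch under assumptions: the `/forum/` and `/forum-archive/` prefixes are a hypothetical URL layout, and `isCrawler` is expected to come from whatever crawler detection you already run.

```javascript
// Hypothetical URL layout: interactive pages live under /forum/, the
// crawlable read-only copies ("spider fodder") under /forum-archive/.
const LIVE_PREFIX = "/forum/";
const FODDER_PREFIX = "/forum-archive/";

// Returns { status: 301, location } when the visitor must be redirected,
// or { status: 200 } when the requested path may be served as is.
function routeForumRequest(path, isCrawler) {
  if (isCrawler && path.startsWith(LIVE_PREFIX)) {
    // Crawlers requesting live pages are sent to the spider fodder.
    return { status: 301, location: FODDER_PREFIX + path.slice(LIVE_PREFIX.length) };
  }
  if (!isCrawler && path.startsWith(FODDER_PREFIX)) {
    // Humans landing on the crawlable copy are sent to the interactive page.
    return { status: 301, location: LIVE_PREFIX + path.slice(FODDER_PREFIX.length) };
  }
  return { status: 200 };
}
```

With this layout, a crawler asking for /forum/thread-42 ends up at /forum-archive/thread-42, and a human following a SERP link to the archive copy ends up on the live thread.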

If you lack the technical skills to accomplish that, then edit your /robots.txt file as follows:

User-agent: Googlebot
# Dear Googlebot, drop me a line when you can handle forum pages
# w/o rel-nofollow crap. Then I'll allow crawling.
# Treat that as conditional disallow:
Disallow: /forum

As soon as Google can handle your user generated content naturally, they might send you a message in their Webmaster console.

Anything else

Judge yourself. Most probably you’ll find a way to avoid rel-nofollow.

Conclusion

Absolutely nobody needs the rel-nofollow microformat. Not even search engines for the sake of their index. Hence webmasters as well as search engines can stop wasting resources. Farewell rel="nofollow", rest in peace. We won’t miss you.

22 comments Sebastian | Search Quality, Web development, X-Robots-Tag, Blogging, Risky Linkage, Paid Links, Microformats, Google, SEO, Cloaking, Nofollow

Vaporize yourself before Google burns your linking power

Posted on 16 June, 2009

[Image: PIC-1: Google PageRank(tm) 2007]

I couldn’t care less about PageRank™ sculpting, because a well thought out link architecture does the job with all search engines, not just Google. That’s where Google is right on the money.

They own PageRank™, hence they can burn, evaporate, nillify, and even divide by zero or multiply by -1 as much PageRank™ as they like; of course as long as they rank my stuff nicely above my competitors.

Picture 1 shows Google’s PageRank™ factory as of 2007 or so. Actually, it’s a pretty simplified model, but since they’ve changed the PageRank™ algo anyway, you don’t need to bother with all the geeky details.

As a side note: you might ask why I don’t link to Matt Cutts and Danny Sullivan discussing the whole mess on their blogs? Well, probably Matt can’t afford my advertising rates, and the whole SEO industry has linked to Danny anyway. If you’re nosy, check out my source code to learn more about state of the art linkage very compliant to Google’s newest guidelines for advanced SEOs (summary: “Don’t trust underlined blue text on Web pages any longer!”).

[Image: PIC-2: Google PageRank(tm) 2009]

What really matters is picture 2, revealing Google’s new PageRank™ facilities, silently launched in 2008. Again, geeky details are of minor interest. If you really want to know everything, then search for [operation bendover] at !Yahoo (it’s still top secret, and therefore not searchable at Google).

Unfortunately, advanced SEO folks (whatever that means, I use this term just because it seems to be an essential property assigned to the participants of the current PageRank™ ~~uprising~~ discussion) always try to confuse you with overcomplicated graphics and formulas when it comes to PageRank™. Instead, I ask you to focus on the (important) hard core stuff. So go grab a magnifier, and work out the differences:

PageRank™ 2009 in comparison to PageRank™ 2007 comes with a pipeline supplying unlimited fuel. Also, it seems they’ve implemented the green new deal, switching from gas to natural gas. That means they can vaporize way more link juice than ever before.
PageRank™ 2009 produces more steam, and the clouds look slightly different. Whilst PageRank™ 2007 ignored nofollow crap as well as links put with client sided scripting, PageRank™ 2009 evaporates not only juice covered with link condoms, but also tons of other permutations of the standard A element.
To compensate for the huge overall loss of PageRank™ caused by those changes, Google has decided to pass link juice from condomized links to their target URI hidden to Googlebot with JavaScript. Of course Google formerly has recommended the use of JavaScript-links to prevent the webmasters from penalties for so-called “questionable” outgoing links. Just as they’ve not only invented rel-nofollow, but heavily recommended the use of this microformat with all links disliked by Google, and now they take that back as if a gazillion links on the Web could magically change just because Google tweaks their algos. Doh! I really hope that the WebSpam-team checks the age of such links before they penalize everything implemented according to their guidelines before mid-2009 or the InterWeb’s downfall, whatever comes last.

I guess in the meantime you’ve figured out that I’m somewhat pissed. Not that the secretly changed flow of PageRank™ a year ago in 2008 had any impact on my rankings, or SERP traffic. I’ve always designed my stuff with PageRank™ flow in mind, but without any misuses of rel=”nofollow”, so I’m still fine with Google.

What I can’t stand is when a search engine tries to tell me how I’ve to link (out). Google engineers are really smart folks, they’re perfectly able to develop a PageRank™ algo that can decide how much Google-juice a particular link should pass. So dear Googlers, please -WRT the implementation of hyperlinks- leave us webmasters alone, dump the rel-nofollow crap and rank our stuff in the best interest of your searchers. No longer bother us with linking guidelines that change yearly. It’s not our job nor responsibility to act as your ~~cannon fodder~~ slavish code monkeys when you spot a loophole in your ranking- or spam-detection-algos.

Of course the above said is based on common sense, so Google won’t listen (remember: I’m really upset, hence polemic statements are absolutely appropriate). To prevent webmasters from irrational actions by misled search engines, I hereby introduce the

Webmaster guidelines for search engine friendly links

What follows is pseudo-code, implement it with your preferred server sided scripting language.

if (getAttribute($link, 'rel') matches '*nofollow*' && $userAgent matches '*Googlebot*') {
    print '<strong rev="' + getAttribute($link, 'href') + '"'
        + ' style="color:blue; text-decoration:underline;"'
        + ' onmousedown="window.location=this.getAttribute(\'rev\');"'
        + '>' + getAnchorText($link) + '</strong>';
} else {
    print $link;
}

Probably it’s a good idea to snip both the onmousedown trigger code as well as the rev attribute, when the script gets executed by Googlebot. Just because today Google states that they’re going to pass link juice to URIs grabbed from the onclick trigger, that doesn’t mean they’ll never look at the onmousedown event or misused (X)HTML attributes.

This way you can deliver Googlebot exactly the same stuff that the ~~punter~~ surfer gets. You’re perfectly compliant to Google’s cloaking restrictions. There’s no need to bother with complicated stuff like iFrames or even disabled blog comments, forums or guestbooks.

Just feed the crawlers with all the crap the search engines require, then concentrate all your efforts on your UI for human visitors. Web robots (bots, crawlers, spiders, …) don’t supply your signup-forms w/ credit card details. Humans do. If you find the time to upsell them while search engines keep you busy with thoughtless change requests all day long.

18 comments Sebastian | Webspam, Paid Links, Search Quality, Web development, Internet Marketing, Crawler Directives, Crap, Google, Microformats, SEO, Cloaking, Anchor Text, Nofollow

Nofollow still means don’t follow, and how to instruct Google to crawl nofollow’ed links nevertheless

Posted on 23 February, 2008

[Image: painting a nofollow'ed link dofollow]

What was meant as a quick test of rel-nofollow once again (inspired by Michelle’s post stating that nofollow’ed comment author links result in rankings) turned up some interesting observations:

Google uses sneaky JavaScript links (that mask nofollow’ed static links) for discovery crawling, and indexes the link destinations even though there’s no hard coded link on any page on the whole Web.
Google doesn’t crawl URIs found in nofollow’ed links only.
Google most probably doesn’t use anchor text outputted client sided in rankings for the page that carries the JavaScript link.
Google most probably doesn’t pass anchor text of JavaScript links to the link destination.
Google doesn’t pass anchor text of (hard coded) nofollow’ed links to the link destination.

As for my inspiration, I guess not all links in Michelle’s test were truly nofollow’ed. However, she’s spot on stating that condomized author links aren’t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don’t pay too much attention to REL attributes when you spread your links.

As for my quick test explained below, please consider it an inspiration too. It’s not a full blown SEO test, because I’ve checked one single scenario for a short period of time. However, looking only at results gathered within 24 hours after uploading the test makes it fairly certain that the test isn’t influenced by external noise, for example scraped links and such stuff.

On 2008-02-22 06:20:00 I put a new nofollow’ed link onto my sidebar: Zilchish Crap

<a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-a" rel="nofollow"><span id="repstuff-something-b">Zilchish Crap</span></a>
<script type="text/javascript">
handle=document.getElementById('repstuff-something-b');
handle.firstChild.data='Nillified, Nil';
handle=document.getElementById('repstuff-something-a');
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js1';
handle.rel='dofollow';
</script>

(The JavaScript code changes the link’s HREF, REL and anchor text.)

The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow’ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.

Google crawls URIs extracted from somewhat sneaky JavaScript code

27 minutes later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element’s HREF:

66.249.72.5 2008-02-22 06:47:07 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js1

Roughly three hours after this visit Googlebot fetched a URI provided only in JS code on the test page:

handle=document.getElementById('a1');
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js2';
handle.rel='dofollow';

From the log:

66.249.72.5 2008-02-22 09:37:11 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js2
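To double-check the timing, the two log lines quoted above can be parsed and compared with the moment the link went live. This is a sketch that assumes the log format is exactly what the quoted lines show (IP, date, time, status, user agent, path) and that all timestamps share one time zone; `parseLogLine` and `minutesBetween` are names made up for this example.

```javascript
// Parse one line of the custom log format quoted above:
// IP, date, time, status, user agent (may contain spaces), requested path.
function parseLogLine(line) {
  const m = /^(\S+) (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) (.+) (\S+)$/.exec(line.trim());
  if (!m) return null;
  return {
    ip: m[1],
    // The real time zone is unknown; treating every timestamp as UTC is fine
    // as long as only differences are computed.
    time: new Date(`${m[2]}T${m[3]}Z`),
    status: m[4],
    userAgent: m[5],
    path: m[6],
  };
}

// Whole minutes elapsed between two Date objects.
function minutesBetween(earlier, later) {
  return Math.floor((later - earlier) / 60000);
}

const UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const linkPlaced = new Date("2008-02-22T06:20:00Z");
const firstFetch = parseLogLine(`66.249.72.5 2008-02-22 06:47:07 200-OK ${UA} /repstuff/something.php?nil=js1`);
const secondFetch = parseLogLine(`66.249.72.5 2008-02-22 09:37:11 200-OK ${UA} /repstuff/something.php?nil=js2`);

console.log(minutesBetween(linkPlaced, firstFetch.time));       // 27
console.log(minutesBetween(firstFetch.time, secondFetch.time)); // 170
```

So the first fetch came 27 minutes after the link was placed, and the second one 2 hours and 50 minutes after the first.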

So far Google ignored the hidden JavaScript link to /repstuff/something.php?nil=js3 on the test page. Its code doesn’t change a static link, so that makes sense in the context of repeated statements like “Google ignores JavaScript links / treats them like nofollow’ed links” by Google reps.

Of course the JS code above is easy to analyze, but don’t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!

Google indexes pages that have only JavaScript links pointing to them

The next day I checked the search index, and the results are interesting:

[Image: rel-nofollow-test search results]

The first search result is the content of the URI with the query string parameter ?nil=js1, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI /repstuff/something.php without query string. There’s not a single real link to this URI elsewhere.

The second search result is a post URI where Google recognized the hard coded anchor text “zilchish crap”, but not the JS code that overwrites it with “Nillified, Nil”. With the SERP-URI parameter “&filter=0” Google shows more posts that are findable with the search term [zilchish]. (Hey Matt and Brian, here’s room for improvement!)

Google doesn’t pass anchor text of nofollow’ed links to the link destination

A search for [zilchish site:sebastians-pamphlets.com] doesn’t show the testpage that doesn’t carry this term. In other words, so far the anchor text “zilchish crap” of the nofollow’ed sidebar link didn’t impact the test page’s rankings yet.

Google doesn’t treat anchor text of JavaScript links as textual content

A search for [nillified site:sebastians-pamphlets.com] doesn’t show any URIs that have “nil, nillified” as client sided anchor text on the sidebar, just the test page:

[Image: rel-nofollow-test search results]

Results, conclusions, speculation

This test wasn’t intended to evaluate whether JS outputted anchor text gets passed to the link destination or not. Unfortunately “nil” and “nillified” appear both in the JS anchor text as well as on the page, so that’s for another post. However, it seems the JS anchor text isn’t indexed for the pages carrying the JS code, at least they don’t appear in search results for the JS anchor text, so most likely it will not be assigned to the link destination’s relevancy for “nil” or “nillified” as well.

Maybe Google’s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows.

However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I’m quite confident that condomized links still don’t pass reputation, nor topical relevance. My test page is unfindable for the nofollow’ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.

How to safely strip a link condom

And what’s the actual “news”? Well, say you’ve links that you must condomize because they’re paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow’ed link server sided, and change it to a clean link with JavaScript. Google told us for ages that JS links don’t count, so that’s perfectly in line with Google’s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don’t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client sided version of the link. Isn’t that neat?

Link condoms with juicy taste faking good karma

Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they look plain ugly otherwise.

Here is how you can do this for a complete Web page. This link is nofollow’ed. The JavaScript code below changed its REL value to “dofollow”. When you put this code at the bottom of your pages, it will un-condomize all your nofollow’ed links.

<script type="text/javascript">
if (document.getElementsByTagName) {
    var aElements = document.getElementsByTagName("a");
    for (var i = 0; i < aElements.length; i++) {
        var relvalue = aElements[i].rel.toUpperCase();
        // match() returns null when nothing matches, so compare against null, not the string "null"
        if (relvalue.match("NOFOLLOW") !== null) {
            aElements[i].rel = "dofollow";
        }
    }
}
</script>

(You’ll find still condomized links on this page. That’s because the JavaScript routine above changes only links placed above it.)

When you add JavaScript routines like that to your pages, you’ll increase their page loading time. IOW you slow them down. Also, you should add a note to your linking policy to avoid confused advertisers who chase toolbar PageRank.

Updates: Obviously Google distrusts me, how come? Four days after the link discovery the search quality archangel requested the nofollow’ed URI -without query string- possibly to check whether I serve different stuff to bots and people. As if I’d cloak, laughable. (Or an assclown linked the URI without condom.)
Day five: Google’s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren’t interested in client-sided links at all?

19 comments Sebastian | Paid Links, Testing, Anchor Text, Cloaking, Google, SEO, Nofollow

My plea to Google - Please sanitize your REP revamps

Posted on 3 January, 2008

Standardization of REP tags as robots.txt directives

[Image: Google is confused on REP standards and robots.txt]

This draft is kinda request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains their own REP standard). It’s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported by all major search engines. Furthermore it considers work in progress leaked out from Google.

In the following I’ll try to define a few robots.txt directives that Webmasters really need.

Currently Google experiments with new robots.txt directives, that is REP tags like “noindex” adapted for robots.txt. That’s a welcome and brilliant move.

Unfortunately, they got it totally wrong, again. (Skip the longish explanation of the rel-nofollow fiasco and my rant on Google’s current robots.txt experiments.)

Google’s last try to enhance the REP by adapting a REP tag’s value at another level was a miserable failure. Not because crawler directives on link-level are a bad thing, the opposite is true, but because the implementation of rel-nofollow confused the hell out of Webmasters, and still does.

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Don’t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster’s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.

Recap: When Google launched rel-nofollow in 2005, they promoted it as a tool to fight comment spam.

From now on, when Google sees the attribute (rel=”nofollow”) on hyperlinks, those links won’t get any credit when we rank websites in our search results. This isn’t a negative vote for the site where the comment was posted; it’s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.

Technically speaking, this translates to “search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, nor other link juice originating from rel-nofollow links”. Rel=”nofollow” meant rel=”pass-no-reputation”.

All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn’t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that didn’t update to a software version that supported rel-nofollow.

Experiment failed, move on to better solutions like Akismet, captchas or ajax’ed comment forms? Nope, it’s not that easy. Google had a hidden agenda. Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular band wagon. In 2005 Google had mastered the guestbook spam problem already. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more.

Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren’t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. Google actually invented rel-nofollow to get a grip on paid links. Recently they announced that Googlebot no longer follows condomized links (pre-Bigdaddy Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold war on paid links became hot.

Of course the sneaky morphing of rel-nofollow from “pass no reputation” to a full blown “nofollow” is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links.

To stay fair, Danny Sullivan said twice that rel-nofollow is Dave Winer’s fault, and Google as the victim is not to blame.

Rel-nofollow is settled now. However, I don’t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:

Google’s “Noindex: in robots.txt” experiment

Google supports a Noindex: directive in robots.txt. It seems Google’s Noindex: blocks crawling like Disallow:, but additionally prevents URLs blocked with Noindex: both from accumulating PageRank as well as from indexing based on 3rd party signals like inbound links.

This functionality would be nice to have, but accomplishing it with “Noindex” is badly wrong. The REP’s “Noindex” value without an explicit “Nofollow” means “crawl it, follow its links, but don’t list it on SERPs”. With page-level directives (robots meta tags and X-Robots-Tags) Google handles “Noindex” exactly as defined, that means with an implicit “Follow”. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the “Follow” out of the game, because a search engine can’t follow links from uncrawled documents.
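The argument can be spelled out as a tiny lookup table. This is just a sketch of the semantics as described in this post; the property names are made up for illustration, and the robots.txt “Noindex” row reflects how Google’s experiment seemed to behave, not any published spec.

```javascript
// Effects of each directive as described in the text. "mayBeListed" means the
// URL can still show up on SERPs, for example based on 3rd party signals.
const DIRECTIVES = {
  "meta noindex": { crawled: true, mayBeListed: false },
  "robots.txt Disallow": { crawled: false, mayBeListed: true },
  "robots.txt Noindex": { crawled: false, mayBeListed: false },
};

// A search engine can only follow links from documents it has crawled.
function linksCanBeFollowed(directive) {
  return DIRECTIVES[directive].crawled;
}

// Which directives keep a page off the SERPs while its links still get followed?
const unlistedButFollowed = Object.keys(DIRECTIVES).filter(
  (name) => !DIRECTIVES[name].mayBeListed && linksCanBeFollowed(name)
);

console.log(unlistedButFollowed); // only "meta noindex" qualifies
```

Only the page-level variant keeps the implicit “Follow”, which is exactly the granularity that gets lost when “Noindex” moves into robots.txt with crawl-blocking semantics.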

Webmasters will not understand that “Noindex” means totally different things in robots.txt and meta tags. Also, this approach steals granularity that we need, for example for use with technically structured sitemap pages and other hubs.

According to Google their current interpretation of Noindex: in robots.txt is not yet set in stone. That means there’s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters.

Dear Googlers, don’t take the above as Google bashing. I know, and have often written, that Google is the search engine that puts the most effort into boring tasks like REP evolvement. I just think that a dog company like Google needs to get real-world Webmasters on board when playing with standards like the REP, for the sake of the cats.

Recap: Existing robots.txt directives

The /path example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server’s root. Patterns can be useful to set crawler directives for a bunch of URIs:

*: any string in path or query string, including the query string delimiter “?”, multiple wildcards should be allowed.
$: end of URI
Trailing /: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectories’ files etc., for example
- Disallow: /path/
  matches /path/index.html but not /path.html
- /path
  matches both /path/index.html and /path.html, as well as /path_1.html. It’s a pretty common mistake to “forget” the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.

Please note that patterns aren’t supported by all search engines, for example MSN supports only file extensions (yet?).
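
To make the pattern rules above concrete, here is a minimal Python sketch that translates a robots.txt path pattern into a regular expression. The function name rep_pattern_matches is made up for illustration, and the semantics are only my reading of the list above; real crawlers may differ in details such as case handling or URL decoding.

```python
import re


def rep_pattern_matches(pattern: str, uri: str) -> bool:
    """Return True if a robots.txt path pattern matches a URI (path plus query string).

    Assumed semantics, following the list above:
    '*' matches any string (including '?'), a trailing '$' anchors the
    end of the URI, and everything else is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal chunk, then glue the chunks with '.*' for each '*'.
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, uri) is not None


print(rep_pattern_matches("/path/", "/path.html"))   # False
print(rep_pattern_matches("/path", "/path_1.html"))  # True
```

The second call shows the "forgotten trailing slash" mistake: /path also catches /path_1.html, which is probably not what the Webmaster meant.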

User-agent: [crawler name]
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the User-agent: * section that addresses all Web robots. Each User-agent: section must be terminated with at least one empty line.

Disallow: /path
Prevents crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow’ed URLs can gather PageRank.

Allow: /path
Refines previous Disallow: statements. For example
Disallow: /scripts/
Allow: /scripts/page.php
tells crawlers that they may fetch http://example.com/scripts/page.php or http://example.com/scripts/page.php?article=1, but not any other URL in http://example.com/scripts/.
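
One way to read "refines" is that the most specific (longest) matching rule wins. That is an assumption of the following Python sketch, not something every crawler guarantees; some parsers simply take the first matching line. The function name may_crawl is hypothetical, and wildcards are left out to keep it short.

```python
def may_crawl(path, rules):
    """Decide whether a crawler may fetch `path`.

    `rules` holds (directive, path_prefix) pairs from one User-agent section.
    Assumption: the longest matching prefix wins, and Allow wins a tie.
    """
    best_len = -1
    allowed = True  # nothing matches: crawling is allowed by default
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/scripts/"), ("Allow", "/scripts/page.php")]
print(may_crawl("/scripts/page.php?article=1", rules))  # True
print(may_crawl("/scripts/other.php", rules))           # False
```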

Sitemap: [absolute URL]
Announces XML sitemaps to search engines. Example:
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/video-sitemap.xml
points all search engines that support Google’s Sitemaps Protocol to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn’t replace sitemap submissions. Google, Yahoo and MSN provide Webmaster Consoles where you can not only submit your sitemaps, but also follow the indexing process (wishful thinking WRT particular SEs). In some cases it might be a bright idea to avoid the default file name “sitemap.xml” and keep the sitemap URLs out of robots.txt; sitemap autodiscovery is not for everyone.
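
For illustration, this is roughly what sitemap autodiscovery looks like from the crawler's side. A small Python sketch under the assumption that Sitemap: lines are independent of User-agent: sections; the function name sitemap_urls is mine.

```python
def sitemap_urls(robots_txt):
    """Collect the URLs of all Sitemap: lines in a robots.txt body.

    The whole file is scanned, because Sitemap: lines do not belong to a
    particular User-agent: section. Field names are matched case-insensitively.
    """
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")  # split at the FIRST colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


example = """User-agent: *
Disallow: /scripts/
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/video-sitemap.xml
"""
print(sitemap_urls(example))
# ['http://example.com/sitemap.xml', 'http://example.com/video-sitemap.xml']
```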

Recap: Existing REP tags

REP tags are values that you can use in a page’s robots meta tag and X-Robots-Tag. Robots meta tags go to the HTML document’s HEAD section
<meta name="robots" content="noindex, follow, noarchive" />
whereas X-Robots-Tags supply the same information in the HTTP header
X-Robots-Tag: noindex, follow, noarchive
and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.

Widely supported REP tags are:

INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not
FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not
ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.
NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.
NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes a URL from Google’s search index a day after the given date/time
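
The implicit defaults in this list are easy to get wrong, so here is a minimal Python sketch that turns a robots meta tag or X-Robots-Tag value into explicit flags. The function name parse_rep_tags and the flag names are assumptions for illustration; NOODP, NOYDIR and UNAVAILABLE_AFTER are simply ignored here.

```python
def parse_rep_tags(content):
    """Turn a robots meta tag / X-Robots-Tag value into explicit flags.

    Defaults follow the list above: the absence of a "no..." value means
    the positive behavior (index, follow, archive, snippet).
    """
    flags = {"index": True, "follow": True, "archive": True, "snippet": True}
    for raw in content.split(","):
        tag = raw.strip().lower()
        if tag == "all":
            flags["index"] = flags["follow"] = True
        elif tag == "none":
            flags["index"] = flags["follow"] = False
        elif tag in flags:
            flags[tag] = True           # explicit positive value, e.g. "follow"
        elif tag.startswith("no") and tag[2:] in flags:
            flags[tag[2:]] = False      # noindex, nofollow, noarchive, nosnippet
    return flags


print(parse_rep_tags("noindex, follow, noarchive"))
# {'index': False, 'follow': True, 'archive': False, 'snippet': True}
```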

Problems with REP tags in robots.txt

REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but also comes with more complexity and a few pitfalls as well.

Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can’t obey REP tags without crawling the resource that supplies them. That means that not a single REP tag put as a robots.txt statement may be misread as a crawler directive.

For example Noindex: /path must not block crawling, not even in combination with Nofollow: /path, because there’s still the implicit “archive” (= absence of Noarchive: /path). Providing a cached copy even of a not indexed page makes sense for toolbar users.

Whether or not a search engine actually crawls a resource that’s tagged with “noindex, nofollow, noarchive, nosnippet” or so is up to the particular SE, but none of those values implies a Disallow: /path.

Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says “follow” for all links on a page, the crawler will not follow a link that is condomized with rel="nofollow".

Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Not in every case, of course. Robots.txt is the gatekeeper, and so to say the “highest REP instance”. Actually, to this question there’s no absolute answer that satisfies everybody.

A Webmaster sitting on a huge conglomerate of legacy code may want to totally switch to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page’s “index,follow” from 1998 most probably doesn’t fly with 2008’s duplicate content filters and highly sophisticated link pattern analytics.

The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.

REP tags used in robots.txt might refine crawler directives. For example a disallow’ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank calculation and indexing for uncrawlable resources (see below under Noodp:/Noydir:, Noindex: and Norank:).

Google’s current approach to handle this with the Noindex: directive alone is not acceptable, we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.

In theory it makes no sense to maintain a directive that describes a default behavior. But why does the REP have “follow”, although the absence of “nofollow” perfectly expresses “follow”? Because of the way non-geeks think (try to explain why the value nil/null doesn’t equal empty/zero/blank to a non-geek. Not!).

Implicit directives that aren’t explicitly named and described in the rules don’t exist for the masses. Even in the 10 commandments someone had to write “thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack…” instead of a no-brainer like “publish unique and compelling content for people and make your stuff crawlable”. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow: respectively the absence of Disallow:) folks will whine, rant and argue when search engines ignore their stuff.

Obviously we need at least Index:, Follow: and Archive: to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like
Disallow: /cgi-bin/
Disallow: /content/
...
Follow: /cgi-bin/redirect.php
Follow: /content/links/
...
Index: /content/articles/
without Allow: /content/links/, Allow: /content/articles/ and Allow: /cgi-bin/redirect.

Whether or not indexer directives that require crawling can overrule the crawler directive Disallow: is open for discussion. I vote for “not”.
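
If Disallow: keeps the last word, a robots.txt checker could at least warn about indexer directives that can never take effect. A rough Python sketch of such a lint follows. Index: and Follow: are the proposed statements discussed above, not existing syntax, and the function name shadowed_indexer_rules as well as the longest-prefix rule are assumptions of this example.

```python
def shadowed_indexer_rules(rules):
    """List Index:/Follow: statements that a Disallow: makes pointless.

    `rules` holds (directive, path) pairs in any order. Matching is by
    plain prefix, with the longest matching prefix winning and Allow
    beating Disallow on a tie.
    """
    crawl_rules = [(d.lower(), p) for d, p in rules
                   if d.lower() in ("allow", "disallow")]
    shadowed = []
    for directive, path in rules:
        if directive.lower() not in ("index", "follow"):
            continue
        # (prefix length, is_allow) tuples: max() picks the most specific rule.
        matches = [(len(p), d == "allow") for d, p in crawl_rules
                   if path.startswith(p)]
        if matches and not max(matches)[1]:
            shadowed.append((directive, path))
    return shadowed


rules = [
    ("Disallow", "/cgi-bin/"),
    ("Disallow", "/content/"),
    ("Follow", "/cgi-bin/redirect.php"),
    ("Follow", "/content/links/"),
    ("Index", "/content/articles/"),
]
for directive, path in shadowed_indexer_rules(rules):
    print(f"{directive}: {path} is blocked by a Disallow: and needs an Allow:")
```

With the sample file from above, all three indexer statements get reported.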

Applying REP tags on site-level would be great, but it doesn’t solve other problems like the need for directives on block and element level. Neither Google’s section targeting nor Yahoo’s robots-nocontent class name is an acceptable tool for instructing search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).

Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that’s doable, Webmasters will make use of such tools to improve their rankings, and not only to comply with the ever changing search engine policies that cost the Webmaster community billions of man hours each year.

I consider these robots.txt statements sexy:
Nofollow a.advertising, div#adblock, span.cross-links: /path
Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path
but that’s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.
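
Just to show that such statements would be cheap to parse, here is a tiny Python sketch. The syntax is purely the hypothetical one from the wish list above, and the function name parse_block_directive is made up; a path containing a colon would need a smarter split.

```python
def parse_block_directive(statement):
    """Split a wished-for block-level statement into its three parts.

    Hypothetical syntax: <Directive> <selector>[, <selector>...]: <path>
    Returns (directive, [selectors], path).
    """
    head, sep, path = statement.rpartition(":")
    if not sep:
        raise ValueError("missing ':' before the path")
    directive, _, selector_list = head.strip().partition(" ")
    selectors = [s.strip() for s in selector_list.split(",") if s.strip()]
    return directive, selectors, path.strip()


print(parse_block_directive(
    "Nofollow a.advertising, div#adblock, span.cross-links: /path"))
# ('Nofollow', ['a.advertising', 'div#adblock', 'span.cross-links'], '/path')
```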

Remember the rel-nofollow fiasco, where a REP tag used on HTML element level produced so much confusion and so many conflicts. Let’s learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it’s clear and unambiguous.

Priority settings

The REP’s command hierarchy must be well defined:

robots.txt
Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.
[Future block level directives]
Element level directives like rel-nofollow

That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt. As long as the Webmaster doesn’t revert the latter:

Priority-page-level: /path
Default behavior, directives in robots meta tags overrule robots.txt statements. Necessary to reset previous Priority-site-level: statements.

Priority-site-level: /path
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.

Priority-site-level All: /path
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The “All” parameter would even dofollow nofollow’ed links when the robots.txt lacks corresponding Nofollow: statements.
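
The proposed hierarchy can be sketched in a few lines of Python. This is only my reading of the three Priority statements, reduced to a single follow/nofollow decision; the function name effective_follow and the mode names "page", "site" and "site-all" are invented for the example.

```python
def effective_follow(robots_txt, page_level, element_level, priority="page"):
    """Resolve whether a link may be followed under the proposed hierarchy.

    Each level is True (follow), False (nofollow) or None (says nothing).
    `priority` mirrors the proposed statements:
      "page"     -> Priority-page-level (default): the most specific level wins
      "site"     -> Priority-site-level: robots.txt beats page-level tags
      "site-all" -> Priority-site-level All: everything but robots.txt is ignored
    """
    if priority == "site-all":
        order = [robots_txt]
    elif priority == "site":
        order = [element_level, robots_txt, page_level]
    else:
        order = [element_level, page_level, robots_txt]
    for value in order:
        if value is not None:
            return value
    return True  # no directive anywhere: "follow" is the implicit default


# rel-nofollow on a link overrules the page's "follow" meta tag
print(effective_follow(None, True, False))              # False
# with "All", the nofollow'ed link gets dofollowed again
print(effective_follow(None, True, False, "site-all"))  # True
```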

Noindex: /path

Follow outgoing links, archive the page, but don’t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.

[Currently Google doesn’t crawl Noindex’ed URLs and most probably those can’t accumulate PageRank, hence URLs in /path can’t distribute PageRank. That’s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there’s no explicit Nofollow:, nor a “nofollow” meta tag respectively X-Robots-Tag.]

Norank: /path

Prevents URLs from accumulating PageRank, anchor text, and whatever link juice.

Makes sense to refine Disallow: statements in company with Noindex: and Noodp:/Noydir:, or to prevent TOS/contact/privacy/… pages and alike from sucking PageRank (nofollow’ing TOS links and stuff like that to control PageRank flow is error-prone).

Nofollow: /path

The uber-link-condom. Don’t use outgoing links, not even internal links, for discovery crawling. Don’t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).

Noarchive: /path

Don’t make a cached copy of the resource available to searchers.

Nosnippet: /path

List the resource with linked page title on SERPs, but don’t create a text snippet, and don’t reprint the description meta tag.

[Why don’t we have a REP tag saying “use my description meta tag or nothing”?]

Nopreview: /path

Don’t create/link an HTML preview of this resource. That’s interesting for subscriptions sites and applies mostly to PDFs, Word documents, spread sheets, presentations, and other non-HTML resources. More information here.

Noodp: /path

Don’t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn’t supply its own title/meta description.

Noydir: /path

I’m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: “Don’t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.”

Unavailable_after [date]: /path

Deindex the resource the day after [date]. The parameter [date] is put in any date or date/time format, if it lacks a timezone then GMT is assumed.

[Google’s RFC 850 obsession is somewhat weird. There are many ways to put a timestamp other than “25-Aug-2007 15:00:00 EST”.]
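
As a sketch of the "any date format, GMT if no timezone" idea: the following Python function accepts ISO 8601 dates or timestamps (one format among the many that could be allowed) and computes the day of deindexing. The function name deindex_date is hypothetical, and taking "the day after" in GMT is an assumption.

```python
from datetime import datetime, timedelta, timezone


def deindex_date(stamp):
    """Return the (GMT) calendar date on which the resource gets deindexed.

    Assumptions: `stamp` is ISO 8601 (date or date/time, optional UTC
    offset), a missing timezone means GMT, and "the day after" is
    counted in GMT.
    """
    moment = datetime.fromisoformat(stamp)
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return (moment.astimezone(timezone.utc) + timedelta(days=1)).date()


print(deindex_date("2007-08-25"))                 # 2007-08-26
print(deindex_date("2007-08-25T15:00:00-05:00"))  # 2007-08-26
```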

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

In the search index remove the unwanted variable/value pair(s) from the URL’s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this “bare URL” redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.

Regardless of whether the variable name or the variable’s value matches the pattern, “Truncate-*” statements remove a complete argument from the query string, that is &variable=value. If after the (last) truncate operation the query string is empty, the query string delimiter “?” (question mark) must be removed too.
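
What an indexer would have to do with such statements fits into a few lines. A minimal Python sketch, assuming shell-style wildcards for the patterns (the proposal doesn't define a pattern syntax); the function name truncate_arguments is made up.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def truncate_arguments(url, variable_patterns=(), value_patterns=()):
    """Drop every query argument whose name or value matches a pattern.

    A matching argument is removed completely (variable and value); if no
    argument is left, the "?" delimiter disappears as well.
    """
    parts = urlsplit(url)
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        drop = (any(fnmatchcase(name, p) for p in variable_patterns)
                or any(fnmatchcase(value, p) for p in value_patterns))
        if not drop:
            kept.append((name, value))
    # With an empty query string, urlunsplit() omits the "?".
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(truncate_arguments("http://example.com/page.php?id=7&sessionid=abc",
                         variable_patterns=["sess*"]))
# http://example.com/page.php?id=7
print(truncate_arguments("http://example.com/page.php?sid=abc",
                         variable_patterns=["sid"]))
# http://example.com/page.php
```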

Order-arguments [charset]: /path

Sort the query strings of all dynamic URLs by variable name, then within the ordered variables by their values. Pick the first URL from each set of identical results as canonical URL. Transfer PageRank etcetera from all dupes to the canonical URL.
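
The sorting step itself is trivial, which makes it a nice candidate for a sketch. The Python function below ignores the [charset] parameter and sorts by plain string comparison, which is an assumption; the name order_arguments is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def order_arguments(url):
    """Return the URL with its query arguments sorted by name, then by value."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


dupes = [
    "http://example.com/shop.php?cat=3&item=17",
    "http://example.com/shop.php?item=17&cat=3",
]
# Both variants collapse into one canonical URL.
print({order_arguments(u) for u in dupes})
# {'http://example.com/shop.php?cat=3&item=17'}
```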

Lots of sites out there were developed by coders who are utterly challenged by all things SEO. Most Web developers don’t even know what URL canonicalization means. Those sites suffer from tons of URLs that all serve identical contents, just because the query string arguments are put in random order, usually inventing a new sequence for each script, function, or include file. Of course most search engines run highly sophisticated URL canonicalization routines to protect their indexes from too much duplicate content, but those algos can fail because every Web site is different.

I can totally resist suggesting a Canonical-uri /: /Default.asp statement that gathers all IIS default-document-URI maladies. Also, case issues shouldn’t get fixed with Case-insensitive-uris: / but by the clueless developers in Redmond.

Will all this come true?

Well, Google has silently started to support REP tags in robots.txt, it totally makes sense both for search engines as well as for Webmasters, and Joe Webmaster’s life would be way more comfortable having REP tags for robots.txt.

A better question would be “will search engines implement REP tags for robots.txt in a way that Webmasters can live with?”. Although Google launched the sitemaps protocol without significant help from the Webmaster community, I strongly feel that they desperately need our support with this move.

Currently it looks like they will fuck up the REP, respectively the robots.txt standard, hence go grab your AdWords rep and choke her/him until s/he promises to involve Larry, Sergey, Matt, Adam, John, and the whole Webmaster Support Team for the sake of common sense and the worldwide Webmaster community. Thank you!

21 comments Sebastian | Robots Meta Tags, Web development, X-Robots-Tag, MSN, Crawler Directives, XML-Sitemaps, Microformats, Google, Yahoo, robots.txt, Nofollow

Google to change the Robots Exclusion Protocol again

Posted on 30 November, 2007

Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), have evolved since 1994. Nowadays we have to deal with a conglomerate of non-binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey “the standard”, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives that, as a rule, differs from the other players’.

Google is the search engine putting the most efforts into Robots Exclusion Protocol (REP) evolvements. Their XML Sitemaps handling submissions instead of crawl restrictions changed the REP to a wider scope, the X-Robots-Tag brought us robots meta tags for non-HTML resources like PDF documents, images or video clips, and with Unavailable_after Google made a few clueless news sites happy. With the rel-nofollow microformat on the other hand, respectively its sneaky morphing from a spam fighting tool to its current shape, Google made nobody happy. Yahoo contributed the well meant but half-assed “robots-nocontent” class name, and of course “noydir” (it’s unlikely that any other engine will support those).

Now Google is working on new robots.txt syntax, and I am, politely put, not amused. Here is why I fear that Google is going to totally mess up the REP:

Google supports a “Noindex:” directive in robots.txt, which is treated as “Disallow:”¹⁾. Of course that’s an experiment, but if this behavior doesn’t change we’ll get a beast that is -with regard to the confusion it will produce- way more evil than the rel-nofollow fiasco.

A noindex-alias for disallow makes no sense, even when such syntax errors are out there.
Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It’s bad enough that most Webmasters still believe that “Googlebot ranks their stuff”. (Actually, in some cases it can make sense. For example “nofollow” in robots meta tags (or at least for Google in REL attributes too) is both a crawler instruction as well as an indexer directive.)
Noindex and disallow are completely different commands. The REP’s noindex directive means “crawl it, follow its links, but don’t list it on the SERPs”. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.

Standards should be clear and unambiguous. Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they’ve the power to fuck up the REP, but they also have “do no evil”.

Considering that Google is run by a bunch of smart engineers, I hope that they’ll do the right thing eventually. The right thing in this case is giving more power to REP evolvements, before questionable and selfish anti-search initiatives like ACAP ruin both the robots.txt consensus as well as the robots meta tag standard.

My idea of more power to REP evolvements is:

Sensible implementation of crawler/indexer-directives adapted from REP tags in robots.txt. Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully nopreview) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header’s X-Robots-Tag.
Implementation of block-level directives in robots.txt. Allowing Webmasters to apply crawler instructions like “noindex” or “nofollow” to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eliminate many points of failure.
Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.

The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the bigger engines) talk with Webmasters and then launch the necessary REP extensions. The other engines will follow sooner or later. The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers.

¹⁾ Noindex: is not a plain Disallow:, there’s an interesting difference. In Google’s experiment both directives block crawling, but Disallow: allows URL-indexing based on 3rd party information, and Disallow:‘ed URLs can accumulate PageRank from internal as well as external links. Noindex:‘ed URLs on the other hand will not appear on SERPs as URL-only listing or with an ODP title and snippet, and I’m quite sure that they will not gather PageRank nor other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google’s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that’s a neat idea, but its implementation is flawed because it doesn’t consider the implicit Follow: (that’s syntax Google doesn’t support in robots.txt). A better way to mark site areas which shall not gather PageRank without raping the REP would be a Norank: directive or so. Noindex: without a Nofollow: must not block crawling. Googlebot must fetch those URLs to follow their links.

8 comments Sebastian | Crawler Directives, Robots Meta Tags, X-Robots-Tag, XML-Sitemaps, robots.txt, Microformats, Google, Yahoo, Nofollow

Act out your sophisticated affiliate link paranoia

Posted on 13 November, 2007

My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google’s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:

Links purchased for advertising should be designated as such. This can be done in several ways, such as:
* Adding a rel=”nofollow” attribute to the <a> tag
* Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file

Also, Google doesn’t define paid links that clearly, so try this paid link definition instead before you read on. Here is my linking guide for the paranoid affiliate marketer.

Google recommends hiding any content provided by affiliate programs from their crawlers. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant’s data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program’s sales tools page count as duplicate content (snippet) in its worst occurrence.

Pasting code copied from a merchant’s site into a page’s or template’s HTML is not exactly a smart way to put ads. Those ads aren’t manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you’re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts.

There are tons of such scripts out there, so I don’t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so when you’ve a WordPress blog you can adapt them with ease.

Cover your ass with a linking policy

Googlers as well as hired guns do review Web sites for violations of Google’s guidelines, also competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) full disclosure of your linking attitude can help to pass a human review by search engine staff. By the way, having a policy for dofollowed blog comments is also a good idea.

Since crawler directives like link condoms are for search engines (only), and those pay attention to your source code and hints addressing search engines like robots.txt, you should leave a note there too. Look into the source of this page for an example.

Block crawlers from your propaganda scripts

Put all your stuff related to advertising (scripts, images, movies…) in a subdirectory and disallow search engine crawling in your /robots.txt file:
User-agent: *
Disallow: /propaganda/
Of course you’ll use an innocuous name like “gnisitrevda” for this folder, which lacks a default document and can’t get browsed because you’ve an
Options -Indexes
statement in your .htaccess file. (Watch out, Google knows what “gnisitrevda” means, so be creative or cryptic.)

Crawlers sent out by major search engines do respect robots.txt, hence it’s guaranteed that regular spiders don’t fetch anything from this folder. As long as you don’t cheat too much, you’re not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn’t protect you from surfing search engine staff, but I don’t tell you things you’d better hide from Matt’s gang.

Detect search engine crawlers

Basically there are three common methods to detect requests by search engine crawlers.

1) Testing the user agent name (HTTP_USER_AGENT) for strings like “Googlebot”, “Slurp”, “MSNbot” or so which identify crawlers. That’s easy to spoof, for example PrefBar for FireFox lets you choose from a list of user agents.
2) Checking the user agent name, and only when it indicates a crawler, verifying the requestor’s IP address with a reverse lookup, respectively against a cache of verified crawler IP addresses and host names.
3) Maintaining a list of all search engine crawler IP addresses known to man, checking the requestor’s IP (REMOTE_ADDR) against this list. (That alone isn’t bullet-proof, but I’m not going to write a tutorial on industrial-strength ~~cloaking~~ IP delivery, I leave that to the real experts.)

For our purposes we use methods 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won’t activate your account without testing your links. When crawlers try to follow affiliate links on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have impact on the merchants’ search engine rankings, and that’s evil in Google’s eyes.

We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name, if the user agent name identifies a major search engine’s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor’s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does a primitive caching in a flat file, so that once a crawler was verified on its very first content request, it can be detected from this cache to avoid pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn’t match a crawler name.

// file system path to crawler IP log, scripts etc.,
// without trailing slash:
$includePath = $_SERVER["DOCUMENT_ROOT"] . "/propaganda";
// edit "propaganda" and CHMOD 777 the directory !
// file names:
$crawlerIps = $includePath ."/crawler-ip-addresses.txt";
// misc. stuff:
$timestamp = date('Y-m-d H:i:s');
$ipAddy = $_SERVER["REMOTE_ADDR"];
// referrer and user agent are optional request headers
$referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
$userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$isCrawler = FALSE;
$crawlerServer = "";
$delimiter = "|";
$idString = "";
if (empty($includePath)) {
    $includePath = $_SERVER["DOCUMENT_ROOT"] . "/propaganda"; // CHMOD 777
}

// Write a file to disk
if (!function_exists("writeLocalFile")) {
    function writeLocalFile ($file, $content) {
        if (!is_writable($file)) {
            $lOk = @chmod ( $file, 0777 );
        }
        // file_put_contents() not avail in PHP 4.3x
        $fp = @fopen("$file","w+");
        if ($fp) {
            $lOk = @fwrite($fp, $content, strlen($content));
            @fclose($fp);
            // make sure file may get overwritten or removed later on
            $lOk = @chmod ( $file, 0777 );
            return TRUE;
        } // endif $fp
        return FALSE;
    } // end function writeLocalFile
}

// Method 1): check the user agent name, remember the expected host name
if (!function_exists("checkCrawlerUA")) {
    function checkCrawlerUA () {
        GLOBAL $userAgent;
        GLOBAL $crawlerServer;
        $crawlerServer = "";
        $crawlers = array("Googlebot","Mediapartners","Slurp","MSNbot","Ask","Teoma");
        foreach ($crawlers as $crawler) {
            if (stristr($userAgent,$crawler)) {
                if (stristr($crawler,"Googlebot") || stristr($crawler,"Mediapartners")) {
                    $crawlerServer = ".googlebot.com";
                } // Google
                if (stristr($crawler,"Slurp")) {
                    $crawlerServer = ".crawl.yahoo.net";
                } // Yahoo
                if (stristr($crawler,"MSNbot")) {
                    $crawlerServer = ".search.live.com";
                } // MSN/Live
                if (stristr($crawler,"Ask") || stristr($crawler,"Teoma")) {
                    $crawlerServer = ".ask.com";
                } // Ask
            }
        } // foreach crawlers
        if (!empty($crawlerServer)) return TRUE;
        return FALSE;
    } // end function checkCrawlerUA
}

// Method 2): verify the IP with a reverse lookup plus forward confirmation
if (!function_exists("checkCrawlerIP")) {
    function checkCrawlerIP ($idString) {
        GLOBAL $ipAddy;
        GLOBAL $crawlerIps;
        GLOBAL $delimiter;
        GLOBAL $timestamp;
        GLOBAL $userAgent;
        GLOBAL $crawlerServer;
        $isCrawler = checkCrawlerUA();
        if ($isCrawler === FALSE) return FALSE;
        if (empty($crawlerServer)) return FALSE;
        //
        // DEBUG: $crawlerServer = ".national-net.com";
        // Use your ISPs host name for testing with a spoofed user agent name
        //
        // already verified? then skip the slow DNS lookups
        $crawlerIpsContent = @file_get_contents($crawlerIps);
        if (!empty($crawlerIpsContent)) {
            if (stristr($crawlerIpsContent, "\n$ipAddy$delimiter")) {
                return TRUE;
            }
        }
        $crawlerHost = @gethostbyaddr($ipAddy);
        // without a reverse DNS entry gethostbyaddr() returns the IP itself
        if (empty($crawlerHost) || "$crawlerHost" == "$ipAddy") {
            return FALSE;
        }
        // the host name must END with the expected domain, a substring test
        // would accept hosts like crawl.googlebot.com.example.org
        if (strtolower(substr($crawlerHost, -strlen($crawlerServer))) != $crawlerServer) {
            return FALSE;
        }
        // forward lookup must lead back to the requestor's IP
        $ipAddyRev = @gethostbyname($crawlerHost);
        if ("$ipAddyRev" != "$ipAddy") {
            return FALSE;
        }
        $crawlerIpsContent .= "\n" .$ipAddy .$delimiter .$timestamp .$delimiter .$crawlerHost .$delimiter .$idString .$delimiter .$userAgent .$delimiter;
        $lOk = writeLocalFile ($crawlerIps, $crawlerIpsContent);
        return TRUE;
    } // end function checkCrawlerIP
}
Grab and implement the PHP source, then you can code statements like
$isSpider = checkCrawlerUA ();
...
if ($isSpider) { $relAttribute = " rel=\"nofollow\" "; }
...
$affLink = "<a href=\"$affUrl\" $relAttribute>call for action</a>";
or
$isSpider = checkCrawlerIP ($sponsorUrl);
...
if ($isSpider) {
    // don't redirect to the sponsor, return a 403 or 410 instead
}
More on that later.

Don’t deliver your advertising to search engine crawlers

It’s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense’s script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can’t guarantee indexing, because Google doesn’t index shitloads of templated pages which hide duplicate content as well as ads from crawling, without carrying a single piece of somewhat compelling content.

Here is how you could output a totally uncrawlable banner ad:
...
$isSpider = checkCrawlerIP ($PHP_SELF);
...
print "<div class=\"css-class-sidebar robots-nocontent\">";
// output RSS buttons or so
if (!$isSpider) {
    print "<script type=\"text/javascript\" src=\"http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner\"></script>";
    ...
}
...
print "</div>\n";
...
Let’s look at the code above. First we detect crawlers “without doubt” (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-’.crawl.yahoo.net’ host but another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I’ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler-UA).

Currently the robots-nocontent class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn’t conflict with class names used with your CSS, because each X/HTML element can have an unlimited list of space delimited class names. Like Google’s section targeting that’s a crappy crawler directive, though. However, it doesn’t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), “buy now”/”view basket” links or references to TOS pages and alike, templated text like terms of delivery (but not the street address provided for local search) … and of course ads.

Ads aren’t output when a crawler requests a page. Of course that’s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren’t relevant for indexing, that’s not deceitful cloaking in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster’s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers.

Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren’t penalized. Since Web robots can’t pull out the plastic, there’s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (Example from Google: “Allow search bots to crawl your sites without session IDs or arguments that track their path through the site” is a call for search engine friendly URL cloaking.)
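The “search engine friendly URL cloaking” mentioned above could look roughly like the sketch below. It’s not taken from any engine’s documentation: the function name stripTrackingParams and the argument names passed to it are made up for illustration. It removes the listed tracking arguments from a URL’s query string and drops a fragment, if there is one.

```php
<?php
// Hypothetical helper: drop tracking/session arguments from a URL.
// Which argument names count as "tracking" is your call; none are built in.
function stripTrackingParams(string $url, array $trackingNames): string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['query'])) {
        return $url; // malformed URL or nothing to strip
    }
    parse_str($parts['query'], $args);
    foreach ($trackingNames as $name) {
        unset($args[$name]);
    }
    // Rebuild the URL from its components (a fragment is intentionally dropped).
    $clean = '';
    if (isset($parts['scheme'])) { $clean .= $parts['scheme'] . '://'; }
    if (isset($parts['host']))   { $clean .= $parts['host']; }
    if (isset($parts['port']))   { $clean .= ':' . $parts['port']; }
    $clean .= $parts['path'] ?? '';
    if (!empty($args)) {
        $clean .= '?' . http_build_query($args);
    }
    return $clean;
}
```

When a verified crawler requests a URL that differs from its cleaned version, a script could answer with a 301 redirect to the clean URL, while human visitors keep their tracking arguments.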

Is hiding ads from crawlers “safe with Google” or not?

Cloaking ads away is a double edged sword from a search engine’s perspective. Interpreted way too strictly, that’s against the cloaking rule which states “don’t show crawlers other content than humans”, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google’s section targeting or Yahoo’s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don’t count them as on-the-page contents.

Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don’t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify MFA pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings.

Since in theory the contents of JavaScript ads aren’t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can’t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can’t argue seriously that hiding such tiny signals -which they use for the sole purposes of possible downranks- is against their guidelines.

For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn’t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.

Dear search engines, if it does not count, then you cannot demand to keep it crawlable. Well, a few super mega white hat trolls might disagree, and depending on the implementation on individual sites maybe hiding ads isn’t totally riskless in any case, so decide yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don’t try to hide the fact that I run ads on this blog.

Usually I don’t argue with fair vs. unfair, because we talk about ~~war~~ business here, which means that anything goes. However, Google does everything to talk the whole Internet into ~~obfuscating~~ disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. (Of course that’s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won’t care about bad press. So please don’t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)

Outputting ads with JavaScript, preferably in iFrames

Delivering adverts with JavaScript does not mean that one can’t use server sided scripting to adjust them dynamically. With content management systems it’s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn’t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval’ing the PHP code).

Let’s see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:

And here is the JS call of the PHP script:
<script type="text/javascript" src="http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner"></script>

The PHP script /propaganda/output.js.php evaluates the query string to pull the requested ad’s components. In case it’s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor’s location and whatnot, but that’s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad’s links. Once the HTML code is ready, it outputs a JavaScript statement:
document.write('<a href="http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&adServed=banner" title="DOWNLOAD THE BOOK ON SEO!"><img src="http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif" width="468" height="60" border="0" alt="The only current book on SEO" title="The only current book on SEO" /></a>');
which the browser executes within the script tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag.
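The last step of such a script, turning the ad’s HTML into a JavaScript statement, might be sketched like this. It’s not the actual output.js.php: the function name buildAdScript and the array keys (href, img, width, height, alt) are assumptions. Unlike the hand-rolled statement above, the sketch lets json_encode() produce the JavaScript string literal, so quotes in the markup can’t break the statement.

```php
<?php
// Hypothetical sketch of the final step in a script like output.js.php.
// The $ad keys (href, img, width, height, alt) are assumed names.
function buildAdScript(array $ad, bool $isCrawler): string
{
    // Crawlers get the link condom, browsers get a clean link.
    $rel = $isCrawler ? ' rel="nofollow"' : '';
    $html = sprintf(
        '<a href="%s"%s><img src="%s" width="%d" height="%d" alt="%s" /></a>',
        htmlspecialchars($ad['href'], ENT_QUOTES),
        $rel,
        htmlspecialchars($ad['img'], ENT_QUOTES),
        $ad['width'],
        $ad['height'],
        htmlspecialchars($ad['alt'], ENT_QUOTES)
    );
    // json_encode() yields a valid JavaScript string literal,
    // escaping the double quotes inside the markup.
    return 'document.write(' . json_encode($html, JSON_UNESCAPED_SLASHES) . ');';
}
```

The calling script would send a JavaScript content type header and print the returned statement, with $isCrawler coming from a check like checkCrawlerUA().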

Matt Cutts said that JavaScript links don’t prevent Googlebot from crawling, but that those links don’t count for rankings (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can’t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it’s fetchable by crawlers, so I wouldn’t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That’s why I use rel-nofollow -on crawler requests- even in JS ads.

Change your user agent name to Googlebot or so, install Matt’s show nofollow hack or something similar, and you’ll see that the affiliate-URL gets nofollow’ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can’t offend my readers with nofollow’ed links (although I don’t care about crawler spoofing, actually that’s a good procedure to let advertisers check out my linking attitude).

We look at the affiliate URL from the code above later on, first let’s discuss other ways to make ads more search engine friendly. Search engines don’t count pages displayed in iFrames as on-page contents, especially not when the iFrame’s content is hosted on another domain. Here is an example straight from the horse’s mouth:
<iframe name="google_ads_frame" src="http://pagead2.googlesyndication.com/pagead/ads?very-long-and-ugly-query-string" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" frameborder="0" height="90" scrolling="no" width="728"></iframe>
In a noframes tag we could put a static ad for surfers using browsers which don’t support frames/iFrames.

If for some reason you don’t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to everybody, and only browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael’s Full Text/HTML Obfuscator Tool (hat tip to Donna).
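To get an idea of what encoded ads look like, here is a deliberately simple sketch. It is not the algorithm of the obfuscator tool mentioned above, just plain percent-encoding of every byte, and the function name obfuscateForScript is made up. The browser decodes the gibberish with JavaScript’s decodeURIComponent(), which expects the HTML to be valid UTF-8.

```php
<?php
// Hypothetical sketch: percent-encode every byte of an HTML snippet and
// wrap it in a JavaScript statement that decodes and writes it.
// This is weak obfuscation; a bot that executes JavaScript still sees the ad.
function obfuscateForScript(string $html): string
{
    if ($html === '') {
        return ''; // nothing to output
    }
    $encoded = '';
    foreach (str_split($html) as $byte) {
        $encoded .= '%' . strtoupper(bin2hex($byte));
    }
    return 'document.write(decodeURIComponent("' . $encoded . '"));';
}
```

Anybody who reads the page source without running scripts sees nothing but a long run of percent-encoded bytes.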

Always redirect to affiliate URLs

There’s absolutely no point in using ugly affiliate URLs on your pages. Actually, that’s the last thing you want to do for various reasons.

For example, affiliate URLs as well as source codes can change, and you don’t want to edit tons of pages if that happens.
When an affiliate program doesn’t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you’d replace outdated ads completely with one mouse click or so.
Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own -often confusing- layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don’t track your outgoing traffic, some sponsors will cheat and you can’t prove your gut feelings.
Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don’t smell like affiliate URLs.
…

When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really manage your affiliate links. There are many more good reasons for utilizing ad-servers, for example smart search engines which might think that your advertising is overwhelming.
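What “managing” affiliate links in one place can mean is sketched below: a lookup table maps ad names to destinations, and an expired ad falls back to an alternative. The table layout (url, expires, fallback) and the function name resolveAdTarget are assumptions for illustration, not the data model behind my redirect script.

```php
<?php
// Hypothetical ad table lookup. The keys 'url', 'expires' (Unix timestamp)
// and 'fallback' (name of another ad) are assumptions for illustration.
function resolveAdTarget(array $ads, string $adName, int $now): ?string
{
    $seen = [];
    while (isset($ads[$adName]) && !isset($seen[$adName])) {
        $seen[$adName] = true; // guard against circular fallbacks
        $ad = $ads[$adName];
        $expired = isset($ad['expires']) && $ad['expires'] <= $now;
        if (!$expired) {
            return $ad['url'];
        }
        if (!isset($ad['fallback'])) {
            return null; // expired, and no alternative configured
        }
        $adName = $ad['fallback'];
    }
    return null; // unknown ad name or circular fallbacks
}
```

With such a table, replacing a dead affiliate program means editing one entry instead of tons of pages. What the script does when the function returns null (show a house ad, redirect to the home page) is up to you.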

Affiliate links provide great footprints. Unique URL parts or query string variable names gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The values identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index or delivering minus points to the query engine. Also, that doesn’t mean that affiliate URLs are the one and only trackable footprint Google relies on. But that’s one trackable footprint you can avoid to some degree.

If the redirect-script’s location is on the same server (in fact it’s not thanks to symlinks) and not named “adserver” or so, chances are that a heuristic check won’t identify the link’s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren’t crawlable with parameters, or similar “legit” stuff. However, you can’t fool smart algos forever, but if you’ve a good reason to hide ads every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine’s point of view, and users might agree on this).

Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because it doesn’t benefit you (usually). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant’s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect, if they by accident found an affiliate link without link condom.

The behavior of an adserver URL masking an affiliate link

Let’s look at the redirect-script’s URL from my code example above:
/propaganda/router.php?adName=seobook&adServed=banner
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert.

Now that we’ve covered the statistical requirements, router.php calls the checkCrawlerIP() function setting $isSpider to TRUE only when both the user agent as well as the host name of the requestor’s IP address identify a search engine crawler, and a forward DNS lookup of that host name leads back to the requestor’s IP addy.
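The DNS part of that verification, stripped of the caching in a local file, can be sketched as follows. This is not the checkCrawlerIP() source: the function name isVerifiedCrawlerHost is made up, the two resolver parameters exist only to make the logic testable without touching DNS, and the host name must end with the expected suffix, whereas checkCrawlerIP() settles for a substring match with stristr().

```php
<?php
// Sketch of a forward-confirmed reverse DNS check. The resolver callables
// are an assumption for testability; they default to PHP's own
// gethostbyaddr() and gethostbyname().
function isVerifiedCrawlerHost(
    string $ip,
    string $hostSuffix,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';

    $host = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP when the lookup fails,
    // and FALSE on malformed input.
    if ($host === false || $host === $ip) {
        return false;
    }
    // The host name must end with the engine's crawler domain.
    $host = strtolower($host);
    $suffix = strtolower($hostSuffix);
    if (substr($host, -strlen($suffix)) !== $suffix) {
        return false;
    }
    // The forward lookup of the host name must lead back to the same IP.
    return $forward($host) === $ip;
}
```

A call like isVerifiedCrawlerHost($ipAddy, ".googlebot.com") runs both lookups against real DNS, so it makes sense to cache verified IPs like checkCrawlerIP() does.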

If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor’s landing page:
$sponsorUrl = "http://www.seobook.com/262.html";
$requestProtocol = $_SERVER["SERVER_PROTOCOL"];
$protocolArr = explode("/",$requestProtocol);
$protocolName = trim($protocolArr[0]);
$protocolVersion = trim($protocolArr[1]);
if (stristr($protocolName,"HTTP") && strtolower($protocolVersion) > "1.0" ) {
    $httpStatusCode = 307;
}
else {
    $httpStatusCode = 302;
}
$httpStatusLine = "$requestProtocol $httpStatusCode Temporary Redirect";
@header($httpStatusLine, TRUE, $httpStatusCode);
@header("Location: $sponsorUrl");
exit;
A 307 redirect avoids caching issues, because a 307 response must not be cached by the user agent unless a Cache-Control or Expires header explicitly allows it. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. By the book the same caching rule applies to 302 redirects, but some user agents might cache the locations provided by 302 redirects anyway, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can’t use the 301 response code, because 301 redirects are cacheable by default. (More information on HTTP redirects.)

If the requestor is a major search engine’s crawler, we perform the most brutal bounce back known to man:
if ($isSpider) {
    @header("HTTP/1.1 403 Sorry Crawlers Not Allowed", TRUE, 403);
    @header("X-Robots-Tag: nofollow,noindex,noarchive");
    exit;
}
The 403 response code translates to “kiss my ass and get the fuck outta here”. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn’t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine “forget this URL, don’t request it again”. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code prevent you from URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.

If you don’t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they otherwise would count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don’t need to live in fear of penalties.

Recap

Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.

Don’t deliver different (editorial) contents to users and crawlers, but also don’t serve ads to crawlers. They just don’t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards able to understand Ajax, respectively authorized to fill out and submit Web forms.

Your ads look plain ugly with dotted borders in firebrick, hence don’t apply rel="nofollow" to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.

No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers.

Do not allow search engine crawlers to follow your affiliate links, paid links, nor other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google’s job is not exactly funny.

I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you’re not paranoid, that does not mean Google will not chase you down. You really don’t need to be a so called black hat to displease Google. Not knowing or not understanding Google’s 12 commandments doesn’t prevent you from being spanked for sins you’ve never heard of. If you’re keen on Google’s nicely targeted traffic, better play by Google’s rules, leastwise on crawler requests.

Feel free to contribute your tips and tricks in the comments.

11 comments Sebastian | Search Quality, Risky Linkage, Web development, X-Robots-Tag, Redirects, Paid Links, Crawler Directives, SEO, Google, robots.txt, E-Commerce, Cloaking, Nofollow

A pragmatic defence against Google’s anti paid links campaign

Posted on 26 October, 2007

Google’s recent shot across the bows of a gazillion sites handling paid links, advertising, or internal cross links not compliant with Google’s imagination of a natural link is a call for action. Google’s message is clear: “condomize your commercial links or suffer” (from deducted toolbar PageRank, links without the ability to pass real PageRank and relevancy signals, or perhaps even penalties).

Of course that’s somewhat evil, because applying nofollow values to all sorts of links is not exactly a natural thing to do; visitors don’t care about invisible link attributes and sometimes they’re even pissed when they get redirected to a URL not displayed in their status bar. Also, this requirement forces Webmasters to invest enormous efforts in code maintenance for the sole purpose of satisfying search engines. The argument “if Google doesn’t like these links, then they can discount them in their system, without bothering us” has its merits, but unfortunately that’s not the way Google’s cookie crumbles for various reasons. Hence let’s develop a pragmatic procedure to handle those links.

The problem

Google thinks that uncondomized paid links as well as commercial links to sponsors or affiliated entities aren’t natural, because the terms “sponsor|pay for review|advertising|my other site|sign-up|…” and “editorial vote” are not compatible in the sense of Google’s guidelines. This view of the Web’s linkage is pretty black vs. white.

Either you link out because a sponsor bought ads, or you don’t sell ads and link out for free because you honestly think your visitors will like a page. Links to sponsors without condom are black, links to sites you like and which you don’t label “sponsor” are white.

There’s nothing in between; gray areas like links to hand picked sponsors on a page with a gazillion links count as black. Google doesn’t care whether or not your clean links actually pass a reasonable amount of PageRank to link destinations which buy ad space too; the sole possibility that those links could influence search results is enough to qualify you as sort of a link seller.

The same goes for paid reviews on blogs and whatnot, see for example Andy’s problem with his honest reviews which Google classifies as paid links, and of course all sorts of traffic deals, affiliate links, banner ads and stuff like that.

You don’t even need to label a clean link as advert or sponsored. If the link destination matches a domain in Google’s database of on-line advertisers, link buyers, e-commerce sites / merchants etcetera, or Google figures out that you link too much to affiliated sites or other sites you own or control, then your toolbar PageRank is toast and most probably your outgoing links will be penalized. Possibly these penalties have an impact on your internal links too, which results in less PageRank landing on subsidiary pages. Less PageRank gathered by your landing pages means less crawling, less ranking, fewer SERP referrers, less revenue.

The solution

You’re absolutely right when you say that such search engine nitpicking should not force you to throw nofollow crap on your links like confetti. From your and my point of view condomizing links is wrong, but sometimes it’s better to pragmatically comply with such policies in order to stay in the game.

Although uncrawlable redirect scripts have advantages in some cases, the simplest procedure to condomize a link is the rel-nofollow microformat. Here is an example of a googlified affiliate link:
<a href="http://sponsor.com/?affID=1" rel="nofollow">Sponsor</a>

Why serve your visitors search engine crawler directives?

Complying with Google’s laws does not mean that you must deliver crawler directives like rel="nofollow" to your visitors. Since Google is concerned about search engine rankings influenced by uncondomized links with commercial intent, serving crawler directives to crawlers and clean links to users is perfectly in line with Google’s goals. Actually, initiatives like the X-Robots-Tag make clear that hiding crawler directives from users is fine with Google. To underline that, here is a quote from Matt Cutts:

[…] If you want to sell a link, you should at least provide machine-readable disclosure for paid links by making your link in a way that doesn’t affect search engines. […]

The other best practice I’d advise is to provide human readable disclosure that a link/review/article is paid. You could put a badge on your site to disclose that some links, posts, or reviews are paid, but including the disclosure on a per-post level would better. Even something as simple as “This is a paid review” fulfills the human-readable aspect of disclosing a paid article. […]

Google’s quality guidelines are more concerned with the machine-readable aspect of disclosing paid links/posts […]

To make sure that you’re in good shape, go with both human-readable disclosure and machine-readable disclosure, using any of the methods [uncrawlable redirects, rel-nofollow] I mentioned above.
[emphasis mine]

Since Google devalues paid links anyway, search engine friendly cloaking of rel-nofollow for Googlebot is a non-issue with advertisers, as long as this fact is disclosed. I bet most link buyers look at the magic green pixels anyway, but that’s their problem.

How to cloak rel-nofollow for search engine crawlers

I’ll discuss a PHP/Apache example, but this method is adaptable to other server sided scripting languages like ASP or so with ease. If you’ve a static site and PHP is available on your (*ix) host, you need to tell Apache that you’re using PHP in .html (.htm) files. Put this statement in your root’s .htaccess file: AddType application/x-httpd-php .html .htm

Next create a plain text file, insert the code below, and upload it as “funct_nofollow.php” or so to your server’s root directory (or a subdirectory, but then you need to change some code below).
<?php
function makeRelAttribute ($linkClass) {
    // optional 2nd input parameter: $relValue
    $relValue = "";
    $numargs = func_num_args();
    if ($numargs >= 2) {
        $relValue = func_get_arg(1) ." ";
    }
    // SERP referrer detection (the referrer header can be missing)
    $referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
    $refUrl = parse_url($referrer);
    $refHost = isset($refUrl["host"]) ? $refUrl["host"] : "";
    $isSerpReferrer = FALSE;
    if (stristr($refHost, "google.") || stristr($refHost, "yahoo."))
        $isSerpReferrer = TRUE;
    // crawler detection by user agent name
    $userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
    $isCrawler = FALSE;
    if (stristr($userAgent, "Googlebot") || stristr($userAgent, "Slurp"))
        $isCrawler = TRUE;
    if ($isCrawler /*|| $isSerpReferrer*/ ) {
        if ("$linkClass" == "ad") $relValue .= "advertising nofollow";
        if ("$linkClass" == "paid") $relValue .= "sponsored nofollow";
        if ("$linkClass" == "own") $relValue .= "affiliated nofollow";
        if ("$linkClass" == "vote") $relValue .= "editorial dofollow";
    }
    // no REL attribute at all when there is nothing to say
    if (trim($relValue) == "") return "";
    return " rel=\"" .trim($relValue) ."\" ";
} // end function makeRelAttribute
?>

Next put the code below in a PHP file you’ve included in all scripts, for example header.php. If you’ve static pages, then insert the code at the very top.
<?php @include($_SERVER["DOCUMENT_ROOT"] ."/funct_nofollow.php"); ?>
Do not paste the function makeRelAttribute itself! If you spread code this way you’ve to edit tons of files when you need to change the functionality later on.

Now you can use the function makeRelAttribute($linkClass,$relValue) within the scripts or HTML pages. The function has an input parameter $linkClass and knows the (self-explanatory) values “ad”, “paid”, “own” and “vote”. The second (optional) input parameter is a value for the A element’s REL attribute itself. If you provide it, it gets combined with the crawler directives, or, if makeRelAttribute doesn’t detect a spider, it creates a REL attribute with this value alone. Examples below. You can add more user agents, or serve rel-nofollow to visitors coming from SERPs by enabling the || $isSerpReferrer condition (remove the comment markers /* and */ around it).

When you code a hyperlink, just add the function to the A tag. Here is a PHP example: print "<a href=\"http://google.com/\"" .makeRelAttribute("ad") .">Google</a>";
will output
<a href="http://google.com/" rel="advertising nofollow" >Google</a>
when the user agent is Googlebot, and
<a href="http://google.com/">Google</a>
to a browser.

If you can’t write nice PHP code, for example because you’ve to follow crappy guidelines and worst practices with a WordPress blog, then you can mix HTML and PHP tags: <a href="http://search.yahoo.com/"<?php print makeRelAttribute("paid"); ?>>Yahoo</a>

Please note that this method is not safe with search engines or unfriendly competitors when you want to cloak for other purposes. Also, the link condoms are served to crawlers only, that means search engine staff reviewing your site with a non-crawler user agent name won’t spot the nofollow’ed links unless they check the engine’s cached page copy. An HTML comment in HEAD like “This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.” as well as a similar comment line in robots.txt would certainly help to pass reviews by humans.

A Google-friendly way to handle paid links, affiliate links, and cross linking

Load this page with different user agents and referrers. You can do this for example with a Firefox extension like PrefBar. For testing purposes you can use these user agent names:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
and these SERP referrer URLs:
http://google.com/search?q=viagra
http://search.yahoo.com/search?p=viagra&ei=utf-8&iscqry=&fr=sfp
Just enter these values in PrefBar’s user agent or referrer spoofing options (click “Customize” on the toolbar, select “User Agent” / “Referrerspoof”, click “Edit”, add a new item, label it, then insert the strings above). Here is the code above in action:

Referrer URL:	http://sebastians-pamphlets.com/
User Agent Name:	Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)
Ad makeRelAttribute("ad"):	Google
Paid makeRelAttribute("paid"):	Yahoo
Own makeRelAttribute("own"):	Sebastian’s Pamphlets
Vote makeRelAttribute("vote"):	The Link Condom
External makeRelAttribute("", "external"):	W3C `rel="external"`
Without parameters makeRelAttribute(""):	Sphinn

When you change your browser’s user agent to a crawler name, or fake a SERP referrer, the REL value will appear in the right column.

When you’ve developed a better solution, or when you’ve a nofollow-cloaking tutorial for other programming languages or platforms, please let me know in the comments. Thanks in advance!

20 comments Sebastian | Paid Links, Risky Linkage, Search Quality, Crawler Directives, Cloaking, Google, SEO, Nofollow

Google’s 5 sure-fire steps to safer indexing

Posted on 15 August, 2007

Are you wondering why Gray Hat Search Engine News (GHN) is so quiet recently?

One reason may be that I’ve borrowed their Google savvy spy. I’ve sent him to Mountain View again to learn more about Google’s nofollow strategy.

He returned with a copy of Google’s recently revised mission statement, discovered in the wastebasket of a conference room near office 211 in building 43. Read the shocking and unbelievable head note printed in bold letters:

Google’s mission is to condomize the world’s information and make it universally uncrawlable and useless.

Read and reread it, then some weird facts begin to make sense. Now you’ll understand why:

The rel-nofollow plague was designed to maximize collateral damage by devaluing all hyperlinked votes by honest users of nearly all platforms you’re using everyday, for example Twitter, Wikipedia, corporate blogs, GoogleGroups … ostensibly to nullify the efforts of a few spammers.
Nobody bothers to comment on your nofollow’ed blog.
Google invented the supplemental index (to store scraped resources suffering from too many condomized links) and why it grows faster than the main index.
Google installed the Bigdaddy infrastructure (to prevent Ms. Googlebot from following nofollow’ed links).
Google switched to BlitzCrawling (to list timely contents for a moment whilst fat resources from large archives get buried in the supplemental index). RIP deep crawler and freshbot.

Seriously, the deep crawler isn’t defunct, it’s called supplemental crawler nowadays, and the freshbot is still alive as Feedfetcher.

Disclaimer: All these hard facts were gathered by torturing sources close to Google, robbery and other unfair methods. If anyone bothers to debunk all that as a bad joke, one question still remains: Why does Google do next to nothing to stop the nofollow plague? I mean, ongoing mass abuse of rel-nofollow is obviously counterproductive with regard to their real mission.

3 comments Sebastian | Fun, Google, Nofollow

Just another victim of the nofollow plague

Posted on 13 August, 2007

It’s evil, it sucks even more than the crappy tinyurl nonsense obfuscating link destinations, nobody outside some SEO cliques really cares about or noticed it, I’m not sure it’s newsworthy because it’s perfectly in line with rel-nofollow semantics, but it annoys me and others so here is the news of late last week: Twitter drank the nofollow kool-aid.

Folks, remove Twitter from your list of PageRank sources and drop links for fun and traffic only. I wonder whether particular people change their linking behavior on Twitter or not. I won’t.

Following Nofollow’s questionable tradition of maximizing collateral damage, Twitter nofollows even links leading to Matt’s mom’s charity site. More ~~PageRank~~ power to you, Betty Cutts! Your son deserves a bold nofollow for inventing the beast.

Twitter should hire an SEO consultant because they totally fuck up on search engine friendliness.

2 comments Sebastian | Social Web, Twitter, Nofollow
