Archived posts from the 'Anchor Text' Category

Vaporize yourself before Google burns your linking power

PIC-1: Google PageRank(tm) 2007
I couldn’t care less about PageRank™ sculpting, because a well thought out link architecture does the job with all search engines, not just Google. That’s where Google is right on the money.

They own PageRank™, hence they can burn, evaporate, nullify, and even divide by zero or multiply by -1 as much PageRank™ as they like; of course as long as they rank my stuff nicely above my competitors.

Picture 1 shows Google’s PageRank™ factory as of 2007 or so. Actually, it’s a pretty simplified model, but since they’ve changed the PageRank™ algo anyway, you don’t need to bother with all the geeky details.

As a side note: you might ask why I don’t link to Matt Cutts and Danny Sullivan discussing the whole mess on their blogs? Well, probably Matt can’t afford my advertising rates, and the whole SEO industry has linked to Danny anyway. If you’re nosy, check out my source code to learn more about state of the art linkage, fully compliant with Google’s newest guidelines for advanced SEOs (summary: “Don’t trust underlined blue text on Web pages any longer!”).

PIC-2: Google PageRank(tm) 2009
What really matters is picture 2, revealing Google’s new PageRank™ facilities, silently launched in 2008. Again, geeky details are of minor interest. If you really want to know everything, then search for [operation bendover] at !Yahoo (it’s still top secret, and therefore not searchable at Google).

Unfortunately, advanced SEO folks (whatever that means, I use this term just because it seems to be an essential property assigned to the participants of the current PageRank™ uprising discussion) always try to confuse you with overcomplicated graphics and formulas when it comes to PageRank™. Instead, I ask you to focus on the (important) hard core stuff. So go grab a magnifier, and work out the differences:

  • PageRank™ 2009 in comparison to PageRank™ 2007 comes with a pipeline supplying unlimited fuel. Also, it seems they’ve implemented the green new deal, switching from gas to natural gas. That means they can vaporize way more link juice than ever before.
  • PageRank™ 2009 produces more steam, and the clouds look slightly different. Whilst PageRank™ 2007 ignored nofollow crap as well as links put with client sided scripting, PageRank™ 2009 evaporates not only juice covered with link condoms, but also tons of other permutations of the standard A element.
  • To compensate for the huge overall loss of PageRank™ caused by those changes, Google has decided to pass link juice from condomized links to their target URIs hidden from Googlebot with JavaScript. Of course Google formerly recommended the use of JavaScript-links to protect webmasters from penalties for so-called “questionable” outgoing links. Just as they’ve not only invented rel-nofollow, but heavily recommended the use of this microformat with all links disliked by Google, and now they take that back as if a gazillion links on the Web could magically change just because Google tweaks their algos. Doh! I really hope that the WebSpam-team checks the age of such links before they penalize everything implemented according to their guidelines before mid-2009 or the InterWeb’s downfall, whatever comes last.

I guess in the meantime you’ve figured out that I’m somewhat pissed. Not that the secretly changed flow of PageRank™ a year ago in 2008 had any impact on my rankings, or SERP traffic. I’ve always designed my stuff with PageRank™ flow in mind, but without any misuses of rel=”nofollow”, so I’m still fine with Google.

What I can’t stand is when a search engine tries to tell me how I have to link (out). Google engineers are really smart folks, they’re perfectly able to develop a PageRank™ algo that can decide how much Google-juice a particular link should pass. So dear Googlers, please –WRT to the implementation of hyperlinks– leave us webmasters alone, dump the rel-nofollow crap and rank our stuff in the best interest of your searchers. No longer bother us with linking guidelines that change yearly. It’s not our job nor responsibility to act as your cannon fodder or slavish code monkeys when you spot a loophole in your ranking- or spam-detection-algos.

Of course the above said is based on common sense, so Google won’t listen (remember: I’m really upset, hence polemic statements are absolutely appropriate). To prevent webmasters from irrational actions prompted by misled search engines, I hereby introduce the

Webmaster guidelines for search engine friendly links

What follows is pseudo-code, implement it with your preferred server sided scripting language.

if (getAttribute($link, 'rel') matches '*nofollow*' &&
    $userAgent matches '*Googlebot*') {
    print '<strong rev="' + getAttribute($link, 'href') + '"'
    + ' style="color:blue; text-decoration:underline;"'
    + ' onmousedown="window.location=this.getAttribute(\'rev\');"'
    + '>' + getAnchorText($link) + '</strong>';
}
else {
    print $link;
}

Probably it’s a good idea to snip both the onmousedown trigger code and the rev attribute when the script serves Googlebot. Just because today Google states that they’re going to pass link juice to URIs grabbed from the onclick trigger, that doesn’t mean they’ll never look at the onmousedown event or misused (X)HTML attributes.
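In case you prefer something copy-and-paste-able over pseudo-code, here’s a minimal PHP sketch of the routine above, including the Googlebot variant without rev and onmousedown. The print_se_friendly_link() helper and the $link array structure are made up for this example; wire it into your own templates as you see fit.

function print_se_friendly_link(array $link, $userAgent) {
    // $link is expected as array('href' => ..., 'rel' => ..., 'anchor' => ...)
    $isGooglebot = (stripos($userAgent, 'Googlebot') !== false);
    $isNofollow  = (stripos($link['rel'], 'nofollow') !== false);
    if ($isNofollow && $isGooglebot) {
        // Googlebot gets a plain blue underlined STRONG element: no A element,
        // no rev attribute, no onmousedown trigger
        print '<strong style="color:blue; text-decoration:underline;">'
            . htmlspecialchars($link['anchor']) . '</strong>';
    }
    elseif ($isNofollow) {
        // human visitors get the clickable faux link
        print '<strong rev="' . htmlspecialchars($link['href']) . '"'
            . ' style="color:blue; text-decoration:underline;"'
            . ' onmousedown="window.location=this.getAttribute(\'rev\');">'
            . htmlspecialchars($link['anchor']) . '</strong>';
    }
    else {
        // clean links stay untouched
        print '<a href="' . htmlspecialchars($link['href']) . '">'
            . htmlspecialchars($link['anchor']) . '</a>';
    }
}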

This way you can deliver Googlebot exactly the same stuff that the punter surfer gets. You’re perfectly compliant with Google’s cloaking restrictions. There’s no need to bother with complicated stuff like iFrames or even disabled blog comments, forums or guestbooks.

Just feed the crawlers with all the crap the search engines require, then concentrate all your efforts on your UI for human visitors. Web robots (bots, crawlers, spiders, …) don’t supply your signup-forms w/ credit card details. Humans do. If you find the time to upsell them while search engines keep you busy with thoughtless change requests all day long.




Nofollow still means don’t follow, and how to instruct Google to crawl nofollow’ed links nevertheless

painting a nofollow'ed link dofollow
What was meant as a quick test of rel-nofollow once again (inspired by Michelle’s post stating that nofollow’ed comment author links result in rankings) turned into some interesting observations:

  • Google uses sneaky JavaScript links (that mask nofollow’ed static links) for discovery crawling, and indexes the link destinations even though there’s no hard coded link on any page on the whole Web.
  • Google doesn’t crawl URIs found in nofollow’ed links only.
  • Google most probably doesn’t use anchor text outputted client sided in rankings for the page that carries the JavaScript link.
  • Google most probably doesn’t pass anchor text of JavaScript links to the link destination.
  • Google doesn’t pass anchor text of (hard coded) nofollow’ed links to the link destination.

As for my inspiration, I guess not all links in Michelle’s test were truly nofollow’ed. However, she’s spot on stating that condomized author links aren’t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don’t pay too much attention to REL attributes when you spread your links.

As for my quick test explained below, please consider it an inspiration too. It’s not a full blown SEO test, because I’ve checked one single scenario for a short period of time. However, looking at its results only within 24 hours after uploading the test makes it quite sure that the test isn’t influenced by external noise, for example scraped links and such stuff.

On 2008-02-22 06:20:00 I’ve put a new nofollow’ed link onto my sidebar: Zilchish Crap
<a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-a" rel="nofollow"><span id="repstuff-something-b">Zilchish Crap</span></a>
<script type="text/javascript">
handle=document.getElementById('repstuff-something-b');
handle.firstChild.data='Nillified, Nil';
handle=document.getElementById('repstuff-something-a');
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js1';
handle.rel='dofollow';
</script>

(The JavaScript code changes the link’s HREF, REL and anchor text.)

The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow’ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.

Google crawls URIs extracted from somewhat sneaky JavaScript code

20 minutes later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element’s HREF:
66.249.72.5 2008-02-22 06:47:07 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js1

Roughly three hours after this visit Googlebot fetched an URI provided only in JS code on the test page:
handle=document.getElementById('a1');
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js2';
handle.rel='dofollow';

From the log:
66.249.72.5 2008-02-22 09:37:11 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js2

So far Google ignored the hidden JavaScript link to /repstuff/something.php?nil=js3 on the test page. Its code doesn’t change a static link, so that makes sense in the context of repeated statements like “Google ignores JavaScript links / treats them like nofollow’ed links” by Google reps.

Of course the JS code above is easy to analyze, but don’t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!

Google indexes pages that have only JavaScript links pointing to them

The next day I’ve checked the search index, and the results are interesting:

rel-nofollow-test search results

The first search result is the content of the URI with the query string parameter ?nil=js1, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI /repstuff/something.php without query string. There’s not a single real link to this URI elsewhere.

The second search result is a post URI where Google recognized the hard coded anchor text “zilchish crap”, but not the JS code that overwrites it with “Nillified, Nil”. With the SERP-URI parameter “&filter=0” Google shows more posts that are findable with the search term [zilchish]. (Hey Matt and Brian, here’s room for improvement!)

Google doesn’t pass anchor text of nofollow’ed links to the link destination

A search for [zilchish site:sebastians-pamphlets.com] doesn’t show the test page, which doesn’t carry this term. In other words, so far the anchor text “zilchish crap” of the nofollow’ed sidebar link hasn’t impacted the test page’s rankings.

Google doesn’t treat anchor text of JavaScript links as textual content

A search for [nillified site:sebastians-pamphlets.com] doesn’t show any URIs that have “nil, nillified” as client sided anchor text on the sidebar, just the test page:

rel-nofollow-test search results

Results, conclusions, speculation

This test wasn’t intended to evaluate whether JS outputted anchor text gets passed to the link destination or not. Unfortunately “nil” and “nillified” appear both in the JS anchor text as well as on the page, so that’s for another post. However, it seems the JS anchor text isn’t indexed for the pages carrying the JS code, at least they don’t appear in search results for the JS anchor text, so most likely it will not be assigned to the link destination’s relevancy for “nil” or “nillified” as well.

Maybe Google’s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows.

However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I’m quite confident that condomized links still don’t pass reputation, nor topical relevance. My test page is unfindable for the nofollow’ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.

How to safely strip a link condom

And what’s the actual “news”? Well, say you’ve links that you must condomize because they’re paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow’ed link server sided, and change it to a clean link with JavaScript. Google told us for ages that JS links don’t count, so that’s perfectly in line with Google’s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don’t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client sided version of the link. Isn’t that neat?
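For illustration only, here’s how such a single condomized-then-cleaned link could be printed server sided with PHP. The $href/$anchor values and the element ID “paid-link-1” are made up for this sketch; HREF and anchor text stay identical in both versions (add a title attribute the same way if you use one).

// sketch: print a condomized link, then strip the condom client sided
$href   = 'http://advertiser.example.com/';
$anchor = 'advertiser anchor text';
echo '<a id="paid-link-1" href="' . $href . '" rel="nofollow">' . $anchor . '</a>';
echo '<script type="text/javascript">'
   . 'document.getElementById("paid-link-1").rel = "dofollow";'
   . '</script>';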

Link condoms with juicy taste faking good karma

Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links, which otherwise look plain ugly when a visitor uses CSS to highlight nofollow’ed links.

Here is how you can do this for a complete Web page. This link is nofollow’ed. The JavaScript code below changed its REL value to “dofollow”. When you put this code at the bottom of your pages, it will un-condomize all your nofollow’ed links.
<script type="text/javascript">
if (document.getElementsByTagName) {
    var aElements = document.getElementsByTagName("a");
    for (var i = 0; i < aElements.length; i++) {
        var relvalue = aElements[i].rel.toUpperCase();
        // match() returns null when "NOFOLLOW" isn't found
        if (relvalue.match("NOFOLLOW") != null) {
            aElements[i].rel = "dofollow";
        }
    }
}
</script>

(You’ll find still condomized links on this page. That’s because the JavaScript routine above changes only links placed above it.)

When you add JavaScript routines like that to your pages, you’ll increase their page loading time. IOW you slow them down. Also, you should add a note to your linking policy to avoid confused advertisers who chase toolbar PageRank.

Updates: Obviously Google distrusts me, how come? Four days after the link discovery the search quality archangel requested the nofollow’ed URI –without query string– possibly to check whether I serve different stuff to bots and people. As if I’d cloak, laughable. (Or an assclown linked the URI without condom.)
Day five: Google’s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren’t interested in client-sided links at all?




Why storing URLs with truncated trailing slashes is utter idiocy

Yahoo steals my trailing slashes
With some Web services URL canonicalization has a downside. What works great for major search engines like Google can fire back when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog’s reputation at Technorati.

In fact the problem is not your URL canonicalization, e.g. 301 redirects from http://example.com to http://example.com/ respectively http://example.com/directory to http://example.com/directory/, but crappy software that removes trailing forward slashes from your URLs.

Dear Web developers, if you really think that home page locations respectively directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “http://example.com” as anchor text makes any sense when the URL in HREF points to “http://example.com/”). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes. Many usages of invalid URLs are capable of damaging the search engine rankings of the link destinations. You can’t control that, hence don’t screw our URLs. Never. Period.

Folks who don’t agree with the above said read on.

    TOC:

  • What is a trailing slash? About URLs, directory URIs, default documents, directory indexes, …
  • How to rescue stolen trailing slashes: About Apache’s handling of directory requests, and rewriting respectively redirecting invalid directory URIs in .htaccess as well as in PHP scripts.
  • Why stealing trailing slashes is not cool: Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.
  • How URL canonicalization irritates Technorati: 301 redirects that “add” a trailing slash to directory URLs, respectively virtual URIs that mimic directories, seem to irritate Technorati so much that it can’t compute reputation, recent post lists, and so on.

What is a trailing slash?

The Web’s standards say (links and full quotes): The trailing path segment delimiter “/” represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite “should” as “must”.)

To understand that, let’s look at the most common URL components:
scheme:// server-name.tld /path ?query-string #fragment
The path part begins with a forward slash “/” and must consist of at least one byte (the trailing slash itself in case of the home page URL http://example.com/).

If an URL ends with a slash, it points to a directory’s default document, or, if there’s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because “/” after the TLD (.com|net|org|…) stands for the root directory.

Automated directory indexes (a list of links to all files) should be forbidden; use Options -Indexes in .htaccess to send such requests to your 403-Forbidden page.

In order to set default file names and their search sequence for your directories use DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php. In this example: on request of http://example.com/directory/ Apache will first look for /directory/index.html, then if that doesn’t exist for /directory/index.htm, then /directory/index.php, and if all that fails, it will serve an error page (that should log such requests so that the Webmaster can upload the missing default document to /directory/).

The URL http://example.com (without the trailing slash) is invalid, and there’s no specification telling a reason why a Web server should respond to it with meaningful contents. Actually, the location http://example.com points to Null  (nil, zilch, nada, zip, nothing), hence the correct response is “404 - we haven’t got ‘nothing to serve’ yet”.

The same goes for sub-directories. If there’s no file named “/dir”, the URL http://example.com/dir points to Null too. If you’ve a directory named “/dir”, the canonical URL http://example.com/dir/ either points to a directory index page (an autogenerated list of all files) or the directory’s default document “index.(html|htm|shtml|php|…)”. A request of http://example.com/dir –without the trailing slash that tells the Web server that the request is for a directory’s index– resolves to “not found”.

You must not reference a default document by its name! If you’ve links like http://example.com/index.html you can’t change the underlying technology without serious hassles. Say you’ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you’ll realize that static stuff sucks, hence you’ll develop a dynamic site with PHP. You’ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you’ve coded your internal links as http://example.com/contact/ etc. they’ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from “… index.html … index.php …” to “… index.php … index.html …”. (Of course you can configure Apache to parse .html files for PHP code, but that’s another story.)

It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of “index” files (intentionally or not). I’d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes.

How to rescue stolen trailing slashes

Since Web servers know that users are faulty by design, they jump through a couple of resource burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values …) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar.

With Apache, DirectorySlash On enables this behavior (check whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules:
RewriteEngine on
RewriteBase /content/
RewriteRule ^dir1$ http://example.com/content/dir1/ [R=301,L]
RewriteRule ^dir2$ http://example.com/content/dir2/ [R=301,L]

With content management systems (CMS) that generate virtual URLs on the fly, often there’s no other chance than hacking the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, you’ll perform permanent redirects (301).

Here is a WordPress (header.php) example:
$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$doRedirect = FALSE;
$fileExtensions = array(".html", ".htm", ".php");
$serverName = $_SERVER["SERVER_NAME"];
$canonicalServerName = $serverName;

// if you prefer http://example.com/* URLs remove the "www.":
$srvArr = explode(".", $serverName);
$canonicalServerName = $srvArr[count($srvArr) - 2] . "." . $srvArr[count($srvArr) - 1];

$url = parse_url("http://" . $canonicalServerName . $requestUri);
$requestUriPath = $url["path"];
if (substr($requestUriPath, -1, 1) != "/") {
    // requests for real files (.html, .htm, .php ...) keep their URI untouched
    $isFile = FALSE;
    foreach ($fileExtensions as $fileExtension) {
        if (strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension)) {
            $isFile = TRUE;
        }
    }
    if (!$isFile) {
        $requestUriPath .= "/";
        $doRedirect = TRUE;
    }
}
$canonicalUrl = "http://" . $canonicalServerName . $requestUriPath;
if ($queryString) {
    $canonicalUrl .= "?" . $queryString;
}
if (!empty($url["fragment"])) {
    $canonicalUrl .= "#" . $url["fragment"];
}
if ($doRedirect) {
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit;
}

Check your permalink settings and edit the values of $fileExtensions and $canonicalServerName accordingly. For other CMSs adapt the code, perhaps you need to change the handling of query strings and fragments. The code above will not run under IIS, because it has no REQUEST_URI variable.
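If you have to run something like this under IIS anyway, you can rebuild a comparable value yourself before the canonicalization code executes. This is just a hedged sketch; variable availability differs between IIS versions and PHP setups, so test it on your own server:

// IIS workaround sketch: fake REQUEST_URI from variables IIS does provide
if (empty($_SERVER["REQUEST_URI"])) {
    $requestUri = $_SERVER["SCRIPT_NAME"];
    if (!empty($_SERVER["QUERY_STRING"])) {
        $requestUri .= "?" . $_SERVER["QUERY_STRING"];
    }
    $_SERVER["REQUEST_URI"] = $requestUri;
}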

Why stealing trailing slashes is not cool

This section expressed in one sentence: Cool URLs don’t change, hence changing other people’s URLs is not cool.

Folks should understand the “U” in URL as unique. Each URL addresses one and only one particular resource. Technically speaking, if you change one single character of an URL, the altered URL points to a different resource, or nowhere.

Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.

Yahoo'ish fools steal our trailing slashes
Only a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren’t much different. If somebody hands out an URL (with trailing slash), then use it as is, or don’t use it at all. Don’t “prettify” it, because any change destroys its serviceability.

If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (browser, screen reader, bot) with a redirect header telling that one must use a trailing slash, then the user agent has to re-issue the request in the formally correct way. From a Webmaster’s perspective, burning resources that thoughtlessly is plain theft. From a user’s perspective, things will often work without the slash, but they’ll be quicker with it. “Often” doesn’t equal “always”:

  • Some Web servers will serve the 404 page.
  • Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.
  • Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.
  • Some Web servers will serve identical contents for /dir and /dir/, that leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to all known URL variants, instead of accumulated rankings to both URLs (which would be the right thing to do, but is technically, well, challenging).
  • Some user agents can’t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect’s “error message”, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang up respectively process garbage.

Does it really make sense to maliciously manipulate URLs just because some clueless developers say “dude, without the slash it looks way cooler”? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach.

KISS (keep it simple, stupid) is a great principle. “Cosmetic corrections” like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don’t change their inputs. As long as the input is processible, they accept it, otherwise they skip it.

Exceptions

URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - “domain.co.uk” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than a lost visitor who won’t type in the canonical URL including scheme, www-prefix, and trailing slash.

How URL canonicalization seems to irritate Technorati

Due to the not exactly responsive (respectively swamped) Technorati user support, parts of this section should be interpreted as educated speculation. Also, I didn’t research enough cases to come to a working theory. So here is just the story “how Technorati fails to deal with my blog”.

When I moved my blog from blogspot to this domain, I’ve enhanced the faulty WordPress URL canonicalization. If any user agent requests http://sebastians-pamphlets.com it gets redirected to http://sebastians-pamphlets.com/. Invalid post/page URLs like http://sebastians-pamphlets.com/about redirect to http://sebastians-pamphlets.com/about/. All redirects are permanent, returning the HTTP response code “301”.

I’ve claimed my blog as http://sebastians-pamphlets.com/, but Technorati shows its URL without the trailing slash.
…<div class="url"><a href="http://sebastians-pamphlets.com">http://sebastians-pamphlets.com</a> </div> <a class="image-link" href="/blogs/sebastians-pamphlets.com"><img …

By the way, they forgot dozens of fans (folks who “fave’d” either my old blogspot outlet or this site) too.
Blogs claimed at Technorati

I’ve added a description and tons of tags, which both don’t show up on public pages. It seems my tags were deleted, at least they aren’t visible in edit mode any more.
Edit blog settings at Technorati

Shortly after the submission, Technorati stopped adjusting the reputation score from newly discovered inbound links. Furthermore, the list of my recent posts became stale, although I’ve pinged Technorati with every update, and Technorati received my update notifications via ping services too. And yes, I’ve tried manual pings to no avail.

I’ve gained lots of fresh inbound links, but the authority score didn’t change. So I’ve asked Technorati’s support for help. A few weeks later, in December/2007, I’ve got an answer:

I’ve taken a look at the issue regarding picking up your pings for “sebastians-pamphlets.com”. After making a small adjustment, I’ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.

Please let us know if you experience any problems in the future. Do not hesitate to contact us if you have any other questions.

Indeed, Technorati updated the reputation score from “56” to “191”, and refreshed the list of posts including the most recent one.

Of course the “small adjustment” didn’t persist (I assume that a batch process stole the trailing slash that the friendly support person had added). I’ve sent a follow-up email asking whether that’s a slash issue or not, but haven’t received a reply yet. I’m quite sure that Technorati doesn’t follow 301-redirects, so that’s a plausible cause for this bug at least.

Since December 2007 Technorati hasn’t updated my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the reactions page - by the way these numbers are often unreal and change in the range of hundreds from day to day).
Blog reactions and authority scoring at Technorati

It seems Technorati hasn’t indexed my posts since then (December/18/2007), so probably my outgoing links don’t count for their destinations.
Stale list of recent posts at Technorati

(All screenshots were taken on February/05/2008. When you click the Technorati links today, they will hopefully look different.)

I’m not amused. I’m curious what would happen when I add
if (!preg_match("/Technorati/i", "$userAgent")) {/* redirect code */}

to my canonicalization routine, but I can resist the temptation to handle particular Web robots differently. My URL canonicalization should be identical both for visitors and crawlers. Technorati should be able to fix this bug without code changes at my end or weekly support requests. Wishful thinking? Maybe.

Update 2008-03-06: Technorati crawls my blog again. The 301 redirects weren’t the issue. I’ll explain that in a follow-up post soon.




Text link broker woes: Google’s smart paid link sniffers

Google's smart paid link sniffer at work
After the recent toolbar PageRank massacre, link brokers are in the spotlight. One of them, TNX beta1, asked me to post a paid review of their service. It took a while to explain that nobody can buy a sales pitch here. I offered to write a pitiless honest review for a low hourly fee, provided a sample on their request, but got no order or payment yet. Never mind. Since the topic is hot, here’s my review, paid or not.

So what does TNX offer? Basically it’s a semi-automated link exchange where everybody can sign up to sell and/or purchase text links. TNX takes 25% commission, 12.5% from the publisher, and 12.5% from the advertiser. They calculate the prices based on Google’s toolbar PageRank and link popularity pulled from Yahoo. For example a site putting five blocks of four links each on one page with toolbar PageRank 4/10 and four pages with a toolbar PR 3/10 will earn $46.80 monthly.

TNX provides a tool to vary the links, so that when an advertiser purchases for example 100 links it’s possible to output those in 100 variations of anchor text as well as surrounding text before and after the A element, on possibly 100 different sites. Also TNX has a solution to increase the number of links slowly, so that search engines can’t find a gazillion of uniform links to a (new) site all of a sudden. Whether or not that’s sufficient to simulate natural link growth remains an unanswered question, because I’ve no access to their algorithm.

Links as well as participating sites are reviewed by TNX staff, and frequently checked with bots. Links shouldn’t appear on pages which aren’t indexed by search engines or viewed by humans, or on 404 pages, pages with long and ugly URLs and such. They don’t accept PPC links or offensive ads.

All links are outputted server sided, which requires PHP or Perl (ASP/ASPX coming soon). There is a cache option, so it’s not necessary to download the links from the TNX servers for each page view. TNX recommends renaming the /cache/ directory to avoid an easily detectable sign for the occurrence of TNX paid links on a Web site. Links are stored as plain HTML, besides the target="_blank" attribute there is no obvious footprint or pattern on link level. Example:
Have a website? See this <a href="http://www.example.com" target="_blank">free affiliate program</a>.
Have a blog? Check this <a href="http://www.example.com" target="_blank">affiliate program with high comissions</a> for publishers.

Webmasters can enter any string as delimiter, for example <br /> or “•”:

Have a website? See this free affiliate program. • Have a blog? Check this affiliate program with high comissions for publishers.

Publishers can choose from 17 niches, 7 languages, 5 linkpop levels, and 7 toolbar PageRank values to target their ads.

From the system stats in the members area the service is widely used:

  • As of today [2007-11-06] we have 31,802 users (daily growth: +0.62%)
  • Links in the system: 31,431,380
  • Links created in last hour: 1,616
  • Number of pages indexed by TNX: 37,221,398

Long story short, TNX jumped through many hoops to develop a system which is supposed to trade paid links that are undetectable by search engines. Is that so?

The major weak point is the system’s growth and that its users are humans. Even if such a system were perfect, users will make mistakes and reveal the whole network to search engines. Here is how Google has identified most if not all of the TNX paid links:

Some Webmasters put their TNX links in sidebars under a label that identifies them as paid links. Google crawled those pages, and stored the link destinations in its paid links database. Also, they devalued at least the labelled links; possibly the whole page or even the complete site lost its ability to pass link juice, because the few paid links aren’t condomized.

Many Webmasters implemented their TNX links in templates, so that they appear on a large number of pages. Actually, that’s recommended by TNX. Even if the advertisers have used the text variation tool, their URLs appeared multiple times on each site. Google can detect site wide links, even if not each and every link appears on all pages, and flags them accordingly.

Maybe even a few Googlers have signed up and served the TNX links on their personal sites to gather examples, although that wasn’t necessary because so many Webmasters with URLs in their signatures have told Google in this DP thread that they’ve signed up and at least tested TNX links on their pages.

Next Google compared the anchor text as well as the surrounding text of all flagged links, and found some patterns. Of course putting text before and after the linked anchor text seems to be a smart way to fake a natural link, but in fact Webmasters applied a bullet-proof procedure to outsmart themselves, because with multiple occurrences of the same text constellations pointing to an URL, especially when found on unrelated sites (different owners, hosts etc., topical irrelevancy plays no role in this context), paid link detection is a breeze. Linkage like that may be “natural” with regard to patterns like site wide advertising or navigation, but a lookup in Google’s links database revealed that the same text constellations and URLs were found on n other sites too.
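To illustrate why such constellations are trivial to spot, here’s a toy PHP sketch of that kind of lookup. It is emphatically not Google’s algorithm, just a made-up fingerprinting example: each link gets reduced to its target URL plus the normalized text before, inside, and after the A element, and fingerprints that show up on many different sites get flagged.

// toy sketch only: fingerprint = normalized surrounding text + target URL
function constellation_fingerprint($targetUrl, $textBefore, $anchorText, $textAfter) {
    $parts = array($textBefore, $anchorText, $textAfter);
    foreach ($parts as $i => $part) {
        $parts[$i] = strtolower(trim(preg_replace('/\s+/', ' ', $part)));
    }
    return md5(implode('|', $parts) . '|' . $targetUrl);
}
// $crawledLinks: array of array('site' => ..., 'target' => ..., 'before' => ...,
// 'anchor' => ..., 'after' => ...) -- a made-up structure for this sketch
function flag_suspicious_constellations($crawledLinks, $threshold = 10) {
    $sitesPerFingerprint = array();
    foreach ($crawledLinks as $link) {
        $fp = constellation_fingerprint($link['target'], $link['before'], $link['anchor'], $link['after']);
        $sitesPerFingerprint[$fp][$link['site']] = TRUE;
    }
    $flagged = array();
    foreach ($sitesPerFingerprint as $fp => $sites) {
        if (count($sites) >= $threshold) {
            // the same constellation points to the same URL from many different sites
            $flagged[] = $fp;
        }
    }
    return $flagged;
}

A real system would of course weigh in link age, ownership signals and whatnot, but the basic pattern lookup is that cheap.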

Now that Google had compiled the seed, each and every instance of Googlebot delivered more evidence. It took Google only one crawl cycle to identify most sites carrying TNX links, and all TNX advertisers. Paid link flags from pages on sites with a low crawling frequency were delivered in addition. Meanwhile Google has drawn a comprehensive picture of the whole TNX network.

I’ve developed such a link network many years ago (it’s defunct now). It was successful because only very experienced Webmasters controlling a fair amount of squeaky clean sites were invited. Allowing newbies to participate in such an organized link swindle is the kiss of death, because newbies do make newbie mistakes, and Google makes use of newbie mistakes to catch all participants. By the way, with the capabilities Google has today, my former approach to manipulate rankings with artificial linkage would be detectable with statistical methods similar to the algo outlined above, despite the closed circle of savvy participants.

From reading the various DP threads about TNX as well as their sales pitches, I’ve recognized a very popular misunderstanding of Google’s mentality. Folks are worrying whether an algo can detect the intention of links or not, usually focusing on particular links or linking methods. Google on the other hand looks at the whole crawlable Web. When they develop a paid link detection algo, they have a copy of the known universe to play with, as well as a complete history of each and every hyperlink crawled by Ms. Googlebot since 1998 or so. Naturally, their statistical methods will catch massive artificial linkage first, but fine tuning the sensitivity of paid link sniffers respectively creating variants to cover different linking patterns is no big deal. Of course there is always a way to hide a paid link, but nobody can hide millions of them.

Unfortunately, the unique selling point of the TNX service –that goes for all link brokers by the way– is manipulation of search engine rankings, hence even if they would offer nofollow’ed links to trade traffic instead of PageRank, most probably they would be forced to reduce the prices. Since TNX links are rather cheap, I’m not sure that will pay. It would be a shame when they decide to change the business model but it doesn’t pay for TNX, because the underlying concept is great. It just shouldn’t be used to exchange clean links. All the tricks developed to outsmart Google, like the text variation tool or not putting links on not exactly trafficked pages, are suitable to serve non-repetitive ads (coming with attractive CTRs) to humans.

I’ve asked TNX: I’ve decided to review your service on my blog, regardless whether you pay me or not. The result of my research is that I can’t recommend TNX in its current shape. If you still want a paid review, and/or a quote in the article, I’ve a question: Provided Google has drawn a detailed picture of your complete network, are you ready to switch to nofollow’ed links in order to trade traffic instead of PageRank, possibly with slightly reduced prices? Their answer:

We would be glad to accept your offer of a free review, because we don’t want to pay for a negative review.
Nobody can draw a detailed picture of our network - it’s impossible for one advertiser to buy links from all or a majority sites of our network. Many webmasters choose only relevant advertisers.
We will not switch to nofollow’ed links, but we are planning not to use Google PR for link pricing in the near future - we plan to use our own real-time page-value rank.

Well, it’s not necessary to find one or more links on all sites to identify a network.




Buying cheap viagra algorithmically

Since Google can’t manage to clean up [Buy cheap viagra] let’s do it ourselves. Go seek a somewhat trusted search blog mentioning “buy cheap viagra” somewhere in the archives and link to the post with a slightly diversified anchor text like “how to buy cheap viagra online“. Matt deserves a #1 spot by the way so spread many links …

Then when Matt is annoyed enough and Google has kicked out the unrelated stuff from this search hopefully my viagra spam will rank as deserved again ;)

Update a few hours later: Matt ranks #1 for [buy cheap viagra algorithmically]:
Matt Cutts's first spot for [buy cheap viagra algorithmically]
His ranking for [buy cheap viagra] fell about 10 positions to #17 but for [buy cheap viagra online] he’s still on the first SERP, now at position #10 (#3 yesterday). Interesting. It seems that Google’s newish turbo-blog-indexing influences the rankings of pages linked from blog posts rather quickly, but not exactly long lasting.

Related posts:
Negative SEO At Work: Buying Cheap Viagra From Google’s Very Own Matt Cutts - Unless You Prefer Reddit? Or Topix? by Fantomaster
Trust + keywords + link = Good ranking (or: How Matt Cutts got ranked for “Buy Cheap Viagra”) by Wiep




Google assists SERP Click-Through Optimization

Big Mama Google in her ongoing campaign to keep her search index clean assists Webmasters with reports allowing click-through optimization of a dozen or so pages per Web site. Google launched these reports a while ago, but most Webmasters didn’t make the best use of them. Now that Vanessa has revealed her SEO secrets, let’s discuss why and how Google helps increasing, improving, and targeting search engine traffic.

Google is not interested in gazillions of pages which rank high for (obscure) search terms but don’t get clicked from the SERPs. This clutter tortures the crawler and indexer, and it wastes expensive resources the query engine could use to deliver better results to the searchers.

Unfortunately, legions of clueless SEOs work hard to increase mount clutter by providing their clients with weekly ranking reports, which leads to even more pages that rank for (potentially money making) search phrases but appear on the SERPs with such crappy titles and snippets that not even a searcher coming with an IQ slightly below a slice of bread clicks them.

High rankings don’t pay the bills, converting traffic from SERPs on the other hand does. A nicely ranking page is an asset, which in most cases just needs a few minor tweaks to attract search engine users (mount clutter contains machine generated cookie-cutter pages too, but that’s a completely different story).

For example unattended pages gaining their SERP position from anchor text of links pointing to them often have a crappy click through rate (CTR). Say you’ve a page about a particular aspect of green widgets, which applies to widgets of all colors. For some reason folks preferring red widgets like your piece and link to it with “red widgets” as anchor text. The page will rank fine for [red widgets], but since “red widgets” is not mentioned on the page this keyword phrase doesn’t appear in the SERP’s snippets, not to speak of the linked title. Search engine users seeking information on red widgets don’t click the link about green widgets, although it might be the best matching search result.

So here is the click-thru optimization process based on Google’s query stats (it doesn’t work with brand new sites nor more or less unindexed sites, because the data provided in Google’s Webmaster Tools are available, reliable and quite accurate for somewhat established sites only):

Login, choose a site and go to query stats. In an ideal world you’ll see two tables of rather identical keyword lists (all examples made up).

Top search queries         Avg. Pos.   |   Top SERP clicks          Avg. Pos.
1. web site design             5       |   1. web site design            4
2. google consulting           4       |   2. seo consulting             5
3. seo consulting              3       |   3. google consulting          2
4. web site structures         2       |   4. internal links             3
5. internal linkage            1       |   5. web site structure         3
6. crawlability                3       |   6. crawlability               5

The “Top search queries” table on the left shows positions for search phrases on the SERPs, regardless whether these pages got clicks or not. The “Top search query clicks” table on the right shows which search terms got clicked most, and where the landing pages were positioned on their SERPs. If good keywords appear in the left table but not in the right one, you’ve CTR optimization potentials.
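If you’d rather not eyeball the two lists, a few lines of PHP do the gap check. The keyword arrays below are typed in by hand from the (made up) report above, and near-variants like “web site structure(s)” still need a human eye:

// keywords that rank (left table) but don't show up in the click report
// (right table) are CTR optimization candidates
$topSearchQueries = array('web site design', 'google consulting', 'seo consulting',
                          'web site structures', 'internal linkage', 'crawlability');
$topSerpClicks    = array('web site design', 'seo consulting', 'google consulting',
                          'internal links', 'web site structure', 'crawlability');
$ctrCandidates = array_diff($topSearchQueries, $topSerpClicks);
print_r($ctrCandidates); // "web site structures", "internal linkage"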

The “average top position” might differ from today’s SERPs, and it might differ for particular keywords even if those appear in the same line in both tables. Positioning fluctuation depends on a couple of factors. First, the position is recorded at the run time of each search query during the last 7 days, and within seven days a page can jump up and down on the SERPs. Second, positioning on for example UK SERPs can differ from US SERPs, so an average 3rd position may be an utterly useless value, when a page ranks #1 in the UK and gets a fair amount of traffic from UK SERPs, but ranks #8 on US SERPs and searchers don’t click it because the page is about a local event near Loch Nowhere in the highlands. Hence refine the reports by selecting your target markets in “location”, and if necessary “search type” too. Third, if these stats are generated based on very few searches and even fewer click throughs, they are totally and utterly useless for optimization purposes.

Let’s say you’ve got a site with a fair amount of Google search engine traffic. The next step is identifying the landing pages involved (you get only 20 search queries, so the report covers only a fraction of your site’s pages). Pull these data from your referrer stats, or extract SERP referrers from your logs to create a crosstab of search terms from Google’s reports per landing page. Although the click data are from Google’s SERPs, it might make sense to do this job with a broader scope, that is including referrers from all major search engines.
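Here’s a rough PHP sketch of that crosstab, pulling Google SERP referrers out of a combined-format access log. The log path and the regular expression are assumptions you’ll have to adjust to your own server setup:

$logFile = '/var/log/apache2/access.log'; // adjust to your setup
$crosstab = array(); // $crosstab[$landingPage][$searchPhrase] = click count
foreach (file($logFile) as $line) {
    // combined log format: "GET /landing-page HTTP/1.1" ... "http://www.google.com/search?q=..."
    if (!preg_match('/"GET (\S+) HTTP[^"]*".*"https?:\/\/www\.google\.[^"]*[?&]q=([^&"]+)[^"]*"/', $line, $m)) {
        continue;
    }
    $landingPage  = $m[1];
    $searchPhrase = strtolower(urldecode($m[2]));
    if (!isset($crosstab[$landingPage][$searchPhrase])) {
        $crosstab[$landingPage][$searchPhrase] = 0;
    }
    $crosstab[$landingPage][$searchPhrase]++;
}
// one line per landing page / search phrase pair
foreach ($crosstab as $page => $phrases) {
    foreach ($phrases as $phrase => $clicks) {
        print "$page\t$phrase\t$clicks\n";
    }
}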

Now perform the searches for your 20 keyword phrases (just click on the keywords on the report) to check how your pages look at the SERPs. If particular landing pages trigger search results for more than one search term, extract them all. Then load your landing page, and view its source. Read your page first rendered in your browser, then check out semantic hints in the source code, for example ALT or TITLE text and stuff like that. Look at the anchor text of incoming links (you can use link stats and anchor text stats from Google, We Build Pages Tools, …) and other ranking factors to understand why Google thinks this page is a good match for the search term. For each page, let the information sink before you change anything.

If the page is not exactly a traffic generator for other targeted keywords, you can optimize it with regard to a better CTR for the keyword(s) it ranks for. Basically that means use the keyword(s) naturally on all page areas where it makes sense, and provide each occurrence with a context which hopefully makes it into the SERP’s snippet.

Make up a few natural sentences a searcher might have in mind when searching for your keyword(s). Write them down. Order them by their ability to fit the current page text in a natural way. Bear in mind that with personalized search Google could have scanned the searcher’s brain to add different contexts to the search query, so don’t concentrate too much on the keyword phrase alone, but on short sentences containing both the keyword(s), respectively their synonyms, and a sensible context as well.

There is no magic number like “use the keywords 5 times to get a #3 spot” or “7 occurrences of a keyword gain you a #1 ranking”. Optimal keyword density is a myth, so just apply common sense by not annoying human readers. One readable sentence containing the keyword(s) might suffice. Also, emphasizing keywords (EM/I, STRONG/B, eye catching colors …) makes sense because it helps catching the attention of scanning visitors, but don’t over-emphasize because that looks crappy. The same goes for H2/H3/… headings. Structure your copy, but don’t write in headlines. When you emphasize a word or phrase in (bold) red, then don’t do that consistently but only in the most important sentence(s) of your page, and better only on the first visible screen of a longer page.

Work in your keyword+context laden sentences, but -again!- do it in a natural way. You’re writing for humans, not for algos which at this point already know what your page is all about and rank it properly. If your fine tuning gains you a better ranking that’s fine, but the goal is catching the attention of searchers reading (in most cases just skimming) your page title and a machine generated snippet on a search result page. Convince the algo to use your inserted sentence(s) in the snippet, not keyword lists from navigation elements or so.

Write a sensible summary of the page’s content, not more than 200-250 characters, and put that into the description meta tag. Do not copy the first paragraph or other text from the page. Write the summary from scratch instead, and mention the targeted keyword(s). The first paragraph on the page can exceed the length of the meta description to deliver an overview of the page’s message, and it should provide the same information, preferably in the first sentence, but don’t make it longish.

Check the TITLE tag in HEAD: when it is truncated on the SERP then shorten it so that the keyword becomes visible, perhaps move the keyword(s) to the beginning, or create a neat page title around the keyword(s). Do title changes very carefully, because the title is an important ranking factor and your changes could result in a ranking drop. Some CMSs change the URL without notice on changes of the title text, and you certainly don’t want to touch the URL at this point.

Make sure that the page title appears on the page too. Putting the TITLE tag’s content (or a slight variation) in a H1 element in BODY cannot hurt. If you for some weird reasons don’t use H-elements, then at least format it prominently (bold, different color but not red, bigger font size …).

If the page performs nicely with a couple of money terms and just has a crappy CTR for a particular keyword it ranks for, you can just add a link pointing to a (new) page optimized for that keyword(s), with the keyword(s) in the anchor text, preferably embedded in a readable sentence within the content (long enough to fill two lines under the linked title on the SERP), to improve the snippet. Adding a (prominent) link to a related topic should not impact rankings for other keywords too much, but the keywords submitted by searchers should appear in the snippet a short while after the next crawl. In such cases better don’t change the title, at least not now. If the page gained its ranking solely from anchor text of inbound links, putting the search term on the page can give it a nice boost.

Make sure you get an alert when Ms. Googlebot fetches the changed pages, and check out the SERPs and Google’s click stats a few days later. After a while you’ll get a pretty good idea of how Google creates snippets, and which snippets perform best on the SERPs. Repeat until success.
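One crude way to get such an alert, assuming you can drop a PHP include into the changed pages; the watch list, the log path and the whole approach are just a sketch, and a database or a mail() call works just as well:

// log a line whenever Googlebot fetches one of the recently changed pages
$watchedUris = array('/contact/', '/about/'); // the pages you've just tweaked
$userAgent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$requestUri  = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '';
if (stripos($userAgent, 'Googlebot') !== false && in_array($requestUri, $watchedUris)) {
    $msg = date('Y-m-d H:i:s') . " Googlebot fetched $requestUri ($userAgent)\n";
    error_log($msg, 3, '/path/to/googlebot-fetches.log'); // or mail() yourself
}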

Related posts:
Google Quality Scores for Natural Search Optimization by Chris Silver Smith
Improve SERP-snippets by providing a good meta description tag by Raj Krishnan from Google’s Snippets Team




More anchor text analysis from Webmaster Central

If you didn’t spot my update posted a few hours ago, log in to Webmaster Central and view your anchor text stats. Find way more phrases and play with the variations; these should allow you to track down sources by quoted search queries. Also, the word-stats are back.
Have fun!




Google’s Anchor Text Reports Encourage Spamming and Scraping

Despite the title-bait Google’s new Webmaster Toy is an interesting tool, thanks! I’ll look at it from a Webmaster’s perspective, all SEO hats in the wardrobe.

Danny tells us more, for example the data source:

A few more details about the anchor text data. First, it comes only from external links to your site. Anchor text you use on your own site isn’t counted.
Second, links to any subdomains you have are NOT included in the data.

From a quick view I doubted that Google filters internal links properly. So I’ve checked a few sites and found internal anchor text leading the new anchor text stats. Well, it’s not unusual that folks copy and paste links including anchor text. Using page titles as anchor text is also business as usual. But why the heck do page titles and shortcuts from vertical menu bars appear to be the most prominent external anchor text? With large sites having tons of inbounds that seems not so easy to investigate.

Thus I’ve looked at a tiny site of mine, this blog. It carries a few recent bollocks posts which were so useless that nobody bothered linking, I thought. Well, partly that’s the case, there were no links by humans, but enough fully automated links to get Ms. Googlebot’s attention. Thanks to scrapers, RSS aggregators, forums linking the author’s recent blog posts and so on, everything on this planet gets linked, so duplicated post titles make it into the anchor text stats.

Back to larger sites I found out that scraper sites and indexed search results were responsible for the extremely misleading ordering of Google’s anchor text excerpts. Both page types should not get indexed in the first place, and it’s ridiculous that crappy data, respectively irrelevantly accumulated data, dilute a well meant effort to report inbound linkage to Webmasters.

Unweighted ordering of inbound anchor text by commonness and limiting the number of listed phrases to 100 makes this report utterly useless for many sites, and, what’s worse, it sends a strong but wrong signal. Importance in the eye of the beholder gets expressed by the top of an ordered list or result set.

Joe Webmaster tracking down the sources of his top-10 inbound links finds a shitload of low-life pages, thinks hey, linkage is after all a game of large numbers when Google says those links are most important, launches a gazillion of doorways and joins every link farm out there. Bummer. Next week Joe Webmaster pops up in the Google Forum telling the world that his innocent site got tanked because he followed Google’s suggestions and we’re bothered with just another huge but useless thread discussing whether scraper links can damage rankings or not, finally inventing the “buffy blonde girl pointy stick penalty” on page 128.

Roundtrip to hat rack, SEO hat attached. I perfectly understand that Google is not keen on disclosing trust/quality/reputation scores. I can read these stats because I understand the anatomy (intention, data, methods, context), and perhaps I can even get something useful out of them. Explaining this to an impatient site owner who unwillingly bought the link quality counts argument but still believes that nofollow’ed links carry some mysterious weight because they appear in reversed citation results is a completely different story.

Dear Google, if you really can’t hand out information on link weighting by ordering inbound anchor text by trust or other signs of importance, then can you please at least filter out all the useless crap? This shouldn’t be that hard to accomplish, since even simple site-search queries for “scraping” reveal tons of definitely not index-worthy pages which already do not pass any search engine love to the link destinations. Thanks in advance!

When I write “Dear Google”, can Vanessa hear me? Please :)


Update: Check your anchor text stats page every now and then, don’t miss out on updates :)


