Archived posts from the 'Google' Category

You can’t escape from Google-Jail when …

spammers stuck in google jail… you’ve boosted your business Web site’s rankings with shitloads of crappy links. The 11th SEO commandment: Don’t promote your white hat sites with black hat link building methods! It may work for a while, but once you find your butt in Google-jail, there’s no way out. Not even a reconsideration request can help because you can’t provide its prerequisites.

When you’re caught eventually –penalized for tons of stinky links– and have to file a reinclusion request, Google wants you to remove all the shady links you’ve spread on the Web before they lift your penalty. Here is an example, well documented in a Google Groups thread started by a penalized site owner with official statements from Matt Cutts and John Müller from Google.

The site in question, a small family business from the UK, has used more or less every tactic from a lazy link builder’s textbook to create 40,000+ inbound links. Sponsored WordPress themes, paid links, comment spam, artificial link exchanges and whatnot.

Most sites that carry these links are in no way related to the penalized site, which deals with modern teak garden furniture and home furniture sets, for example porn galleries, Web designers, US city guides, obscure oriental blogs, job boards, or cat masturbation guides. (Don’t get me wrong. Of course not every link has to be topically related. Every link from a trusted page can pass PageRank, and can improve crawling, indexing, and so on.)

Google has absolutely no problem with unrelated links, unless a site’s link profile consists of way too many spammy and/or unrelated links. That does not mean that spreading a gazillion low-life links pointing to a competitor will get this site penalized or even banned. Negative SEO is not that simple. For an innocent site Google just ignores spammy inbound links, but most probably flags it for further investigations, both manually as well as algorithmically.

If on the other hand Google finds evidence that a site is actively involved in link monkey business of any kind, that’s a completely different story. Such evidence could be massively linking out to spammy places, hosting reciprocal links pages or FFA directories, unskillful (manual|automated) comment spam, signature links and mentions at places that trade links, textual contents made for (paid) link campaigns when reused too often, buying links from trackable services, (link request emails forwarded via) paid-link/spam reports, and so on.

Below is the “how to file a successful reconsideration request when your sins include link spam” from Googlers.

Matt Cutts:

The recommendation from your SEO guy led you directly into a pretty high-risk area; I doubt you really want pages like (NSAW) having sponsored links to your furniture site anyway. It’s definitely possible to extricate your site, but I would make an effort to contact the sites with your sponsored links and request that they remove the links, and then do a reconsideration request. Maybe in the text of your reconsideration request, I’d include a pointer to this thread as well.

John Müller:

You may want to consider what you can do to help clean up similar [=spammy] links on other people’s sites. Blogs and newspaper sites such as http://media.www.dailypennsylvanian.com sometimes receive short comments such as “dont agree”, apparently only for a link back to a site. These comments often use keywords from that site instead of a user name, perhaps “tree bench” for a furniture site or “sexy shoes” for a footwear site. If this kind of behavior might have taken place for your site, you may want to work on rectifying it and include some information on it in your reconsideration request. Given your situation, the person considering your reconsideration request might be curious about links like that.

Translation: We’ll ignore your weekly reconsideration requests unless you’ve removed all artificial links pointing to your site. You’re stuck in Google’s dungeon because they’ve thrown away the keys.

I’d guess that for a site that has filed a reinclusion request stating the site was involved in some sort of link monkey business, Google applies a more strict policy than with a site that was attacked by negative SEO methods. I highly doubt that when caught red-handed a lame excuse like “I didn’t create those links” is a tactic I could recommend, because Googlers hate it when an applicant lies in a reinclusion request.

Once caught and penalized, the “since when do inbound links count as negative votes” argument doesn’t apply. It’s quite clear that removing the traces (admitted as well as not admitted shady links) is a prerequisite for a penalty lift. And that even though Google has already discounted these links. That’s the same as with penalized doorway pages. Redirecting doorways to legit landing pages doesn’t count, Google wants to see a 410-Gone HTTP response code (or at least a 404) before they un-penalize a site.

I doubt that’s common knowledge to folks who promote their white hat sites with black hat methods. Getting links wiped out at places that didn’t check the intention of inserted links in the first place is a royal PITA, in other words, it’s impossible to get all shady links removed once you find your butt in Google-jail. That’s extremely uncomfortable for site owners who fell for questionable forum advice or hired a promotional service (no, I don’t call such assclowns SEOs) applying shady marketing methods without a clear and written warning that those are extremely risky, fully explained and signed by the client.

Maybe in some cases Google will un-penalize a great site although not all link spam was wiped out. However, the costs and efforts of preparing a successful resonsideration request are immense, not to speak of the massive loss of traffic and income.

As Barry mentioned, the thread linked above might be interesting for folks keen on an official confirmation that Google -60 penalties exist. I’d say such SERP penalties (aka red & yellow cards) aren’t exactly new, and it plays no role to which position a site penalized for guideline violations gets downranked. When I’ve lost a top spot for gaming Google, that’s kismet. I’m not interested in figuring out that 20k spammy links get me a -30 penalty, 40k shady links result in a -60 penalty, and 100k unnatural links qualify me for the famous -950 bashing (the numbers are made up of course). If I’d spam, then I’d just move on because I’d have already launched enough other projects to compensate the losses.

PS: While I was typing, Barry Schwartz posted his Google-Jail story at SE Roundtable.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!

Dear Google, please respect NOINDEXMatt Cutts asks us How should Google handle NOINDEX? That’s a tough question worth thinking twice before you submit a comment to Matt’s post. Here is Matt’s question, all the background information you need, and my opinion.

What is NOINDEX?

Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in robots meta tags. Putting a NOINDEX value in a page’s robots meta tag or X-Robots-Tag tells search engines that they shall not index the page content, but may follow links provided on the page.

To get a grip on NOINDEX’s role in the REP please read my Robots Exclusion Protocol summary at SEOmoz. Also, Google experiments with NOINDEX as crawler directive in robots.txt, more on that later.

How major search engines treat NOINDEX

Of course you could read a ton of my pamphlets to extract this information, but Matt’s summary is still accurate and easier to digest:

    [Matt Cutts on August 30, 2006]
  • Google doesn’t show the page in any way.
  • Ask doesn’t show the page in any way.
  • MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.
  • Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.

Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]

Google’s experimental support of NOINDEX as crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion the DISALLOW behavior of NOINDEX in robots.txt is completely wrong, and without any doubt in no way compliant to the Robots Exclusion Protocol.

Matt’s question: How should Google handle NOINDEX in the future?

To simplify Matt’s poll, lets assume he’s talking about NOINDEX as indexer directive, regardless where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

Here are the arguments, or pros and cons, for each variant:

Google should completely drop a NOINDEX’ed page from their search results

Obviously that’s what most Webmasters would prefer:

This is the behavior that we’ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way — in fact, one of the only ways — to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters.

NOINDEX means don’t index, search engines must respect such directives, even when the content isn’t password protected or cloaked away (redirected or hidden for crawlers but not for visitors).

The corner case that Google discovers a link and lists it on their SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn’t crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which is requested by search engines quite frequently. Ok, maybe Google’s BlitzCrawler™ has to request robots.txt more often then.

Google should show a reference to NOINDEX’ed pages on their SERPs

Search quality and user experience are strong arguments:

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).

Search quality and searchers’ user experience is also a strong argument for totally delisting NOINDEX’ed pages, because most Webmasters use this indexer directive to keep stuff that doesn’t provide value for searchers out of the search indexes. <polemic>I mean, how much weight have a few Korean sites when it comes to decisions that affect the whole Web?</polemic>

If a Webmaster puts a NOINDEX directive by accident, that’s easy to spot in the site’s stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.

The reference to the URI removal tools is out of context, because these tools remove an URI only for a short period of time and all removal requests have to be resubmitted repeatedly every few weeks. NOINDEX on the other hand is a way to keep an URI out of the index as long as this crawler directive is provided.

I’d say the sole argument for listing references to NOINDEX’ed pages that counts is misleading navigational searches. Of course that does not mean that Google may ignore the NOINDEX directive to show –with a linked reference– that they know a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.

Something in between, Google should find a reasonable way to please both Webmasters and searchers

Quoting Matt again:

The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.

Whether or not a site is popular must not impact a search engine’s respect for a Webmaster’s decision to keep search engines, and their users, out of her realm. That reads like “Hey, Google is popular, so we’ve the right to go to Mountain View to pillage the Googleplex, acquiring everything we can steal for the public domain”. Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google’s idea of (link) popularity should rule the Web. ;)

Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web’s content wasn’t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I’m not bashing DMOZ here, but as a matter of fact it’s not comprehensive enough to serve as indicator for anything, especially not importance and popularity.

I strongly believe that there’s no such thing as a criterion suitable to mark out a two class Web.

My take: Yes, No, Depends

Google could enhance navigational queries –and even “I feel lucky” queries– that lead to a NOINDEX’ed page with a message like “The best matching result for this query was blocked by the site”. I wouldn’t mind if they mention the URI as long as it’s not linked.

In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we’ve more REP indexer directives that could play a role in this discussion.

NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site’s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX’ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.

Of course that’s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it’s way better than ignoring NOINDEX in its current meaning.

A fair solution: NOREFERENCE

If I’d make the decision at Google and couldn’t live with a best matching search result blocked  message, I’d go for a new REP tag:

“NOINDEX, NOREFERENCE” in a robots meta tag –respectively Googlebot meta tag– or X-Robots-Tag forbids search engines to show a reference on their SERPs. In robots.txt this would look like
NOINDEX: /
NOINDEX: /blog/
NOINDEX: /members/

NOREFERENCE: /
NOREFERENCE: /blog/
NOREFERENCE: /members/

Search engines would crawl these URIs, and follow their links as long as there’s no NOFOLLOW directive either in robots.txt or a page specific instruction.

NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but allows references on SERPs. Supporting this indexer directive both in robots.txt as well as on-the-page (respectively in the HTTP header for X-Robots-Tags) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like NOINDEX=NOREFERENCE for robots.txt could tell search eniges how they have to treat NOINDEX statements on site level, or even on site area level.

Even more appealing would be NOINDEX=REFERENCE, because only the very few Webmasters that would like to see their NOINDEX’ed URIs on Google’s SERPs would have to add a directive to their robots.txt at all. Unfortunately, that’s not doable for Google unless they can convice three well known Korean sites to edit their robots.txt. ;)

 

By the way, don’t miss out on my draft asking for REP tag support in robots.txt!

Anyway: Dear Google, please don’t touch NOINDEX! :)



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Nofollow still means don’t follow, and how to instruct Google to crawl nofollow’ed links nevertheless

painting a nofollow'ed link dofollowWhat was meant as a quick test of rel-nofollow once again (inspired by Michelle’s post stating that nofollow’ed comment author links result in rankings), turned out to some interesting observations:

  • Google uses sneaky JavaScript links (that mask nofollow’ed static links) for discovery crawling, and indexes the link destinations despite there’s no hard coded link on any page on the whole Web.
  • Google doesn’t crawl URIs found in nofollow’ed links only.
  • Google most probably doesn’t use anchor text outputted client sided in rankings for the page that carries the JavaScript link.
  • Google most probably doesn’t pass anchor text of JavaScript links to the link destination.
  • Google doesn’t pass anchor text of (hard coded) nofollow’ed links to the link destination.

As for my inspiration, I guess not all links in Michelle’s test were truly nofollow’ed. However, she’s spot on stating that condomized author links aren’t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don’t pay too much attention on REL attributes when you spread your links.

As for my quick test explained below, please consider it an inspiration too. It’s not a full blown SEO test, because I’ve checked one single scenario for a short period of time. However, looking at its results within 24 hours after uploading the test only, makes quite sure that the test isn’t influenced by external noise, for example scraped links and such stuff.

On 2008-02-22 06:20:00 I’ve put a new nofollow’ed link onto my sidebar: Zilchish Crap
<a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-a" rel="nofollow"><span id="repstuff-something-b">Zilchish Crap</span></a>
<script type="text/javascript">
handle=document.getElementById(‘repstuff-something-b’);
handle.firstChild.data=‘Nillified, Nil’;
handle=document.getElementById(‘repstuff-something-a’);
handle.href=‘http://sebastians-pamphlets.com/repstuff/something.php?nil=js1’;
handle.rel=‘dofollow’;
</script>

(The JavaScript code changes the link’s HREF, REL and anchor text.)

The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow’ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.

Google crawls URIs extracted from somewhat sneaky JavaScript code

20 minutes later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element’s HREF:
66.249.72.5 2008-02-22 06:47:07 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js1

Roughly three hours after this visit Googlebot fetched an URI provided only in JS code on the test page:
handle=document.getElementById(‘a1’);
handle.href=‘http://sebastians-pamphlets.com/repstuff/something.php?nil=js2’;
handle.rel=‘dofollow’;

From the log:
66.249.72.5 2008-02-22 09:37:11 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js2

So far Google ignored the hidden JavaScript link to /repstuff/something.php?nil=js3 on the test page. Its code doesn’t change a static link, so that makes sense in the context of repeated statements like “Google ignores JavaScript links / treats them like nofollow’ed links” by Google reps.

Of course the JS code above is easy to analyze, but don’t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!

Google indexes pages that have only JavaScript links pointing to them

The next day I’ve checked the search index, and the results are interesting:

rel-nofollow-test search results

The first search result is the content of the URI with the query string parameter ?nil=js1, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI /repstuff/something.php without query string. There’s not a single real link to this URI elsewhere.

The second search result is a post URI where Google recognized the hard coded anchor text “zilchish crap”, but not the JS code that overwrites it with “Nillified, Nil”. With the SERP-URI parameter “&filter=0″ Google shows more posts that are findable with the search term [zilchish]. (Hey Matt and Brian, here’s room for improvement!)

Google doesn’t pass anchor text of nofollow’ed links to the link destination

A search for [zilchish site:sebastians-pamphlets.com] doesn’t show the testpage that doesn’t carry this term. In other words, so far the anchor text “zilchish crap” of the nofollow’ed sidebar link didn’t impact the test page’s rankings yet.

Google doesn’t treat anchor text of JavaScript links as textual content

A search for [nillified site:sebastians-pamphlets.com] doesn’t show any URIs that have “nil, nillified” as client sided anchor text on the sidebar, just the test page:

rel-nofollow-test search results

Results, conclusions, speculation

This test wasn’t intended to evaluate whether JS outputted anchor text gets passed to the link destination or not. Unfortunately “nil” and “nillified” appear both in the JS anchor text as well as on the page, so that’s for another post. However, it seems the JS anchor text isn’t indexed for the pages carrying the JS code, at least they don’t appear in search results for the JS anchor text, so most likely it will not be assigned to the link destination’s relevancy for “nil” or “nillified” as well.

Maybe Google’s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows.

However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I’m quite confident that condomized links still don’t pass reputation, nor topically relevance. My test page is unfindable for the nofollow’ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.

How to safely strip a link condom

And what’s the actual “news”? Well, say you’ve links that you must condomize because they’re paid or whatever, but you want that Google discovers the link destinations nevertheless. To accomplish that, just output a nofollow’ed link server sided, and change it to a clean link with JavaScript. Google told us for ages that JS links don’t count, so that’s perfectly in line with Google’s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don’t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client sided version of the link. Isn’t that neat?

Link condoms with juicy taste faking good karma

Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they look plain ugly otherwise.

Here is how you can do this for a complete Web page. This link is nofollow’ed. The JavaScript code below changed its REL value to “dofollow”. When you put this code at the bottom of your pages, it will un-condomize all your nofollow’ed links.
<script type="text/javascript">
if (document.getElementsByTagName) {
var aElements = document.getElementsByTagName("a");
for (var i=0; i<aElements.length; i++) {
var relvalue = aElements[i].rel.toUpperCase();
if (relvalue.match("NOFOLLOW") != "null") {
aElements[i].rel = "dofollow";
}
}
}
</script>

(You’ll find still condomized links on this page. That’s because the JavaScript routine above changes only links placed above it.)

When you add JavaScript routines like that to your pages, you’ll increase their page loading time. IOW you slow them down. Also, you should add a note to your linking policy to avoid confused advertisers who chase toolbar PageRank.

Updates: Obviously Google distrusts me, how come? Four days after the link discovery the search quality archangel requested the nofollow’ed URI –without query string– possibly to check whether I serve different stuff to bots and people. As if I’d cloak, laughable. (Or an assclown linked the URI without condom.)
Day five: Google’s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren’t interested in client-sided links at all?



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Save bandwidth costs: Dynamic pages can support If-Modified-Since too

Conditional HTTP GET requests make Webmasters and Crawlers happyWhen search engine crawlers burn way too much of your bandwidth, this post is for you. Crawlers sent out by major search engines (Google, Yahoo and MSN/Live Search) support conditional GETs, that means they don’t fetch your pages if those didn’t change since the last crawl.

Of course they must fetch your stuff over and over again for this comparision, if your Web server doesn’t play nice with Web robots, as well as with other user agents that can  cache your pages and other Web objects like images. The protocol your Web server and the requestors use to handle caching is quite simple, but its implementation can become tricky. Here is how it works:

1st request Feb/10/2008 12:00:00

Googlebot requests /some-page.php from your server. Since Google has just discovered your page, there are no unusual request headers, just a plain GET.

You create the page from a database record which was modified on Feb/09/2008 10:00:00. Your server sends Googlebot the full page (5k) with an HTTP header
Date: Sun, 10 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT

(lets assume your server is located in Greenwich, UK), the HTTP response code is 200 (OK).

Bandwidth used: 5 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

2nd request Feb/17/2008 12:00:00

Googlebot found interesting links pointing to your page, so it requests /some-page.php again to check for updates. Since Google already knows the resource, Googlebot requests it with an additional HTTP header
If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT

where the date and time is taken from the Last-Modified header you’ve sent in your response to the previous request.

You didn’t change the page’s record in the database, hence there’s no need to send the full page again. Your Web server sends Googlebot just an HTTP header
Date: Sun, 17 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT

The HTTP response code is 304 (Not Modified). (Your Web server can suppress the Last-Modified header, because the requestor has this timestamp already.)

Bandwidth used: Less than 500 bytes for the HTTP header.

3rd request Feb/24/2008 12:00:00

Googlebot can’t resist to recrawl /some-page.php, again using the
If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT

header.

You’ve updated the database on Feb/23/2008 09:00:00 adding a few paragraphs to the article, thus you send Googlebot the full page (now 7k) with this HTTP header
Date: Sun, 10 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 23 Feb 2008 09:00:00 GMT

and an HTTP response code 200 (OK).

Bandwidth used: 7 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

Further requests

Provided you don’t change the contents again, all further chats between Googlebot and your Web server regarding /some-page.php will burn less than 500 bytes of your bandwidth each. Say Googlebot requests this page weekly, that’s 370k saved bandwidth annually. You do the math. Even with a medium-sized Web site you most likely want to implement proper caching, right?

Not only Webmasters love conditional GET requests that save bandwidth costs and processing time, search engines aren’t keen on useless data transfers too. So lets see how you could respond efficiently to conditional GET requests from search engines. Apache handles caching of static files (e.g. .txt or .html files you upload with FTP) differently from dynamic contents (script outputs with or without a query string in the URI).

Static files

Fortunately, Apache comes with native support of the Last-Modified / If-Modified-Since / Not-Modified functionality. That means that crawlers and your Web server don’t produce too much network traffic when a requested static file  didn’t change since the last crawl.

You can test your Web server’s conditional GET support with your robots.txt, or, if even your robots.txt is a script, create a tiny HTML page with a text editor and upload it via FTP. Another neat tool to check HTTP headers is the Live Headers Extension for FireFox (bear in mind that testing crawler behavior with Web browsers is fault-prone by design).

If your second request of an unchanged static file results in a 200 HTTP response code, instead of a 304, call your hosting service. If it works and you’ve only static pages, then bookmark this article and move on.

Dynamic contents

Everything you output with server sided scripts is dynamic content by definition, regardless whether the URI has a query string or not. Even if you just read and print out a static file –that never changes– with PHP, Apache doesn’t add the Last-Modified header which forces crawlers to perform further requests with an If-Modified-Since header.

With dynamic content you can’t rely on Apache’s caching support, you must do it yourself.

The first step is figuring out where your CMS or eCommerce software hides the timestamps telling you the date and time of a page’s last modification. Usually a script pulls its stuff from different database tables, hence a page contains more than one area, or block, of dynamic contents. Every block might have a different last-modified timestamp, but not every block is important enough to serve as the page’s determinant last-modified date. The same goes for templates. Most template tweaks shouldn’t trigger a full blown recrawl, but some do, for example a new address or phone number if such information is present on every page.

For example a blog has posts, pages, comments, categories and other data sources that can change the sidebar’s contents quite frequently. On a page that outputs a single post or page, the last-modified date is determined by the post, respectively its last comment. The main page’s last-modified date is the modified-timestamp of the most recent post, and the same goes for its paginated continuations. A category page’s last-modified date is determined by the category’s most recent post, and so on.

New posts can change outgoing links of older posts when you use plugins that list related posts and stuff like that. There are many more reasons why search engines should crawl older posts at least monthly or so. You might need a routine that changes a blog page’s last-modified timestamp for example when it is a date more than 30 days or so in the past. Also, in some cases it could make sense to have a routine that can reset all timestamps reported as last-modified date for particular site areas, or even the whole site.

If your software doesn’t populate last-modified attributes on changes of all entities, then snap at the chance to consider database triggers, stored procedures, respectively changes of your data access layer. Bear in mind that not all changes of a record must trigger a crawler cache reset. For example a table storing textual contents like articles or product descriptions usually has a number of attributes that don’t affect crawling, thus it should have an attribute last updated  that’s changeable in the UI and serves as last-modified date in your crawler cache control (instead of the timestamp that’s changed automatically even on minor updates of attributes which are meaningless for HTML outputs).

Handling Last-Modified, If-Modified-Since, and Not-Modified HTTP headers with PHP/Apache

Below I provide example PHP code I’ve thrown together after midnight in a sleepless night, doped with painkillers. It doesn’t run on a production system, but it should get you started. Adapt it to your needs and make sure you test your stuff intensively. As always, my stuff comes as is  without any guarantees. ;)

First grab a couple helpers and put them in an include file you’ve available in all scripts. Since we deal with HTTP headers, you must not output anything before the logic that deals with conditional search engine requests, not even a single white space character, HTML DOCTYPE declaration …
View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)

In general, all user agents should support conditional GET requests, not only search engine crawlers. If you allow long lasting caching, which is fine with search engines that don’t need to crawl your latest Twitter message from your blog’s sidebar, you could leave your visitors with somewhat outdated pages if you serve them 304-Not-Modified responses too.

It might be a good idea to limit 304 responses to conditional GET requests from crawlers, when you don’t implement way shorter caching cycles for other user agents. The latter includes folks that spoof their user agent name as well as scrapers trying to steal your stuff masked as a legit spider. To verify legit search engine crawlers that (should) support conditional GET requests (from Google, Yahoo, MSN and Ask) you can grab my crawler detection routines here. Include them as well, then you can code stuff like that:

$isSpiderUA = checkCrawlerUA ();
$isLegitSpider = checkCrawlerIP (__FILE__);
if ($isSpiderUA && !$isLegitSpider) {
@header("Thou shalt not spoof", TRUE, 403);
exit;
// make sure your 403-Forbidden ErrorDocument directive in
// .htaccess points to a page that explains the issue!
}
if ($isLegitSpider) {
// insert your code dealing with conditional GET requests
}

Now that you’re sure that the requestor is a legit crawler from a major search engine, look at the HTTP request header it has submitted to your Web server.

// lookup the HTTP request header for a possible conditional GET
$ifModifiedSinceTimestamp = getIfModifiedSince();
// if the request is not conditional, don’t send a 304
$canSend304 = FALSE;
if ($ifModifiedSinceTimestamp !== FALSE) {
$canSend304 = TRUE;

// Tells the requestor that you’ve recognized the conditional GET
$echoRequestHeader = "X-Requested-If-modified-since: "
.unixTimestamp2HttpDate($ifModifiedSinceTimestamp);
@header($echoRequestHeader, TRUE);
}

You don’t need to echo the If-Modified-Since HTTP-date in the response header, but this custom header makes testing easier.

Next get the page’s actual last-modified date/time. Here is an (incomplete) code sample for a WordPress single post page.

// select the requested post's comment_count, post_modified and
 // post_date values, then:
if ($wp_post_modified) {
$lastModified = date2UnixTimestamp($wp_post_modified);
}
else {
$lastModified = date2UnixTimestamp($wp_post_date);
}
if (intval($wp_comment_count) > 0) {
// select last comment from the WordPress database, then:
$lastCommentTimestamp = date2UnixTimestamp($wp_comment_date);
if ($lastCommentTimestamp > $lastModified) {
$lastModified = $lastCommentTimestamp;
}
}

The date2UnixTimestamp() function accepts MySQL datetime values as valid input. If you need to (re)write last-modified dates to a MySQL database, convert the Unix timestamps to MySQL datetime values with unixTimestamp2MySqlDatetime().

Your server’s clock isn’t necessarily synchronized with all search engines out there. To cover possible gaps you can use a last-modified timestamp that’s a little bit fresher than the actual last-modified date. In this example the timestamp reported to the crawler is last-modified + 13 hours, you can change the deviant in makeLastModifiedTimestamp().
$lastModifiedTimestamp = makeLastModifiedTimestamp($lastModified);

If you compare the timestamps later on, and the request isn’t conditional, don’t run into the 304 routine.
if ($ifModifiedSinceTimestamp === FALSE) {
// make things equal if the request isn't conditional
$ifModifiedSinceTimestamp = $lastModifiedTimestamp;
}

You may want to allow a full fetch if the requestor’s timestamp is ancient, in this example older than one month.
$tooOld = @strtotime("now") - (31 * 24 * 60 * 60);
if ($ifModifiedSinceTimestamp < $tooOld) {
$lastModifiedTimestamp = @strtotime("now");
$ifModifiedSinceTimestamp = @strtotime("now") - (1 * 24 * 60 * 60);
}

Setting the last-modified attribute to yesterday schedules the next full crawl after this fetch in 30 days (or later, depending on the actual crawl frequency).

Finally respond with 304-Not-Modified if the page wasn’t remarkably changed since the date/time given in the crawler’s If-Modified-Since header. Otherwise send a Last-Modified header with a 200 HTTP response code, allowing the crawler to fetch the page contents.
$lastModifiedHeader = "Last-Modified: " .unixTimestamp2HttpDate($lastModifiedTimestamp);
if ($lastModifiedTimestamp < $ifModifiedSinceTimestamp &&
$canSend304) {
@header($lastModifiedHeader, TRUE, 304);
exit;
}
else {
@header($lastModifiedHeader, TRUE);
}

When you’re testing your version of this script with a browser, it will send a standard HTTP request, and your server will return a 200-OK. From your server’s response your browser should recognize the “Last-Modified” header, so when you reload the page the browser should send an “If-Modified-Since” header and you should get the 304 response code if Last-Modified > If-Modified-Since. However, judging from my experience such browser based tests of crawler behavior, respectively responses to crawler requests, aren’t reliable.

Test it with this MS tool instead. I’ve played with it for a while and it works great. With the PHP code above I’ve created a 200/304 test page
http://sebastians-pamphlets.com/tools/last-modified-yesterday.php
that sends a “Last-Modified: Yesterday” response header, and should return a 304-Not Modified HTTP response code when you request it with an “If-Modified-Since: Today+” header, otherwise it should respond with 200-OK (this version returns 200-OK only but tells when it would  respond with a 304). You can use this URI with the MS-tool linked above to test HTTP requests with different If-Modified-Since headers.

Have fun and paypal me 50% of your savings. ;)



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

The hacker tool MSN-LiveSearch is responsible for brute force attacks

401 = Private Property, keep out!A while ago I’ve staged a public SEO contest, asking whether the 401 HTTP response code prevents from search engine indexing or not.

Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their try to fetch a password protected URL bounces with a 401 HTTP response code that translates to a polite “Authorization Required”, meaning “Forbidden unless you provide valid authorization”.

Experience of life and common sense tell search engines, that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters/site owners do not point their users to protected content.

Also, that makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly “Enter user and password in order to access [title of the protected area]”, that resolves to a harsh error message because the searcher can’t provide such information, and usually can’t even sign up from the 401 error page1.

Evil use of search resultsUnfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example /members/. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web.

Developing a script that searches for a string like /members/ in URLs and then “tests” the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker’s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DOS-like server attacks.

Ok, lets look at the test results. All search engines have casted their votes now. Here are the winners:

Google :)

Once my test was out, Matt Cutts from Google researched the question and told me:

My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won’t be indexed even as a reference, so .htacess password-protected directories shouldn’t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn’t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.

Well, that’s exactly the expected behavior, and I wasn’t surprised that my test results confirm Matt’s statement. Thanks to Google’s BlitzIndexing™ Ms. Googlebot spotted the 401 so fast, that the URL never showed up on Google’s SERPs. Google reports the protected URL in my Webmaster Console account for this blog as not indexable.

Yahoo :)

Yahoo’s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that’s going to change if M$ buys Yahoo.

Ask :)

Ask’s crawler isn’t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.

And here is the ultimate loser:

MSN LiveSearch :(

Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker’s toolbox:

MSN LiveSearch indexes password protected URLs

As if indexing references to password protected URLs wouldn’t be crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolute no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The sitemap protocol, that MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:

MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)
MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)

All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask have indexed the sitemap file (they never index submitted sitemaps that have no inbound links by the way) or its protected URL.

Summary

All major search engines except MSN respect the 401 barrier.

Since MSN LiveSearch is well known for spamming, it’s not a big surprise that they support hackers, scrapers and other content thieves.

Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can’t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn’t exist in most niches), and so on. Once pointed to such crap, they don’t even fix the simplest bugs in a timely manner. I mean, not complying to the HTTP 1.1 protocol from the last century is an evidence of incapacity, and that’s just one example.

 

Update Feb/06/2008: Last night I’ve received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, Nathan Buggia! I’m still curious how MSN Live Search will handle sitemap files in the future.

 


1 Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there’s no 401 error document. Please note that you shouldn’t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a static 401 error page that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you must not use absolute URLs (http://… or https://…) in the ErrorDocument 401 directive, and that on the error page you must use absolute URLs for CSS, images, links and whatnot because relative URIs don’t work there!



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Google removes the #6 penalty/filter/glitch

Google removed the position six penaltyAfter the great #6 Penalty SEO Panel Google’s head of the webspam dept. Matt Cutts digged out a misbehaving algo and sent it back to the developers. Two hours ago he stated:

When Barry asked me about “position 6″ in late December, I said that I didn’t know of anything that would cause that. But about a week or so after that, my attention was brought to something that could exhibit that behavior.

We’re in the process of changing the behavior; I think the change is live at some datacenters already and will be live at most data centers in the next few weeks.

 

So everything is fine now. Matt penalizes the position-six software glitch, and lost top positions will revert to their former rankings in a while. Well, not really. Nobody will compensate income losses, nor the time Webmasters spent on forums discussing a suspected penalty that actually was a bug or a weird side effect. However, kudos to Google for listening to concerns, tracking down and fixing the algo. And thanks for the update, Matt.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Avoiding the well known #4 SERP-hero-penalty …

Seb the red claw… I just have to link to North South Media’s neat collection of Search Action Figures.

Paul pretty much dislikes folks who don’t link to him, so Danny Sullivan and Rand Fishkin are well advised to drop a link every now and then, and David Naylor better gives him an interview slot asap. ;)

Google’s numbered “penalties”, esp. #6

As for numeric penalties in general … repeat("Sigh", ) … enjoy this brains trust moderated by Marty Weintraub (unauthorized):

Marty: Folks, please welcome Aaron Wall, who recently got his #6 penalty removed!

Audience: clap(26) sphinn(26)

The Gypsy: Sorry Marty but come on… this is complete BS and there is NO freakin #6 filter just like the magical minus 90…900 bla bla bla. These anomalies NEVER have any real consensus on a large enough data set to even be considered a viable theory.

A Red Crab: As long as Bill can’t find a plus|minus-n-raise|penalty patent, or at least a white paper or so leaked out from Google, or for all I care a study that provides proof instead of weird assumptions based on claims of webmasters jumping on todays popular WMW band wagon that aren’t plausible nor verifiable, such beasts don’t exist. There are unexplained effects that might look like a pattern, but in most cases it makes no sense to gather a few examples coming with similarities because we’ll never reach the critical mass of anomalies to discuss a theory worth more than a thumbs-down click.

Marty: Maybe Aaron is joking. Maybe he thinks he has invented the next light bulb.

Gamermk: Aaron is grasping at straws on this one.

Barry Welford: I would like this topic to be seen by many.

Audience: clap(29) sphinn(29)

The Gypsy: It is just some people that have DECIDED on an end result and trying to make various hypothesis fit the situation (you know, like tobacco lobby scientists)… this is simply bad form IMO.

Danny Sullivan: Well, I’ve personally seen this weirdness. Pages that I absolutely thought “what on earth is that doing at six” rather than at the top of the page. Not four, not seven — six. It was freaking weird for several different searches. Nothing competitive, either.

I don’t know that sixth was actually some magic number. Personally, I’ve felt like there’s some glitch or problem with Google’s ranking that has prevented the most authorative page in some instances from being at the top. But something was going on.

Remember, there’s no sandbox, either. We got that for months and months, until eventually it was acknowledge that there were a range of filters that might produce a “sandbox like” effect.

The biggest problem I find with these types of theories is they often start with a specific example, sometimes that can be replicated, then they become a catch-all. Not ranking. Oh, it’s the sandbox. Well no — not if you were an established site, it wasn’t. The sandbox was typicaly something that hit brand new sites. But it became a common excuse for anything, producing confusion.

Jim Boykin: I’ll jump in and say I truely believe in the 6 filter. I’ve seen it. I wouldn’t have believed it if I hadn’t seen it happen to a few sites.

Audience: clap(31) sphinn(31)

A Red Crab: Such terms tend to become a life of their own, IOW an excuse for nearly every way a Webmaster can fuck up rankings. Of course Google’s query engine has thresholds (yellow cards or whatever they call them) that don’t allow some sites to rank above a particular position, but that’s a symtom that doesn’t allow back-references to a particular cause, or causes. It’s speculation as long as we don’t know more.

IncrediBill: I definitely believe it’s some sort of filter or algo tweak but it’s certainly not a penalty which is why I scoff at calling it such. One morning you wake up and Matt has turned all the dials to the left and suddenly some criteria bumps you UP or DOWN. Sites have been going up and down in Google SERPs for years, nothing new or shocking about that and this too will have some obvious cause and effect that could probably be identified if people weren’t using the shotgun approach at changing their site

G1smd: By the time anyone works anything out with Google, they will already be in the process of moving the goalposts to another country.

Slightly Shady SEO: The #6 filter is a fallacy.

Old School: It certainly occured but only affected certain sites.

Danny Sullivan: Perhaps it would have been better called a -5 penalty. Consider. Say Google for some reason sees a domain but decides good, but not sure if I trust it. Assign a -5 to it, and that might knock some things off the first page of results, right?

Look — it could all be coincidence, and it certainly might not necessarily be a penalty. But it was weird to see pages that for the life of me, I couldn’t understand why they wouldn’t be at 1, showing up at 6.

Slightly Shady SEO: That seems like a completely bizarre penalty. Not Google’s style. When they’ve penalized anything in the past, it hasn’t been a “well, I guess you can stay on the frontpage” penalty. It’s been a smackdown to prove a point.

Matt Cutts: Hmm. I’m not aware of anything that would exhibit that sort of behavior.

Audience: Ugh … oohhhh … you weren’t aware of the sandbox, either!

Danny Sullivan: Remember, there’s no sandbox, either. We got that for months and months, until eventually it was acknowledge that there were a range of filters that might produce a “sandbox like” effect.

Audience: Bah, humbug! We so want to believe in our lame excuses …

Tedster: I’m not happy with the current level of analysis, however, and definitely looking for more ideas.

Audience: clap(40) sphinn(40)


Of course the panel above is fictional, respectively assembled from snippets which in some cases change the message when you read them in their context. So please follow the links.

I wouldn’t go that far to say there’s no such thing as a fair amount of Web pages that deserve a #1 spot on Google’s SERPs, but rank #6 for unknown reasons (perhaps link monkey business, staleness, PageRank flow in disarray, anchor text repetitions, …). There’s something worth investigating.

However, I think that labelling a discussion of glitches or maybe filters that don’t behave based on a way too tiny dataset “#6 penalty” leads to the lame excuse for literally anything phenomenon.

Folks who don’t follow the various threads closely enough to spot the highly speculative character of the beast, will take it as fact and switch to winter sleep mode instead of enhancing their stuff like Aaron did. I can’t wait for the first “How to escape the Google -5 penalty” SEO tutorial telling the great unwashed that a “+5″ revisit-after meta tag will heal it.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Getting URLs outta Google - the good, the popular, and the definitive way

Keep out GoogleThere’s more and more robots.txt talk in the SEOsphere lately. That’s a good thing in my opinion, because the good old robots.txt’s power is underestimated. Unfortunately it’s quite often misused or even abused too, usually because folks don’t fully understand the REP (by following “advice” from forums instead of reading the real thing, or at least my stuff ).

I’d like to discuss the REP’s capabilities assumed to make sure that Google doesn’t index particular contents from three angles:

The good way
If the major search engines would support new robots.txt directives that Webmasters really need, removing even huge chunks of content from Google’s SERPs –without collateral damage– via robots.txt would be a breeze.
The popular way
Shamelessly stealing Matt’s official advice [Source: Remove your content from Google by Matt Cutts]. To obscure the blatant plagiarism, I’ll add a few thoughts.
The definitive way
Of course that’s not the ultimate way, but that’s the way Google’s cookies crumble, currently. In other words: Google is working on a leaner approach, but that’s not yet announced, thus you can’t use it; you still have to jump through many hoops.

The good way

Caution: Don’t implement code from this section, the robots.txt directives discussed here are not (yet/fully) supported by search engines!

Currently all robots.txt statements are crawler directives. That means that they can tell behaving search engines how to crawl a site (fetching contents), but they’ve no impact on indexing (listing contents on SERPs). I’ve recently published a draft discussing possible REP tags for robots.txt. REP tags are indexer directives known from robots meta tags and X-Robots-Tags, which –as on-page respectively per-URL directives– require crawling.

The crux is that REP tags must be assigned to URLs. Say you’ve a gazillion of printer friendly pages in various directories that you want to deindex at Google, putting the “noindex,follow,noarchive” tags comes with a shitload of work.

How cool would be this robots.txt code instead:
Noindex: /*printable
Noarchive: /*printable

Search engines would continue to crawl, but deindex previously indexed URLs respectively not index new URLs from
/articles/printable/*.htm
/manuals/printable/*.pdf
/products/descriptions/*.php?format=printable&product=*
...

provided those URLs aren’t disallow’ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn’t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can’t accumulate PageRank from internal or external links, you’d add
Norank: /*printable

to the robots.txt code block above.

If you don’t like that search engines index stuff you’ve disallow’ed in your robots.txt from 3rd party signals like inbound links, and that Google accumulates even PageRank for disallow’ed URLs, you’d put:
Disallow: /unsearchable/
Noindex: /unsearchable/
Norank: /unsearchable/

To fix URL canonicalization issues with PHP session IDs and other tracking variables you’d write for example
Truncate-variable sessionID: /

and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs.

Unfortunately, robots.txt is not yet that powerful, so please link to the REP tags for robotx.txt “RFC” to make it popular, and proceed with what you have at the moment.

Matt Cutts was kind enough to discuss Google’s take on contents excluded from search engine indexing in 10 minutes or less here:

You really should listen, the video isn’t that long.

In the following I’ve highlighted a few methods Matt has talked about:

Don’t link (very weak)
Although Google usually doesn’t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link from the cold.
.htaccess / .htpasswd (Matt’s first recommendation)
Since Google cannot crawl password protected contents, Matt declares this method to prevent content from indexing safe. I’m not sure what will happen when I spread a few strong links to somebody’s favorite smut collection, perhaps I’ll test some day whether Google and other search engines list such a reference on their SERPs.
robots.txt (weak)
Matt rightly points out that Google’s cool robots.txt validator in the Webmaster Console is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is, that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google’s smart capabilities to figure out suiteble references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn’t prevent content from showing up in Google and other engines as well. Matt didn’t mention