Archived posts from the 'SEO' Category

Get a grip on the Robots Exclusion Protocol (REP)

Thanks to the very nice folks over at SEOmoz I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round-up:

Robots Exclusion Protocol 101

My REP 101 links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of each REP component. It explains the difference between crawler directives and indexer directives, and which command hierarchy search engines follow when REP directives placed at different levels conflict.

Why do I think that solid REP knowledge is important right now? Not only because of the confusion that exists thanks to the volume of crappy advice provided at every Webmaster hangout. Of course understanding the REP makes webmastering easier, thus I’m glad when my REP related pamphlets are considered somewhat helpful.

I’ve a hidden agenda, though. I predict that the REP is going to change shortly. As usual, its evolution is driven by a major search engine, since the W3C and such organizations don’t bother with the conglomerate of quasi-standards and RFCs known as the Robots Exclusion Protocol. In general that’s not a bad thing. Search engines deal with the REP every day, so they have a legitimate interest.

Unfortunately, not every REP extension that search engines have invented so far is useful for Webmasters; some of them are plain crap. Learning from fiascos and riots of the past, the engines are well advised to ask Webmasters for feedback before they announce further REP directives.

I’ve a feeling that shortly a well known search engine will launch a survey regarding particular REP related ideas. I want Webmasters to be well aware of the REP’s complexity and functionality when they contribute their take on REP extensions. So please educate yourself. :)

My pamphlet discussing a possible standardization of REP tags as robots.txt directives could be a useful reference; also, please watch the great video here. ;)




Do search engines index references to password protected smut?

Recently Matt Cutts said that Google doesn’t index password protected content. I wasn’t sure whether or not that goes for all search engines. I thought that they might index at least references to protected URLs, like they all do with other uncrawlable content that has strong inbound links.

Well, SEO tests are dull and boring, so I thought I could have some fun with this one.

I’ve joked that I should use someone’s favorite smut collection to test it. Unfortunately, nobody was willing to trade porn passwords for link love or so. I’m not a hacker, hence I’ve created my own tiny collection of password protected SEO porn (this link is not exactly considered safe at work) as a test case.

I was quite astonished that, according to this post about SEO porn, next to nobody in the SEOsphere optimizes adult sites (of course that’s not true). From the comments I figured that some folks at least surf for SEO porn to evaluate the optimization techniques applied by adult Webmasters.

Ok, let’s extend that. Out yourself as an SEO porn savvy Internet marketer. Leave your email addy in the comments (don’t forget to tell me why I should believe that you’re over 18), and I’ll email you the super secret password for my SEO porn members area (!SAW). Trust me, it’s worth it, and perfectly legit due to the strictly scientific character of this experiment. If you’re somewhat shy, use a funny pseudonym.

I’d very much appreciate a little help with linkage too. Feel free to link to http://sebastians-pamphlets.com/porn/ with an adequate anchor text of your choice, and of course without condom.

Get the finest SEO porn available on this planet!

I’ve got the password, now let me in!




Getting URLs outta Google - the good, the popular, and the definitive way

There’s more and more robots.txt talk in the SEOsphere lately. That’s a good thing in my opinion, because the good old robots.txt’s power is underestimated. Unfortunately it’s quite often misused or even abused too, usually because folks don’t fully understand the REP (by following “advice” from forums instead of reading the real thing, or at least my stuff).

I’d like to discuss, from three angles, the REP’s capabilities that are supposed to make sure that Google doesn’t index particular contents:

The good way
If the major search engines supported new robots.txt directives that Webmasters really need, removing even huge chunks of content from Google’s SERPs –without collateral damage– via robots.txt would be a breeze.
The popular way
Shamelessly stealing Matt’s official advice [Source: Remove your content from Google by Matt Cutts]. To obscure the blatant plagiarism, I’ll add a few thoughts.
The definitive way
Of course that’s not the ultimate way, but that’s the way Google’s cookies crumble, currently. In other words: Google is working on a leaner approach, but that’s not yet announced, thus you can’t use it; you still have to jump through many hoops.

The good way

Caution: Don’t implement code from this section, the robots.txt directives discussed here are not (yet/fully) supported by search engines!

Currently all robots.txt statements are crawler directives. That means that they can tell behaving search engines how to crawl a site (fetching contents), but they’ve no impact on indexing (listing contents on SERPs). I’ve recently published a draft discussing possible REP tags for robots.txt. REP tags are indexer directives known from robots meta tags and X-Robots-Tags, which –as on-page or per-URL directives– require crawling.

The crux is that REP tags must be assigned to URLs. Say you’ve a gazillion printer-friendly pages in various directories that you want to deindex at Google; putting the “noindex,follow,noarchive” tags on each of them comes with a shitload of work.

How cool would this robots.txt code be instead:
Noindex: /*printable
Noarchive: /*printable

Search engines would continue to crawl, but would deindex previously indexed URLs, and not index new URLs, from
/articles/printable/*.htm
/manuals/printable/*.pdf
/products/descriptions/*.php?format=printable&product=*
...

provided those URLs aren’t disallow’ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn’t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can’t accumulate PageRank from internal or external links, you’d add
Norank: /*printable

to the robots.txt code block above.

If you don’t like that search engines index stuff you’ve disallow’ed in your robots.txt from 3rd party signals like inbound links, and that Google accumulates even PageRank for disallow’ed URLs, you’d put:
Disallow: /unsearchable/
Noindex: /unsearchable/
Norank: /unsearchable/

To fix URL canonicalization issues with PHP session IDs and other tracking variables you’d write for example
Truncate-variable sessionID: /

and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs.

Unfortunately, robots.txt is not yet that powerful, so please link to the REP tags for robots.txt “RFC” to make it popular, and proceed with what you have at the moment.

The popular way

Matt Cutts was kind enough to discuss Google’s take on contents excluded from search engine indexing in 10 minutes or less here:

You really should listen, the video isn’t that long.

In the following I’ve highlighted a few methods Matt has talked about:

Don’t link (very weak)
Although Google usually doesn’t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link to it out of the blue.
.htaccess / .htpasswd (Matt’s first recommendation)
Since Google cannot crawl password protected contents, Matt declares this method of preventing content from being indexed safe. I’m not sure what will happen when I spread a few strong links to somebody’s favorite smut collection; perhaps I’ll test some day whether Google and other search engines list such a reference on their SERPs.
robots.txt (weak)
Matt rightly points out that Google’s cool robots.txt validator in the Webmaster Console is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google’s smart capabilities to figure out suitable references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn’t prevent content from showing up in Google and other engines either. Matt didn’t mention Google’s experiments with Noindex: support in robots.txt, which need improvement but could resolve this dilemma.
Robots meta tags (Google only, weak with MSN/Yahoo)
The REP tag “noindex” in a robots meta element prevents indexing and, once spotted, deindexes previously listed stuff - at least at Google. According to Matt, Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys “noindex” totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too. Also, search engines must crawl a page to discover this indexer directive. Matt adds that robots meta tags are problematic because they’re buried on the pages and tend to get forgotten when no longer needed (Webmasters might forget to take the tag down, or forget to add it when search engine policies change, when work in progress gets released, or when outdated contents are taken down). Matt forgot to mention the neat X-Robots-Tags that can be used to apply REP tags in the HTTP header of non-HTML resources like images or PDF documents. Google’s X-Robots-Tag is supported by Yahoo too.
Rel-nofollow (kind of weak)
Although condoms totally remove links from Google’s link graphs, Matt says that rel-nofollow should not be used as a crawler or indexer directive. Rel-nofollow is for condomizing links only; other search engines do follow nofollow’ed links, and even Google can discover the link destination from other links they gather on the Web, or grab from internal links inadvertently lacking a link condom. Finally, rel-nofollow requires crawling too.
URL removal tool in GWC (Matt’s second recommendation)
Taking Matt’s enthusiasm while talking about Google’s neat URL terminator into account, this one should be considered his first recommendation. Google has provided tools to remove URLs from its search index for at least five years (way longer IIRC). Recently the Webmaster Central team has integrated those, as well as new functionality, into the Webmaster Console, giving it a very nice UI. The URL removal tools come with great granularity, and because the user’s site ownership is verified, it’s pretty powerful, safe, and it even shows the progress of each request (the removal process lasts a few days). Its UI is very flexible and allows even revoking of previous removal requests. The wonderful little tool’s sole weak point is that it can’t remove URLs from the search index forever. After 90 days or possibly six months the erased stuff can pop up again.

Summary: If your site isn’t password protected, and you can’t live with indexing of disallow’ed contents, you must remove unwanted URLs from Google’s search index periodically. However, there are additional procedures that can support –but not guarantee!– deindexing. With other search engines it’s even worse, because those don’t respect the REP as much as Google does, and don’t provide such handy URL removal tools.

The definitive way

Actually, I think Matt’s advice is very good, as long as you don’t need a permanent solution and lack the programming skills to develop such a beast that works with all (major) search engines. I mean, everybody can insert a robots meta tag or a robots.txt statement, and everybody can semiyearly repeat URL removal requests with the neat URL terminator, but most folks are scared when it comes to conditional manipulation of HTTP headers to prevent stuff from indexing. However, in the following examples I’ll try to explain quite safe methods that actually work (with Apache, not IIS).

First of all, if you really want search engines not to index your stuff, you must allow them to crawl it. And no, that’s not an oxymoron. At the moment there’s no such thing as an indexer directive on site-level. You can’t forbid indexing in robots.txt. All indexer directives require crawling of the URLs that you want to keep out of the SERPs. Of course that doesn’t mean you should serve search engine crawlers a book from each forbidden URL.

Let’s start with robots.txt. You put:
User-agent: *
Disallow: /images/
Disallow: /movies/
Disallow: /unsearchable/
 
User-agent: Googlebot
Disallow:
Allow: /
 
User-agent: Slurp
Disallow:
Allow: /

The first section is just a fallback.

(Here comes a rather brutal method that you can use to keep search engines out of particular directories. It’s not suitable to deal with duplicate content, session IDs, or other URL canonicalization. More on that later.)

Next edit your .htaccess file.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/unsearchable/
RewriteCond %{REQUEST_URI} !\.php
RewriteRule . /unsearchable/output-content.php [L]
</IfModule>

If you’ve .php pages in /unsearchable/ then remove the second rewrite condition, put output-content.php into another directory, and edit my PHP code below so that it includes the PHP scripts (don’t forget to pass the query string).

Now grab the PHP code to check for search engine crawlers here and include it below. Your script /unsearchable/output-content.php looks like:
<?php
@include("crawler-stuff.php"); // defines variables and functions
$isSpider = checkCrawlerIP ($requestUri);
if ($isSpider) {
@header("HTTP/1.1 403 Thou shalt not index this", TRUE, 403);
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
exit;
}
 
$arr = explode("#", $requestUri);
$outputFileName = $arr[0];
$arr = explode("?", $outputFileName);
$outputFileName = $_SERVER["DOCUMENT_ROOT"] .$arr[0];
if (substr($outputFileName, -1, 1) == "/") {
$outputFileName .= "index.html";
}
if (file_exists($outputFileName)) {
// send the content type header
$contentType = "text/plain";
if (stristr($outputFileName, ".html")) $contentType ="text/html";
if (stristr($outputFileName, ".css")) $contentType ="text/css";
if (stristr($outputFileName, ".js")) $contentType ="text/javascript";
if (stristr($outputFileName, ".png")) $contentType ="image/png";
if (stristr($outputFileName, ".jpg")) $contentType ="image/jpeg";
if (stristr($outputFileName, ".gif")) $contentType ="image/gif";
if (stristr($outputFileName, ".xml")) $contentType ="application/xml";
if (stristr($outputFileName, ".pdf")) $contentType ="application/pdf";
@header("Content-type: $contentType");
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
readfile($outputFileName);
exit;
}
 
// That’s not the canonical way to call the 404 error page. Don’t copy, adapt:
@header("HTTP/1.1 307 Oups, I displaced $outputFileName", TRUE, 307);
@header("Location: http://sebastians-pamphlets.com/404/");
exit;
?>

What does the gibberish above do? In .htaccess we rewrite all requests for resources stored in /unsearchable/ to a PHP script, which checks whether the request is from a search engine crawler or not.

If the requestor is a verified crawler (known IP or IP and host name belong to a major search engine’s crawling engine), we return an unfriendly X-Robots-Tag and an HTTP response code 403 telling the search engine that access to our content is forbidden. The search engines should assume that a human visitor receives the same response, hence they aren’t keen on indexing these URLs. Even if a search engine lists a URL on the SERPs by accident, it can’t tell the searcher anything about the uncrawled contents. That’s unlikely to happen actually, because the X-Robots-Tag forbids indexing (Ask and MSN might ignore these directives).

If the requestor is a human visitor, or an unknown Web robot, we serve the requested contents. If the file doesn’t exist, we call the 404 handler.

With dynamic content you must handle the query string and (expected) cookies yourself. PHP’s readfile() is binary safe, so the script above works with images or PDF documents too.
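
If /unsearchable/ holds dynamic .php pages (the case mentioned above where you remove the second rewrite condition and move output-content.php elsewhere), here’s a minimal sketch of the adapted gateway. It reuses the crawler check from the script above; the path handling is an assumption, adapt it to your setup:
<?php
@include("crawler-stuff.php"); // defines variables and functions, as above
$isSpider = checkCrawlerIP($_SERVER["REQUEST_URI"]);
if ($isSpider) {
@header("HTTP/1.1 403 Thou shalt not index this", TRUE, 403);
@header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
exit;
}
// The rewrite keeps the original query string, so $_GET and $_COOKIE are already populated.
$script = $_SERVER["DOCUMENT_ROOT"] . parse_url($_SERVER["REQUEST_URI"], PHP_URL_PATH);
if (file_exists($script) && substr($script, -4) == ".php") {
include($script); // the included script sends its own headers and output
exit;
}
@header("HTTP/1.1 404 Not Found", TRUE, 404);
exit;
?>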

If you’ve an original search engine crawler coming from a verifiable server feel free to test it with this page (user agent spoofing doesn’t qualify as crawler, come back in a week or so to check whether the engines have indexed the unsearchable stuff linked above).

The method above is not only brutal, it wastes all the juice from links pointing to the unsearchable site areas. To rescue the PageRank, change the script as follows:

$urlThatDesperatelyNeedsPageRank = "http://sebastians-pamphlets.com/about/";
if ($isSpider) {
@header("HTTP/1.1 301 Moved permanently", TRUE, 301);
@header("Location: $urlThatDesperatelyNeedsPageRank");
exit;
}

This redirects crawlers to the URL that has won your internal PageRank lottery. Search engines will/shall transfer the reputation gained from inbound links to this page. Of course page by page redirects would be your first choice, but when you block entire directories you can’t accomplish this kind of granularity.

By the way, when you remove the offensive 403-forbidden stuff in the script above and change it a little more, you can use it to apply various X-Robots-Tags to your HTML pages, images, videos and whatnot. When a search engine finds an X-Robots-Tag in the HTTP header, it should ignore conflicting indexer directives in robots meta tags. That’s a smart way to steer indexing of bazillions of resources without editing them.
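
For illustration, a sketch of that friendlier variant: the crawler branch just sets the header and then falls through to the normal content output (the tag values are examples, not a recommendation):
if ($isSpider) {
// serve the content, but steer indexing/caching via the HTTP header
@header("X-Robots-Tag: noarchive,nosnippet");
}
// ... then continue with the content type detection and readfile() as in the script above.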

Ok, this was the cruel method; now let’s discuss cases where telling crawlers how to behave is a royal PITA, thanks to the lack of indexer directives in robots.txt that provide the required granularity (Truncate-variable, Truncate-value, Order-arguments, …).

Say you’ve session IDs in your URLs. That’s one (not exactly elegant) way to track users or affiliate IDs, but it’s strictly forbidden when the requestor is a search engine’s Web robot.

In fact, a site with unprotected tracking variables is a spider trap that would produce infinite loops in crawling, because spiders following internal links with those variables discover new redundant URLs with each and every fetch of a page. Of course the engines found suitable procedures to dramatically reduce their crawling of such sites, which results in fewer indexed pages. Besides joyless index penetration there’s another disadvantage: the indexed URLs are powerless duplicates that usually rank beyond the sonic barrier at 1,000 results per search query.

Smart search engines perform highly sophisticated URL canonicalization to get a grip on such crap, but Webmasters can’t rely on Google & Co to fix their sites’ maladies.

Ok, we agree that you don’t want search engines to index your ugly URLs, duplicates, and whatnot. To properly steer indexing, you can’t just block the crawlers’ access to URLs/contents that shouldn’t appear on SERPs. Search engines discover most of those URLs when following links, and that means that they’re ready to assign PageRank or other scoring of link popularity to your URLs. PageRank / linkpop is a ranking factor you shouldn’t waste. Every URL known to search engines is an asset, hence handle it with care. Always bother to figure out the canonical URL, then do a page by page permanent redirect (301).

For your URL canonicalization you should have an include file that’s available at the very top of all your scripts, executed before PHP sends anything to the user agent (don’t hack each script; maintaining so many places handling the same stuff is a nightmare, and fault-prone). In this include file put the crawler detection code and your individual routines that handle canonicalization and other search engine friendly cloaking routines.

View a Code example (stripping useless query string variables).
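
In case you can’t grab that example, here’s a minimal sketch of what such an include might do. It assumes the crawler detection functions from above, and the tracking variable names (“sessionid”, “affid”, “ref”) are purely hypothetical placeholders:
<?php
@include("crawler-stuff.php"); // checkCrawlerIP() etc., as above
$isSpider = checkCrawlerIP($_SERVER["REQUEST_URI"]);
$junkVariables = array("sessionid", "affid", "ref"); // change this
$parsedUrl = parse_url($_SERVER["REQUEST_URI"]);
parse_str(isset($parsedUrl["query"]) ? $parsedUrl["query"] : "", $args);
$cleanArgs = array();
foreach ($args as $name => $value) {
$name = strtolower($name); // fix case issues with variable names
if (!in_array($name, $junkVariables) && $value !== "") $cleanArgs[$name] = $value;
}
ksort($cleanArgs); // canonical (alphabetical) argument order
$canonicalUri = $parsedUrl["path"];
if (count($cleanArgs) > 0) $canonicalUri .= "?" . http_build_query($cleanArgs);
if ($isSpider && $canonicalUri != $_SERVER["REQUEST_URI"]) {
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: http://example.com" . $canonicalUri); // change this
exit;
}
?>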

How you implement the actual canonicalization routines depends on your individual site. I mean, if you didn’t have the coding skills necessary to accomplish that, you wouldn’t have read this entire section, would you?

Here are a few examples of pretty common canonicalization issues:

  • Session IDs and other stuff used for user tracking
  • Affiliate IDs and IDs used to track the referring traffic source
  • Empty values of query string variables
  • Query string arguments put in different order / not checking the canonical sequence of query string arguments (ordering them alphabetically is always a good idea)
  • Redundant query string arguments
  • URLs longer than 255 bytes
  • Server name confusion, e.g. subdomains like “www”, “ww”, “random-string” all serving identical contents from example.com
  • Case issues (IIS/clueless code monkeys handling GET-variables/values case-insensitive)
  • Spaces, punctuation, or other special characters in URLs
  • Different scripts outputting identical contents
  • Flawed navigation, e.g. passing the menu item to the linked URL
  • Inconsistent default values for variables expected from cookies
  • Accepting undefined query string variables from GET requests
  • Contentless pages, e.g. outputted templates when the content pulled from the database equals whitespace or is not available

Summary

Hiding contents from all search engines requires programming skills that many sites can’t afford. Even leading search engines like Google don’t provide simple and suitable ways to deindex content –respectively to prevent content from being indexed– without collateral damage (lost/wasted PageRank). We desperately need better tools. Maybe my robots.txt extensions are worth an inspection.




Ping the hell out of Technorati’s reputation algo

If your Technorati reputation factor sucks ass then read on, otherwise happily skip this post.

Technorati calculates a blog’s authority/reputation based on its link popularity, counting blogroll links from the linking blogs’ main pages as well as links within the contents of their posts. Links older than six months after their very first discovery don’t count.

Unfortunately, Technorati is not always able to find all your inbound links, usually because clueless bloggers forget to ping, hence your blog might be undervalued. You can change that.

Compile a list of blogs that link to you and are unknown at Technorati, then introduce them below to a cluster ping orgy. Technorati will increase your authority rating after indexing those blogs.

Enter one blog home page URL per line, all lines properly delimited with a “\n” (new line, just hit [RETURN]; “\r” crap doesn’t work). And make sure that all these blogs have an auto-discovery link pointing to a valid feed in their HEAD section. Do NOT ping Technorati with post-URIs! Invest the time to click through to the blog’s main page and submit the blog-URI instead. Post-URI pings get mistaken for noise and trigger spam traps, which means their links will not increase your Technorati authority/rank.
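
(Under the hood such pings are plain XML-RPC calls following the weblogUpdates.ping convention. If you’d rather ping from your own code than via this form, here’s a minimal PHP sketch; the Technorati endpoint URL is the historically documented ping address, so treat it as an assumption and verify it before you rely on it:)
<?php
// Sends a weblogUpdates.ping for one blog; returns the raw XML-RPC response.
function pingTechnorati($blogName, $blogUrl) {
$payload = '<?xml version="1.0"?>'
. '<methodCall><methodName>weblogUpdates.ping</methodName><params>'
. '<param><value><string>' . htmlspecialchars($blogName) . '</string></value></param>'
. '<param><value><string>' . htmlspecialchars($blogUrl) . '</string></value></param>'
. '</params></methodCall>';
$ch = curl_init("http://rpc.technorati.com/rpc/ping"); // assumed endpoint, verify it
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($ch);
curl_close($ch);
return $response; // a flerror value of 0 in the response means the ping was accepted
}
print pingTechnorati("Example Blog", "http://example.com/blog/");
?>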

 


Actually, this tool pings other services than Technorati too. Pingable contents make it onto the SERPs, not only at Technorati.

If you make use of URL canonicalization routines that add a trailing slash to invalid URLs like http://example.com then make sure that you claim your blog at Technorati with the trailing slash.

Please note that this tool is experimental and expects a Web standard friendly browser. It might not work for you, and I’ll remove it if it gets abused.




No more RSS feeds in Google’s search results

Folks try all sorts of naughty things when by accident a blog’s feed outranks the HTML version of a post. Usually that happened to less popular blogs, or with very old posts and categorized feeds that contain ancient articles.

The problem seems to be that Google’s Web search doesn’t understand the XML structure of feeds, so that a feed’s textual contents get indexed like stuff from text files. Due to “subscribe” buttons and other links, feeds can gather more PageRank than some HTML pages. Interestingly .xml is considered an unknown file type, and advanced search doesn’t provide a way to search within XML files.

Now that has changed [1]. Googler Bogdan Stănescu posts “We remove feeds from our search results” on the German Webmaster blog [2]:

As Webmasters many of you were probably worried that your RSS or Atom feeds could outrank the accompanying HTML pages in Google’s search results. The emergence of feeds in our search results could be a poor user experience:

1. Feeds increase the probability that the user gets the same search result twice.

2. Users who click on the feed link on a SERP may miss out on valuable content, which is only available on the HTML page referenced in the XML file.

For these reasons, we have removed feeds from our Web search results - with the exception of podcasts (feeds with media files).

[…] We are aware that in addition to the podcasts out there some feeds exist that are not linked with an HTML page, and that is why it is not quite ideal to remove all feeds from the search results. We’re still open for feedback and suggestions for improvements to the handling of feeds. We look forward to your comments and questions in the crawling, indexing and ranking section of our discussion forum for Webmasters. [Translation mine]

I’m not yet sure whether or not that will end in a ban of all/most XML documents. I hope they suppress RSS/Atom feeds only, and provide improved ways to search for and within other XML resources.

So what does that mean for blog SEO? Unless Google provides a procedure to prevent feeds from accumulating PageRank whilst allowing access for blog search crawlers that request feeds (I believe something like that is in the works), it’s still a good idea to nofollow all feed links, but there’s absolutely no reason to block them in robots.txt any more.

I think that’s a great move in the right direction, but only a preliminary solution. The XML structure of feeds isn’t that hard to parse, and there are only so many ways to extract the URL of the HTML page. So when a relevant feed lands in a raw result set, Google should display a link to the HTML version on the SERP. What do you think?


[1] Danny reminded me that, according to Matt Cutts, that’s been going on for a few months now.

[2] 24 hours later Google published the announcement in English too.




Upgrading from IIS/ASP to Apache/PHP

Once you’re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won’t work, and you can’t run MS-Access “databases” and such stuff under Apache.

Here is my idea of a smooth migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster’s portfolio and start over.

(Recently I got a tiny IIS/ASP site about uses & abuses of link condoms and moved it to an Apache server. I’m well known for brutal IIS rants, but so far I didn’t discuss a way out of such a dilemma, so I thought blogging this move could be a good idea.)

I don’t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer’s article Migrating from Microsoft Access/MS-SQL to MySQL, and try tools like ASP to PHP. (With my tiny link condom site I overwrote the ASP code with PHP statements in my primitive text editor.)

From an SEO perspective such an upgrade comes with pitfalls:

  • Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.
  • Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.
  • Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.
  • IIS URIs are not case sensitive, which means that /Default.asp will 404 on Apache when the file name is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.
  • Most probably search engines have URL variants in their indexes, so we want to adapt to their URL canonicalization, at least where possible.
  • HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.
  • If the navigation makes use of relative links, we need to convert those to absolute URLs.
  • Error handling isn’t convertible. Improper error handling can cause decreasing search engine traffic.

Running /default.asp, /home.asp etc. as PHP scripts

When you upload an .asp file to an Apache Web server, most user agents can’t handle it. Browsers treat .asp files as unknown file types and force downloads instead of rendering them. Also, those files aren’t parsed for PHP statements, even if you’ve already rewritten the ASP code.

To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root:
AddType text/html .asp
AddHandler application/x-httpd-php .asp

The first line says that .asp files shall be treated as HTML documents, and should force the server to send a Content-Type: text/html HTTP header. The second line tells Apache that it must parse .asp files for PHP code.

Just in case the AddType statement above doesn’t produce a Content-Type: text/html header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you’ve mod_headers available, you can accomplish that with this .htaccess code:
<IfModule mod_headers.c>
SetEnvIf Request_URI \.asp is_asp=is_asp
Header set "Content-type" "text/html" env=is_asp
Header set imagetoolbar "no"
</IfModule>

(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)
If for some reason mod_headers doesn’t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top:
@header("Content-type: text/html", TRUE);

Instead of “text/html” alone, you can define the character set too: “text/html; charset=UTF-8”

Sanitizing the home page URL by eliminating “default.asp”

Instead of slowing down Apache by defining just another default document name (DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp), we get rid of “/default.asp” with this “/index.php” script:
<?php
@require("default.asp");
?>

Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.

Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess:
Redirect 301 /default.asp http://example.com/
Redirect 301 /Default.asp http://example.com/

Converting the ASP code for server name canonicalization

If you find ASP canonicalization routines like
<%@ Language=VBScript %>
<%
if strcomp(Request.ServerVariables("SERVER_NAME"), "www.example.com", vbCompareText) = 0 then
Response.Clear
Response.Status = "301 Moved Permanently"
strNewUrl = Request.ServerVariables("URL")
if instr(1,strNewUrl, "/default.asp", vbCompareText) > 0 then
strNewUrl = replace(strNewUrl, "/Default.asp", "/")
strNewUrl = replace(strNewUrl, "/default.asp", "/")
end if
if Request.QueryString <> "" then
Response.AddHeader "Location","http://example.com" & strNewUrl & "?" & Request.QueryString
else
Response.AddHeader "Location","http://example.com" & strNewUrl
end if
Response.End
end if
%>

(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]

(you need mod_rewrite, that’s usually enabled with the default configuration of Apache Web servers).

Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value

Probably an M$ developer didn’t read more than the scheme and server name chapters of the URL/URI standards; at least I’ve no better explanation for the fact that these clowns made the path and query string segments of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)

Just because –contrary to Web standards– M$ finds it funny to serve the same contents on request of /Home.asp as well as /home.ASP, such crap doesn’t fly on the World Wide Web. Search engines –and other Web services which store URLs– treat them as different URLs, and consider everything except one version duplicate content.

Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like “/Script.asp”, although the file itself is stored with an all-lowercase name, and the FTP client uploads “/script.asp” to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won’t get 404-Not found responses.

Don’t misunderstand me, you’re free to camel-case file names for improved readability, but then make sure that the file system’s notation matches the URIs in HREF/SRC values. (Of course hyphened file names like “buy-cheap-viagra.asp” top the CamelCased version “BuyCheapViagra.asp” when it comes to search engine rankings, but don’t freak out about keywords in URLs, that’s ranking factor #202 or so.)

Technically speaking, converting all file names, variable names and values to all-lowercase is the simplest solution. This way it’s quite easy to 301-redirect all invalid requests to the canonical URLs.

However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (MSN Live Search for example doesn’t follow permanent redirects and doesn’t pass the reputation earned by the old URL over to the new URL). So if you’ve good SERP positions for “misspelled” URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site’s server stats to identify all URLs you don’t want to redirect. By the way, the link reports in Google’s Webmaster Console and Yahoo’s Site Explorer reveal invalid URLs with (internal as well as external) inbound links too.

Whatever strategy fits your needs best, you have to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive:
ErrorDocument 404 /404handler.php

That’s safe with static URLs without parameters and should work with dynamic URIs too. When you –in some cases– deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /404handler.php [L]
</IfModule>

In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn’t exist.

And here is the PHP script /404handler.php:
View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)
(Edit the values in all lines marked with “// change this”.)

This script doesn’t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you’ve more than a dozen or so.

Let’s see what the /404handler.php script does with requests of non-existing files.

First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don’t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn’t rank for anything but illustrates the functionality): both /sample.asp as well as /Sample.asp deliver the same content, although there’s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordingly.

Next we lookup the all lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: /About.asp 301-redirects to /about.asp, which is the file that exists.

Finally, if everything we tried in order to find a suitable URI for the actual request has failed, we send the client a 404 error code and output the error page. Example: /gimme404.asp doesn’t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp; a direct request of /error.asp, however, responds with a 200-OK.
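
If you can’t grab the real script, here’s a minimal sketch of a handler doing exactly these three steps. The mappings and file names are hypothetical placeholders; edit everything marked with “// change this”:
<?php
$requestUri = $_SERVER["REQUEST_URI"];
$path = parse_url($requestUri, PHP_URL_PATH);
// 1. Misspelled URLs with nice rankings: serve the real script under the requested URL.
$rankingUrls = array("/Sample.asp" => "/sample.asp"); // change this
if (isset($rankingUrls[$path])) {
@include($_SERVER["DOCUMENT_ROOT"] . $rankingUrls[$path]); // parsed as PHP thanks to the AddHandler above
exit;
}
// 2. If the all-lowercase version of the path exists, 301-redirect to it.
$lowerPath = strtolower($path);
$queryString = parse_url($requestUri, PHP_URL_QUERY);
if ($lowerPath != $path && file_exists($_SERVER["DOCUMENT_ROOT"] . $lowerPath)) {
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: http://example.com" . $lowerPath . ($queryString ? "?" . $queryString : "")); // change this
exit;
}
// 3. Nothing found: send a real 404 and display the error page.
@header("HTTP/1.1 404 Not Found", TRUE, 404);
@include($_SERVER["DOCUMENT_ROOT"] . "/error.asp"); // change this
exit;
?>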

You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project’s needs.

Tweaking code for future maintenance

Legacy code comes with repetition, redundancy and duplication caused by developers who love copy+paste (respectively copy+paste+modify), or by Web design software that generates static files from templates. Even when you’re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, which gives you the opportunity to encapsulate all templated page areas.

Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <?php @include("header.php"); ?>, <?php @include("sidebar.php"); ?>, <?php @include("footer.php"); ?> and so on. Note that when you’ve HTML code in a PHP include file, you must add <?php ?> before the first line of HTML code or contents in included files. Also, leading spaces, empty lines and such which don’t hurt in HTML, can result in errors with PHP statements like header(), because those fail when the server has sent anything to the user agent (even a single space, new line or tab is too much).

It’s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<?php @include("top.php"); ?> […] <?php @include("bottom.php"); ?>) to all scripts. Tomorrow you’ll write a generic routine that you must have in all scripts, and you’ll happily do that in top.php. The day after tomorrow you’ll paste the GoogleAnalytics tracking code into bottom.php. With complex sites you need more hooks.

Using absolute URLs on different systems

Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn’t work with absolute URLs. Crap. Of course it works. The first statement in top.php is
@require($_SERVER["SERVER_NAME"] .".php");

This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you’ve a “dev.example.com.php” include file, on the production system example.com the file name resolves to “www.example.com.php”:
<?php
$baseUrl = "http://example.com";
?>

Then the menu in sidebar.php looks like:
<?php
$classVMenu = "vmenu";
print "
<img src=\"$baseUrl/vmenuheader.png\" width=\"128\" height=\"16\" alt=\"MENU\" />
<ul>
<li><a class=\"$classVMenu\" href=\"$baseUrl/\">Home</a></li>
<li><a class=\"$classVMenu\" href=\"$baseUrl/contact.asp\">Contact</a></li>
<li><a class=\"$classVMenu\" href=\"$baseUrl/sitemap.asp\">Sitemap</a></li>

</ul>
";
?>

Mixing X/HTML with server-side scripting languages is fault-prone and makes maintenance a nightmare. Don’t make the same mistake as WordPress. Avoid crap like this:
<li><a class="<?php print $classVMenu; ?>" href="<?php print $baseUrl; ?>/contact.asp"></a></li>

Error handling

I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root’s .htaccess file:
ErrorDocument 401 /get-the-fuck-outta-here.asp
ErrorDocument 403 /get-the-fudge-outta-here.asp
ErrorDocument 404 /404handler.php
ErrorDocument 410 /410-gone-forever.asp
ErrorDocument 503 /410-down-for-maintenance.asp
# …
Options -Indexes

Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script:
ErrorDocument 401 /error.php?errno=401
ErrorDocument 403 /error.php?errno=403
ErrorDocument 404 /404handler.php
ErrorDocument 410 /error.php?errno=410
ErrorDocument 503 /error.php?errno=503
# …
Options -Indexes

Note that relative URLs in pages or scripts called by ErrorDocument directives don’t work. Don’t use absolute URLs in the ErrorDocument directives themselves, because this way you get 302 response codes for 404 errors and crap like that. If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang, but that’s bad enough.) For more information please read my pamphlet Why error handling is important.

Last but not least create a robots.txt file in the root. If you’ve nothing to hide from search engine crawlers, this one will suffice:
User-agent: *
Disallow:
Allow: /

I’m aware that this tiny guide can’t cover everything. It should give you an idea of the pitfalls and possible solutions. If you’re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don’t view the source code of link-condom.com pages where I didn’t implement all tips from this tutorial. ;)




Advantages of a smart robots.txt file

A loyal reader of my pamphlets asked me:

I foresee many new capabilities with robots.txt in the future due to this [Google’s robots.txt experiments]. However, how the hell can a webmaster hide their robots.txt from the public while serving it up to bots without doing anything shady?

That’s a great question. On this blog I’ve a static robots.txt, so I’ve set up a dynamic example using code snippets from other sites: this robots.txt is what a user sees, and here is what various crawlers get on request of my robots.txt example. Of course crawlers don’t request a robots.txt file with a query string identifying themselves (/robots.txt?crawlerName=*) like in the preview links above, so it seems you’ll need a pretty smart robots.txt file.

Before I tell you how to smarten up a robots.txt file, let’s define the anatomy of a somewhat intelligent robots.txt script:

  • It exists. It’s not empty. I’m not kidding.
  • A smart robots.txt detects and verifies crawlers to serve customized REP statements to each spider. Customized code means a section for the requesting search engine, plus general crawler directives. Example:
    User-agent: Googlebot-Image
    Disallow: /
    Allow: /cuties/*.jpg$
    Allow: /hunks/*.gif$
    Allow: /sitemap*.xml$
    Sitemap: http://example.com/sitemap-images.xml
     
    User-agent: *
    Disallow: /cgi-bin/

    This avoids confusion, because complex static robots.txt files with a section for every crawler out there –plus a general section for other Web robots– are fault-prone, and might exceed the maximum file size some bots can handle. If you fuck up a single statement in a huge set of instructions, this may kill the process parsing your robots.txt, which results in no crawling at all, or possibly crawling of forbidden areas. Checking the syntax per engine with a lean robots.txt is way easier (supported robots.txt syntax: Google, Yahoo, Ask and MSN/LiveSearch - don’t use wildcards with MSN because they don’t really support them; that means at MSN wildcards are valid to match file types only).
  • A smart robots.txt reports all crawler requests. This helps with tracking when you change something. Please note that there’s a lag between the most recent request of robots.txt and the moment a search engine starts to obey it, because all engines cache your robots.txt.
  • A smart robots.txt helps identifying unknown Web robots, at least those which bother requesting it (ask Bill how to fondle rogue bots). From a log of suspect requests of your robots.txt you can decide whether particular crawlers need special instructions or not.
  • A smart robots.txt helps maintaining your crawler IP list.

Here is my step by step “how to create a smart robots.txt” guide. As always: if you suffer from IIS/ASP go search for reliable hosting (*ix/Apache).

In order to make robots.txt a script, tell your server to parse .txt files for PHP. (If you serve other .txt files than robots.txt, please note that you must add <?php ?> as the first line to all .txt files on your server!) Add this line to your root’s .htaccess file:
AddType application/x-httpd-php .txt

Next grab the PHP code for crawler detection from this post. In addition to the functions checkCrawlerUA() and checkCrawlerIP() you need a function delivering the right user agent name, so please welcome getCrawlerName() in your PHP portfolio:

View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)

(If your instructions for Googlebot, Googlebot-Mobile and Googlebot-Image are identical, you can put them in one single “Googlebot” section.)
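
In case the JavaScript toggle hides it from you, here’s a minimal sketch of what getCrawlerName() might look like. The crawler list is just an example, and the requestor is assumed to be verified already (checkCrawlerIP() has returned TRUE):
function getCrawlerName() {
$userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
// order matters: test the specialized crawlers before the generic "Googlebot"
$crawlerNames = array("Googlebot-Image", "Googlebot-Mobile", "Mediapartners-Google", "Googlebot", "Slurp", "msnbot");
foreach ($crawlerNames as $crawlerName) {
if (stristr($userAgent, $crawlerName)) return $crawlerName;
}
return "*"; // verified crawler without a special section: it gets the general directives
}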

And here is the PHP script “/robots.txt”. Include the general stuff like functions, shared (global) variables and whatnot.
<?php
@require($_SERVER["DOCUMENT_ROOT"] ."/code/generalstuff.php");

Probably your Web server’s default settings aren’t suitable to send out plain text files, hence instruct it properly.
@header("Content-Type: text/plain");
@header("Pragma: no-cache");
@header("Expires: 0");

If a search engine runs wild requesting your robots.txt too often, comment out the “no-cache” and “expires” headers.

Next check whether the requestor is a verifiable search engine crawler. Look up the host name and do a reverse DNS lookup.
$isSpider = checkCrawlerIP($requestUri);

Depending on $isSpider log the request either in a crawler log or an access log gathering suspect requests of robots.txt. You can store both in a database table, or in a flat file if you operate a tiny site. (Write the logging function yourself.)
$standardStatement = "User-agent: * \n Disallow: /cgi-bin/ \n\n";
print $standardStatement;
if ($isSpider) {
$lOk = writeRequestLog("crawler");
$crawlerName = getCrawlerName();
}
else {
$lOk = writeRequestLog("suspect");
exit;
}

If the requestor is not a search engine crawler you can verify, send a standard statement to the user agent and quit. Otherwise call getCrawlerName() to name the section for the requesting crawler.
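
Since the logging function is left as an exercise, here’s a minimal flat-file sketch of writeRequestLog() as a starting point; the log location is an assumption (ideally keep it outside the Web root), and a database table scales better:
function writeRequestLog($logType) {
$logFile = $_SERVER["DOCUMENT_ROOT"] . "/logs/" . $logType . ".log"; // change this
$line = date("Y-m-d H:i:s") . "\t"
. $_SERVER["REMOTE_ADDR"] . "\t"
. (isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "-") . "\t"
. $_SERVER["REQUEST_URI"] . "\n";
return (@file_put_contents($logFile, $line, FILE_APPEND) !== FALSE);
}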

Now you can output individual crawler directives for each search engine and its specialized crawlers.
$prnUserAgent = "User-agent: ";
$prnContent = "";
if ("$crawlerName" == "Googlebot-Image") {
$prnContent .= "$prnUserAgent $crawlerName\n";
$prnContent .= "Disallow: /\n";
$prnContent .= "Allow: /cuties/*.jpg$\n";
$prnContent .= "Allow: /hunks/*.gif$\n";
$prnContent .= "Allow: /sitemap*.xml$\n";
$prnContent .= "Sitemap: http://example.com/sitemap-images.xml\n\n";
}
if ("$crawlerName" == "Mediapartners-Google") {
$prnContent .= "$prnUserAgent $crawlerName \n Disallow:\n\n";
}

print $prnContent;
?>

Say the user agent is Googlebot-Image; the code above will output this robots.txt:
User-agent: *
Disallow: /cgi-bin/
 
User-agent: Googlebot-Image
Disallow: /
Allow: /cuties/*.jpg$
Allow: /hunks/*.gif$
Allow: /sitemap*.xml$
Sitemap: http://example.com/sitemap-images.xml

(Please note that crawler sections must be delimited by an empty line, and that if there’s a section for a particular crawler, this spider will ignore the general directives. Please consider reading more pamphlets discussing robots.txt and dull stuff like that.)

That’s it. Adapt. Enjoy.




Validate your robots.txt - Googlebot becomes smarter

Last week I reported that Google experiments with new crawler directives for use in robots.txt. Today Google has confirmed that Googlebot understands experimental REP syntax like Noindex:.

That means that forgotten –and, until recently, ignored– statements in your robots.txt might change the crawler’s behavior all of a sudden, without notice. I don’t know for sure which experimental crawler directives Google has implemented, but for example a line like
Noindex: /
in your robots.txt will now deindex your complete Web site.

“Noindex:” is not defined in the Robots Exclusion Protocol from 1994, and not mentioned in Google’s official documents.

John Müller from Google Zürich states:

At the moment we will usually accept the “noindex” directive in the robots.txt, but we are not yet at a point where we are willing to set it into stone and announce full support.

[…] I just want to remind everyone again that this is something that may still change over time. Be careful when playing with things like this.

My understanding of “be careful” is:

  • Create a separate section for Googlebot. Do not rely on directives addressing all Web robots. Especially when you’ve a Googlebot section already, Google’s crawler will ignore directives set under “all user agents” and process only the Googlebot section. Repeat all statements under User-agent: * in User-agent: Googlebot to make sure that Googlebot obeys them (see the example after this list).
  • RTFM
  • Do not use other crawler directives than
    Disallow:
    Allow:
    Sitemap:
    in the Googlebot section.
  • Don’t mess-up pattern matching.
    * matches a sequence of characters
    $ specifies the end of the URL
    ? separates the path from the query string, you can’t use it as wildcard!
  • Validate your robots.txt with the cool robots.txt analyzer in your Google Webmaster Console.
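
For illustration, a conservative robots.txt following this advice could look like the example below. The paths and the sitemap URL are placeholders; the point is that the Googlebot section repeats everything from the general section and sticks to plain Disallow:/Allow:/Sitemap: syntax:
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /unsearchable/
Sitemap: http://example.com/sitemap.xml
 
User-agent: *
Disallow: /cgi-bin/
Disallow: /unsearchable/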

Folks put the funniest stuff into their robots.txt, for example images or crawl delays like “Don’t crawl this site during our office hours”. Crawler directives from robots meta tags aren’t very popular, but they appear in many robots.txt files. Hence it makes sound sense to use what people express, regardless of the syntax errors.

Also, having the opportunity to manage page specific crawler directives like “noindex”, “nofollow”, “noarchive” and perhaps even “nopreview” on site level is a huge time saver, and eliminates many points of failure. Kudos to Google for this initiative, I hope it will make it into the standards.

I’ll test the experimental robots.txt directives and post the results. Perhaps I’ll set up a live test like this one.

Take care.


Update: Here is the live test of suspected, respectively desired, new crawler directives for robots.txt. I’ve added a few unusual statements to my robots.txt and uploaded scripts to monitor search engine crawling. The test pages provide links to search queries so you can check whether Google indexed them or not.

Please don’t link to the crawler traps, I’ll update this post with my findings. Of course I appreciate links, so here is the canonical URL:
http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/#live-robots-txt-test

Please note that you should not make use of the crawler directives below on production systems! Bear in mind that you can achieve all that with simple X-Robots-Tags in the HTTP headers. That’s a bullet-proof way to apply robots meta tags to files without touching them, and it works with virtual URIs too. X-Robots-Tags are sexy, but many site owners can’t handle them due to various reasons, whereas corresponding robots.txt syntax would be usable for everybody (not suffering from restrictive and/or free hosts).
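
For example, with Apache and mod_headers a few lines of .htaccess code can stamp an X-Robots-Tag onto whole groups of files without touching them (a sketch; the file extensions and tag values are just examples):
<IfModule mod_headers.c>
<FilesMatch "\.(pdf|doc)$">
Header set X-Robots-Tag "noindex,noarchive"
</FilesMatch>
</IfModule>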

Noindex:

robots.txt:
Noindex: /repstuff/noindex.php

Expected behavior:
No crawling/indexing. It seems Google interprets “Noindex:” as “Disallow:”.
Desired behavior:
“Follow:” is the REP’s default, hence Google should fetch everything and follow the outgoing links, but shouldn’t deliver Noindex’ed contents on the SERPs, not even as URL-only listings.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noindex.php Blocked by line 30: Noindex: /repstuff/noindex.php
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled (possibly caused by an outdated robots.txt cache).
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noindex.php.
2007-11-23: indexed and cached a page linked only from noindex.php.
(If an outdated robots.txt cache falsely allowed crawling, the search result(s) should disappear shortly after the next crawl.)
2007-11-26: deindexed, the same goes for the linked page (without recrawling).
2007-12-07: appeared under “URLs restricted by robots.txt” in GWC.
2007-12-17: I consider this case closed. Noindex: blocks crawling, deindexes previously indexed pages, and is suspected to block incoming PageRank.

Nofollow:

robots.txt:
Nofollow: /repstuff/nofollow.php

Expected behavior:
Crawling, indexing, and following the links as if there’s no “Nofollow:”.
Desired behavior:
Crawling, indexing, and ignoring outgoing links.
Google’s robots.txt validator:
Line 31: Nofollow: /repstuff/nofollow.php Syntax not understood
http://sebastians-pamphlets.com/repstuff/nofollow.php Allowed
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from nofollow.php (21 Nov 2007 23:19:37 GMT, for some reason not logged properly).
2007-11-23: indexed and cached a page linked only from nofollow.php.
2007-11-26: recrawled, deindexed, no longer cached. The same goes for the linked page.
2007-11-28: cached again, the timestamp on the cached copy “27 Nov 2007 01:11:12 GMT” doesn’t match the last crawl on “2007-11-26 16:47:11 EST” (EST = GMT-5).
2007-12-07: recrawled, still deindexed, cached. Linked page recrawled, cached.
2007-12-17: recrawled, still deindexed (probably caused by near duplicate content on noarchive.php and other pages involved in this test), cached copy dated 2007-12-07. Cache of the linked page still dated 2007-11-21. I consider this case closed. Nofollow: doesn’t work as expected, Google doesn’t support this statement.

Noarchive:

robots.txt:
Noarchive: /repstuff/noarchive.php

Expected behavior:
Crawling, indexing, following links, but no “Cached” links on the SERPs and no access to cached copies from the toolbar.
Desired behavior:
Crawling, indexing, following links, but no “Cached” links on the SERPs and no access to cached copies from the toolbar.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noarchive.php Allowed
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noarchive.php.
2007-11-23: indexed and cached a page linked only from noarchive.php.
2007-11-26: recrawled, deindexed, no longer cached. The linked page was deindexed without recrawling.
2007-11-28: cached again, the timestamp on the cached copy “27 Nov 2007 01:11:19 GMT” doesn’t match the last crawl on “2007-11-26 16:47:18 EST” (EST = GMT-5).
2007-11-29: recrawled, cache not yet updated.
2007-12-07: recrawled. Linked page recrawled.
2007-12-08: recrawled.
2007-12-11: recrawled the linked page, which is cached but not indexed.
2007-12-12: recrawled.
2007-12-17: still indexed, cached copy dated 2007-12-08. I consider this case closed. Noarchive: doesn’t work as expected, actually it does nothing although according to the robots.txt validator that’s supported –or at least known and accepted– syntax.

(It looks like Google understands Nosnippet: too, but I didn’t test that.)

Nopreview:

robots.txt:
Nopreview: /repstuff/nopreview.pdf

Expected behavior:
None, unfortunately.
Desired behavior:
No “view as HTML” links on the SERPs. Neither “nosnippet” nor “noarchive” suppress these helpful preview links, which can be pretty annoying in some cases. See NOPREVIEW: The missing X-Robots-Tag.
Google’s robots.txt validator:
Line 33: Nopreview: /repstuff/nopreview.pdf Syntax not understood
http://sebastians-pamphlets.com/repstuff/nopreview.pdf Allowed
Status:
Crawler requests of nopreview.pdf are logged here.
Google’s crawler / indexer:
2007-11-21: crawled the nopreview-pdf and the log page nopreview.php.
2007-11-23: indexed and cached the log file nopreview.php.
[2007-11-23: I replaced the PDF document with a version carrying a hidden link to an HTML file, and resubmitted it via Google’s add-url page and a sitemap.]
2007-11-26: The old version of the PDF is cached as a “view-as-HTML” version without links (considering the PDF was a captured print job, that’s a pretty decent result), and appears on SERPs for a quoted search. The page linked from the PDF and the new PDF document were not yet crawled.
2007-12-02: PDF recrawled. Googlebot followed the hidden link in the PDF and crawled the linked page.
2007-12-03: “View as HTML” preview not yet updated, the linked page not yet indexed.
2007-12-04: PDF recrawled. The preview link reflects the content crawled on 12/02/2007. The page linked from the PDF is not yet indexed.
2007-12-07: PDF recrawled. Linked page recrawled.
2007-12-09: PDF recrawled.
2007-12-10: recrawled linked page.
2007-12-14: PDF recrawled. Cached copy of the linked page dated 2007-12-11.
2007-12-17: I consider this case closed. Neither Nopreview: nor Noarchive: (in robots.txt since 2007-12-04) are suitable to suppress the HTML preview of PDF files.

Noindex: Nofollow:

robots.txt:
Noindex: /repstuff/noindex-nofollow.php
Nofollow: /repstuff/noindex-nofollow.php

Expected behavior:
No crawling/indexing, invisible on SERPs.
Desired behavior:
No crawling/indexing, and no URL-only listings, ODP titles/descriptions and stuff like that on the SERPs. “Noindex:” in combination with “Nofollow:” is a paraphrased “Disallow:”.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noindex-nofollow.php Blocked by line 35: Noindex: /repstuff/noindex-nofollow.php
Line 36: Nofollow: /repstuff/noindex-nofollow.php Syntax not understood
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noindex-nofollow.php.
2007-11-23: indexed and cached a page linked only from noindex-nofollow.php.
2007-11-26: deindexed without recrawling, the same goes for the linked page.
2007-11-29: the cached copy retrieved on 11/21 reappeared.
2007-12-08: appeared under “URL restricted by robots.txt” in my GWC acct.
2007-12-17: Case closed, see Noindex: above.

Noindex: Follow:

robots.txt:
Noindex: /repstuff/noindex-follow.php
Follow: /repstuff/noindex-follow.php

Expected behavior:
No crawling/indexing, hence unfollowed links.
Desired behavior:
Crawling, following and indexing outgoing links, but no SERP listings.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noindex-follow.php Blocked by line 38: Noindex: /repstuff/noindex-follow.php
Line 39: Follow: /repstuff/noindex-follow.php Syntax not understood
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noindex-follow.php.
2007-11-23: indexed and cached a page linked only from noindex-follow.php.
2007-11-26: deindexed without recrawling, the same goes for the linked page.
2007-12-08: appeared under “URL restricted by robots.txt” in my GWC acct.
2007-12-17: Case closed, see Noindex: above. Despite the Follow: directive, Google stopped crawling and deindexed the page.

Index: Nofollow:

robots.txt:
Index: /repstuff/index-nofollow.php
Nofollow: /repstuff/index-nofollow.php

Expected behavior:
Crawling/indexing, following links.
Desired behavior:
Crawling/indexing but ignoring outgoing links.
Google’s robots.txt validator:
Line 41: Index: /repstuff/index-nofollow.php Syntax not understood
Line 42: Nofollow: /repstuff/index-nofollow.php Syntax not understood
http://sebastians-pamphlets.com/repstuff/index-nofollow.php Allowed
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from index-nofollow.php.
2007-11-23: indexed and cached a page linked only from index-nofollow.php.
2007-11-26: recrawled and deindexed. The linked page was deindexed without recrawling.
2007-11-28: cached again, the timestamp on the cached copy “27 Nov 2007 01:11:26 GMT” doesn’t match the last crawl on “2007-11-26 16:47:25 EST” (EST = GMT-5).
2007-12-02: recrawled, the cached copy has vanished.
2007-12-07: recrawled. Linked page recrawled.
2007-12-08: recrawled.
2007-12-09: recrawled.
2007-12-10: recrawled.
2007-12-17: cached copy dated 2007-12-10, not indexed. Linked page not cached, not indexed. I consider this case closed. Google currently supports neither Index: nor Nofollow:.

(I didn’t test Noodp: and Unavailable_after: [RFC 850 formatted timestamp], although both directives would make sense in robots.txt too.)

2007-11-20:
Added the experimental statements to robots.txt.

2007-11-21:
Linked the test pages. Google crawled all of them, including the pages submitted via links on test pages.

2007-11-23:
Most (all but the PDF document) URLs appear on search result pages. If an outdated robots.txt cache falsely allowed crawling although the GWC validator said “Blocked”, the search results should disappear shortly after the next crawl. I’ve created a sitemap for all URLs above and submitted it. Although I’ve –for the sake of this experiment– cloaked text as well as links and put white links on white background, luckily there is no “we caught you black hat spammer” message in my Webmaster Console. Googlebot nicely followed the cloaked links and indexed everything.

2007-11-26:
Google recrawled a few pages (noarchive.php, index-nofollow.php and nofollow.php), then deindexed all of them. Only the PDF document is indexed, and Google created a “view-as-HTML” preview from this captured print job. It seems that Google crawled something from another host than “*.googlebot.com”, unfortunately I didn’t log all requests. Probably the deindexing was done by a sneaky bot discovering the simple cloaking. Since the linked URLs are out and 3rd party links to them can’t ruin the experiment any longer, I’ve stopped cloaking and show the same text/links to bots and users (actually, users see one more link but that should be fine with Google). There’s still no “thou shalt not cloak” message in my GWC account. Well, those pages are fairly new, perhaps not fully settled in the search index, so lets see what happens next.

2007-11-28
The PDF file as well as the three pages recrawled on 11/26/2007 21:45:00 GMT were reindexed, but the timestamp on the cached copies says “retrieved on 27 Nov 2007 01:15:00 GMT”. Maybe the date/time displayed on cached page copies doesn’t reflect Ms. Googlebot’s “fetched” timestamp, but the time the indexer pulled the page out of the centralized crawl results cache 3.5 hours after crawling.

It seems the “Noarchive:” directive doesn’t work, because noarchive.php was crawled and indexed twice providing a cached page copy. My “Nopreview:” creation isn’t supported either, but maybe Dan Crow’s team picks it up for a future update of their neat X-Robots-Tags (I hope so).

The noindex’ed pages (noindex.php, noindex-nofollow.php and noindex-follow.php) weren’t recrawled and remain deindexed. Interestingly, they don’t appear under “URLs blocked by robots.txt” in my GWC account. Provided the first crawling and indexing on 11/21/2007 was a “mistake” caused by a way too long cached robots.txt, and the second crawl on 11/26/2007 obeyed the “Noindex:” but ignored the (implicit) “Follow:”, it seems that indeed Google interprets “Noindex:” in robots.txt as “Disallow:”. If that is so and if it’s there to stay, they’re going to totally mess up the REP.

<rant> I mean, promoting a rel-nofollow microformat that –at least at launch time– didn’t share its semantics with the REP’s meta tags nor the –later introduced– X-Robots-Tags was evil enough. Ok, meanwhile they’ve corrected this conspiracy flaw by altering the rel-nofollow semantics step by step until “nofollow” in the REL attribute actually means nofollow and no longer just “pass no reputation”, at least at Google. Other engines still handle rel-nofollow according to the initial and officially still binding standard, and a gazillion Webmasters are confused as hell. In other words only a few search geeks understand what rel-nofollow is all about, but Google jauntily penalizes the great unwashed for not complying with the incomprehensible. By the way, that’s why I code rel="nofollow crap". Standards should be clear and unambiguous. </rant>

If Google really were to introduce a “Noindex:” directive in robots.txt that equals “Disallow:”, that would be totally evil. A few sites out there might have an erroneous “Noindex:” statement in their robots.txt that could mean “Disallow:”, and it’s nice that Google tries to do them a favor. Screwing the REP for the sole purpose of complying with syntax errors on the other hand makes no sense. “Noindex” means crawl it, follow its links, but don’t index it. Semantically “Noindex: Nofollow:” equals “Disallow:”, but a “Noindex:” alone implies a “Follow:”, hence crawling is not only allowed but required.
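Put as robots.txt syntax, that’s the difference between these two hypothetical blocks (the paths are made up for illustration):

# crawl /stuff/, follow its links, just keep its pages off the SERPs:
Noindex: /stuff/

# don’t crawl /secret/ at all; semantically the same as Disallow: /secret/
Noindex: /secret/
Nofollow: /secret/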

I really hope that we watch an experiment in its early stage, and that Google will do the right thing eventually. Allowing the REP’s page specific crawler directives in robots.txt is a fucking brilliant move, because technically challenged publishers can’t handle the HTTP header’s X-Robots-Tag, and applying those directives to groups of URIs is a great method to steer crawling and indexing not only with static sites.

Dear Google engineers, please consider the nopreview directive too, and implement (no)index, (no)follow, noarchive, nosnippet, noodp/noydir and unavailable_after with the REP’s meaning. And while you’re at it, I want block level instructions in robots.txt too. For example
Area: /products/ DIV.hMenu,TD#bNav,SPAN.inherited "noindex,nofollow"

could instruct crawlers to ignore duplicated properties in product descriptions and the horizontal menu as well as the navigation elements in a table cell with the DOM-ID “bNav” at the very bottom of all pages in /products/,
Area: / A.advertising REL="nofollow"

could condomize all links with the class name “advertising”, and so on.

2007-11-29
The pages linked from the test pages still don’t come up in search results, noarchive.php was recrawled and remains cached, the cached copy of noindex-nofollow.php retrieved on 11/21/2007 reappeared (probably a DC roller coaster issue).

2007-11-30
Three URLs remain indexed: nopreview.pdf, noarchive.php and noindex-nofollow.php. The cached copies show the content crawled on Nov/21/2007. Everything else is deindexed. That’s probably not there to stay (index roller coaster).
As a side note: the URL from my first noindex-robots.txt test appeared in my GWC account under “URLs restricted by robots.txt (Nov/27/2007)”, three days after the unsuccessful crawl.

2007-12-02
A few pages were recrawled, Googlebot followed the hidden link in the PDF file.

2007-12-03
In my GWC crawl stats noindex-nofollow.php appeared under “URLs restricted by robots.txt”, but it’s still indexed.

2007-12-04
The preview (cache) of nopreview.pdf was updated. Since obviously Nopreview: doesn’t work, I’ve added
Noarchive: /repstuff/nopreview.pdf

to my robots.txt. Lets see whether Google removes the cache respectively the HTML preview or not.

2007-12-06
Shortly after the change in robots.txt (Noarchive: /repstuff/nopreview.pdf) Googlebot recrawled the PDF file on 12/04/2007. Today it’s still cached, the HTML preview is still available and linked from SERPs.

2007-12-07
Googlebot has recrawled a few pages. Everything except noarchive.php and nopreview.pdf is deindexed.

2007-12-17
I consider the test closed, but I’ll keep the test pages up so that you can monitor crawling and indexing yourself. Noindex: is the only directive that somewhat works, but it’s implemented completely wrong and is not acceptable in its current shape.

Interestingly the sitemaps report in my GWC account says that 9 pages from 9 submitted URLs were indexed. Obviously “indexed” means something like “crawled at least once, perhaps indexed, maybe not, so if you want to know that definitively then get your lazy butt to check the SERPs yourself”. How expensive would it be to tell something like “Total URLs in sitemap: 9 | Indexed URLs in sitemap: 2”?




Q&A: An undocumented robots.txt crawler directive from Google

Blogging should be fun every now and then. Today I don’t tell you anything new about Google’s secret experiments with the robots exclusion protocol. I ask you instead, because I’m sure you know your stuff. Unfortunately, the Q&A on undocumented robots.txt syntax from Google’s labs utilizes JavaScript, so perhaps it looks somewhat weird in your feed reader.

Q: Please look at this robots.txt file and figure out why it’s worth a Q&A with you, my dear reader:


User-Agent: *
Disallow: /
Noindex: /

Ok, click here to show the first hint.

I know, this one was a breeze, so here comes your challenge.
Q: Which crawler directive used in the robots.txt above was introduced in 1996 in the Robots Exclusion Protocol (REP), but was not defined in its very first version from 1994?

Ok, click here to show the second hint.

Congrats, you are smart. I’m sure you don’t need to lookup the next answers.
Q: Which major search engine has a team permanently working on REP extensions and releases those quite frequently, and who is the engineer in charge?

Ok, click here to show the third hint.

Exactly. Now we’ve gathered all the pieces of this robots.txt puzzle.
Q: Could you please summarize your cognitions and conclusions?

Ok, click here to show the fourth hint.

Thank you, dear reader! Now let’s see what we can dig out. If the appearance of a “Noindex:” directive in robots.txt is an experiment, it would make sense that Ms. Googlebot understands and obeys it. Unfortunately, I sold all the source code I’ve stolen from Google and didn’t keep a copy for myself, so I need to speculate a little.

Last time I looked, Google’s cool robots.txt validator merely emulated crawler behavior instead of running the crawler’s actual code, which means the crawlers understood syntax the validator didn’t handle correctly. Maybe this was changed in the meantime, perhaps the validator pulls its code from the “real thing” now, or at least the “Noindex:” experiment may have found its way into the validator’s portfolio. So I thought that testing the newish robots.txt statement “Noindex:” in the Webmaster Console is worth a try. And yes, it told me that Googlebot understands this command, and interprets it as “Disallow:”.
Blocked by line 27: Noindex: /noindex/

Since validation is no proof of crawler behavior, I’ve set up a page “blocked” with a “Noindex:” directive in robots.txt and linked it in my sidebar. The noindex statement was in place long enough before I’ve uploaded and linked the spider trap, so that the engines shouldn’t use a cached robots.txt when they follow my links. My test is public, feel free to check out my robots.txt as well as the crawler log.

While I’m waiting for the expected growth of my noindex crawler log, I’m speculating. Why the heck would Google use a new robots.txt directive which behaves like the good old Disallow: statement? Makes no sense to me.

Lets not forget that this mysterious noindex statement was discovered in the robots.txt of Google’s ad server, not in the better known and closely watched robots.txt of google.com. Google is not the only search engine trying to better understand client sided code. None of the major engines should be interested in crawling ads for ranking purposes. The MSN/LiveSearch referrer spam fiasco demonstrates that search engine bots can fetch and render Google ads outputted in iFrames on pagead2.googlesyndication.com.

Since until today nobody else supports Google’s X-Robots-Tag (sending “noindex” and other REP directives in the HTTP header), maybe the engines have a silent deal that content marked with “Noindex:” in robots.txt shouldn’t be indexed. Microsoft’s bogus spam bot, which doesn’t bother with robots.txt because it somewhat haplessly tries to emulate a human surfer, is not considered a crawler; its existence just proves that “software shop” is not a valid label for M$.

This theory has a few weak points, but it could point to something. If noindex in robots.txt really prevents indexing of content crawled by accident, or of non-HTML content that can’t supply robots meta tags, that would be a very useful addition to the robots exclusion protocol. Of course we’d then need Noarchive:, Nofollow: and Nopreview: too, probably more, but I’m not really in a greedy mood today.

Back to my crawler trap. Refreshing the log reveals that 30 minutes after spreading links pointing to it, Googlebot has fetched the page. That seems to prove that the Noindex: statement doesn’t prevent crawling, regardless of the false (?) information handed out by Google’s robots.txt validator.

(Or didn’t I give Ms. Googlebot enough time to refetch my robots.txt? Dunno. The robots.txt copy in my Google Webmaster Console still doesn’t show the Noindex: statement, but I doubt that’s the version Googlebot uses, because according to the last-downloaded timestamp in GWC the robots.txt had already been changed at the time of the download. Never mind. If I was way too impatient, I still can test whether a newly discovered noindex directive in robots.txt actually deindexes stuff or not.)

On with the show. The next interesting question is: Will the crawler trap page make it in Google’s search index? Without the possibly non-effective noindex directive a few hundred links should be able to accomplish that. Alas, a quoted search query delivers zilch so far.

Of course I’ve asked Google for more information, but haven’t received a conclusive answer so far. While waiting for an official statement, I take a break from live blogging this quick research in favor of terrorizing a few folks with disrespectful blog comments. Stay tuned. Be right back.


Well, meanwhile I had dinner, the kids fell asleep –hopefully until tomorrow morning–, but nothing else happened. A very nice and friendly Googler tries to find out what the noindex in robots.txt fuss is all about, thanks and I can’t wait! However, I suspect the info is either forgotten or deeply buried in some well secured top secret code libraries, hence I’ll push the red button soon.


Thanks to Google’s great Webmaster Central team, especially Susan, I learned that I was flogging a dead horse. Here is Google’s take on Noindex in robots.txt:

As stated in my previous note, I wasn’t aware that we recognized any directives other than Allow/Disallow/Sitemap, so I did some asking around.

Unfortunately, I don’t have an answer that I can currently give you. […] I can’t contribute any clarifications right now.

Thank you Susan!

Update: John Müller from Google has just confirmed that their crawler understands the Noindex: syntax, but it’s not yet set in stone.




Act out your sophisticated affiliate link paranoia

[Image: GOOD: paranoid affiliate link]
My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining possible protection levels could make sense. Google’s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:

Links purchased for advertising should be designated as such. This can be done in several ways, such as:
* Adding a rel=”nofollow” attribute to the <a> tag
* Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file

Also, Google doesn’t define paid links that clearly, so try this paid link definition instead before you read on. Here is my linking guide for the paranoid affiliate marketer.

Google recommends hiding any content provided by affiliate programs from their crawlers. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant’s data feed too. Linked graphics along with text links, testimonials and whatnot copied from an affiliate program’s sales tools page count as duplicate content (snippets) in the worst case.

Pasting code copied from a merchant’s site into a page’s or template’s HTML is not exactly a smart way to place ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you’re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts.

There are tons of such scripts out there, so I don’t post a complete solution, but just the code which saves your ass when a search engine hating your ads and paid links comes by. To keep it simple and stupid my code snippets are mostly taken from this blog, so when you’ve a WordPress blog you can adapt them with ease.

Cover your ass with a linking policy

Googlers as well as hired guns do review Web sites for violations of Google’s guidelines; also, competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) full disclosure of your linking attitude can help to pass a human review by search engine staff. By the way, having a policy for dofollowed blog comments is also a good idea.

Since crawler directives like link condoms are for search engines (only), and those pay attention to your source code as well as to hints addressing search engines like robots.txt, you should leave a note there too; look into the source of this page for an example. View sample HTML comment.
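The actual note in this page’s source hides behind the JavaScript link above. A note of that sort could look roughly like this (wording and policy URL invented for illustration):

<!-- Note to search engine staff reviewing this site:
     all ads and paid links are machine-readable disclosed on crawler
     requests (rel-nofollow respectively uncrawlable ad scripts).
     Details: see the linking policy at http://example.com/linking-policy -->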

Block crawlers from your propaganda scripts

Put all your stuff related to advertising (scripts, images, movies…) in a subdirectory and disallow search engine crawling in your /robots.txt file:
User-agent: *
Disallow: /propaganda/

Of course you’ll use an innocuous name like “gnisitrevda” for this folder, which lacks a default document and can’t be browsed because you’ve an
Options -Indexes

statement in your .htaccess file. (Watch out, Google knows what “gnisitrevda” means, so be creative or cryptic.)

Crawlers sent out by major search engines do respect robots.txt, hence it’s guaranteed that regular spiders don’t fetch it. As long as you don’t cheat too much, you’re not haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn’t keep surfing search engine staff out, but I won’t tell you about things you’d better hide from Matt’s gang.

Detect search engine crawlers

Basically there are three common methods to detect requests by search engine crawlers.

  1. Testing the user agent name (HTTP_USER_AGENT) for strings like “Googlebot”, “Slurp”, “MSNbot” or so which identify crawlers. That’s easy to spoof, for example PrefBar for FireFox lets you choose from a list of user agents.
  2. Checking the user agent name, and only when it indicates a crawler, verifying the requestor’s IP address with a reverse lookup, respectively against a cache of verified crawler IP addresses and host names.
  3. Maintaining a list of all search engine crawler IP addresses known to man, checking the requestor’s IP (REMOTE_ADDR) against this list. (That alone isn’t bullet-proof, but I’m not going to write a tutorial on industrial-strength cloaking IP delivery, I leave that to the real experts.)

For our purposes we use methods 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler as user agent name. Some affiliate programs won’t activate your account without testing your links. When crawlers try to follow affiliate links on the other hand, you need to verify their IP addresses for two reasons. First, you should be able to upsell spoofing users too. Second, if you allow crawlers to follow your affiliate links, this may have an impact on the merchants’ search engine rankings, and that’s evil in Google’s eyes.

We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name if the user agent name identifies a major search engine’s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor’s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does primitive caching in a flat file, so that once a crawler has been verified on its very first content request, it can be detected from this cache, avoiding pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn’t match a crawler name.

View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)
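If you can’t (or don’t want to) grab the script, here’s a stripped down sketch that illustrates the idea. It’s not the full source running on this server; the host name suffixes, the cache path and the log path are simplified examples:

<?php
// Stripped-down sketch of the two helpers, not the full script.
// Host suffixes, cache path and log path are simplified examples.
$crawlerHosts = array(
    'googlebot' => '.googlebot.com',
    'slurp'     => '.crawl.yahoo.net',
    'msnbot'    => '.search.msn.com',
);

// TRUE if the user agent name looks like a major crawler; on success the
// expected host name suffix is stored in the global $expectedCrawlerHost.
function checkCrawlerUA () {
    global $crawlerHosts, $expectedCrawlerHost;
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
    foreach ($crawlerHosts as $botName => $hostSuffix) {
        if (strpos($ua, $botName) !== FALSE) {
            $expectedCrawlerHost = $hostSuffix;
            return TRUE;
        }
    }
    return FALSE;
}

// TRUE only if the user agent claims to be a crawler AND the IP address
// verifies via reverse/forward DNS lookups. Verified IPs are cached in a
// flat file so the slow lookups run only on a crawler's first request.
// $logString is whatever you want to see in the crawler log.
function checkCrawlerIP ($logString) {
    global $expectedCrawlerHost;
    if (!checkCrawlerUA()) return FALSE;  // humans and spoofers: no DNS lookups
    $ip = $_SERVER['REMOTE_ADDR'];
    $cacheFile = '/path/to/verified-crawler-ips.txt';
    $logFile   = '/path/to/crawler.log';
    $verified  = FALSE;
    $cachedIps = file_exists($cacheFile) ? file($cacheFile, FILE_IGNORE_NEW_LINES) : array();
    if (in_array($ip, $cachedIps)) {
        $verified = TRUE;
    }
    else {
        $host = gethostbyaddr($ip);  // reverse lookup
        if (substr($host, -strlen($expectedCrawlerHost)) === $expectedCrawlerHost
         && gethostbyname($host) === $ip) {  // forward confirmation
            $verified = TRUE;
            @file_put_contents($cacheFile, "$ip\n", FILE_APPEND);
        }
    }
    @error_log(date('Y-m-d H:i:s') . " $ip " . ($verified ? 'crawler' : 'fake') . " $logString\n", 3, $logFile);
    return $verified;
}
?>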

Grab and implement the PHP source, then you can code statements like
$isSpider = checkCrawlerUA ();
...
if ($isSpider) {
$relAttribute = " rel=\"nofollow\" ";
}
...
$affLink = "<a href=\"$affUrl\" $relAttribute>call for action</a>";

or
$isSpider = checkCrawlerIP ($sponsorUrl);
...
if ($isSpider) {
// don't redirect to the sponsor, return a 403 or 410 instead
}

More on that later.

Don’t deliver your advertising to search engine crawlers

It’s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense’s script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind, though, that hiding all promotional links and related content can’t guarantee indexing: Google doesn’t index shitloads of templated pages which hide duplicate content as well as ads from crawling but don’t carry a single piece of somewhat compelling content.

Here is how you could output a totally uncrawlable banner ad:
...
$isSpider = checkCrawlerIP ($PHP_SELF);
...
print "<div class=\"css-class-sidebar robots-nocontent\">";
// output RSS buttons or so
if (!$isSpider) {
    print "<script type=\"text/javascript\" src=\"http://sebastians-pamphlets.com/propaganda/output.js.php?adName=seobook&adServed=banner\"></script>";
    ...
}
...
print "</div>\n";
...

Let’s look at the code above. First we detect crawlers “without doubt” (well, in some rare cases it can still happen that a suspected Yahoo crawler comes from a non-’.crawl.yahoo.net’ host but from another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST, and I’ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler-UA).

Currently the robots-nocontent class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn’t conflict with class names used with your CSS, because each X/HTML element can have an unlimited list of space delimited class names. Like Google’s section targeting that’s a crappy crawler directive, though. However, it doesn’t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodetect and pings to submit), “buy now”/”view basket” links or references to TOS pages and the like, templated text like terms of delivery (but not the street address provided for local search) … and of course ads.

Ads aren’t outputted when a crawler requests a page. Of course that’s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren’t relevant for indexing, that’s not deceitful cloaking in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster’s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers.

Delivering user specific contents in general is fine with the engines, for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that, aren’t penalized. Since Web robots can’t pull out the plastic, there’s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (Example from Google: “Allow search bots to crawl your sites without session IDs or arguments that track their path through the site” is a call for search engine friendly URL cloaking.)

Is hiding ads from crawlers “safe with Google” or not?

[Image: BAD: uncloaked affiliate link]
Cloaking ads away is a double edged sword from a search engine’s perspective. Way too strictly interpreted that’s against the cloaking rule which states “don’t show crawlers other content than humans”, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google’s section targeting or Yahoo’s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don’t count them as on-the-page contents.

Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don’t render them, they just notice the number of script calls and their placement (at least if above the fold) to identify MFA pages. In general, they ignore ads as well as other content outputted with client sided scripts or hybrid technologies like AJAX, at least when it comes to rankings.

Since in theory the contents of JavaScript ads aren’t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can’t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising etc. vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can’t argue seriously that hiding such tiny signals –which they use for the sole purpose of possible downranks– is against their guidelines.

For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve, that doesn’t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.

Dear search engines, if it does not count, then you cannot demand to keep it crawlable. Well, a few super mega white hat trolls might disagree, and depending on the implementation on individual sites maybe hiding ads isn’t totally riskless in any case, so decide yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don’t try to hide the fact that I run ads on this blog.

Usually I don’t argue with fair vs. unfair, because we talk about war business here, which means that anything goes. However, Google does everything to talk the whole Internet into obfuscating respectively disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client sided as well as server sided delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. ;) (Of course that’s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won’t care about bad press. So please don’t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)

Outputting ads with JavaScript, preferably in iFrames

Delivering adverts with JavaScript does not mean that one can’t use server sided scripting to adjust them dynamically. With content management systems it’s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn’t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly, that is single quotes are converted to fancy symbols which break eval’ing the PHP code).

Lets see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:

And here is the JS call of the PHP script:
<script type="text/javascript" src="http://sebastians-pamphlets.com/propaganda/output.js.php? adName=seobook&adServed=banner"></script>

The PHP script /propaganda/output.js.php evaluates the query string to pull the requested ad’s components. In case it’s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor’s location and whatnot, but that’s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad’s links. Once the HTML code is ready, it outputs a JavaScript statement:
document.write('<a href="http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&adServed=banner" title="DOWNLOAD THE BOOK ON SEO!"><img src="http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif" width="468" height="60" border="0" alt="The only current book on SEO" title="The only current book on SEO" /></a>');
which the browser executes within the script tags (replace single quotes in the HTML code with double quotes). A static ad for surfers using ancient browsers goes into the noscript tag.
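For the sake of completeness, here’s a boiled down sketch of such an output script. It’s not the script running here (the real thing pulls its stuff from a database and handles a lot more cases); the ad array and the inlined user agent check are simplified examples:

<?php
// /propaganda/output.js.php - boiled-down example, not the real thing
header('Content-Type: text/javascript');

// normally this would live in a database or config file
$ads = array(
    'seobook' => array(
        'banner' => array(
            'href' => 'http://sebastians-pamphlets.com/propaganda/router.php?adName=seobook&adServed=banner',
            'img'  => 'http://sebastians-pamphlets.com/propaganda/seobook/468-60.gif',
            'alt'  => 'The only current book on SEO',
        ),
    ),
);

$adName   = isset($_GET['adName'])   ? $_GET['adName']   : '';
$adServed = isset($_GET['adServed']) ? $_GET['adServed'] : '';

if (!isset($ads[$adName][$adServed])) {
    exit; // expired or unknown ad: output nothing, or pick an alternative here
}
$ad = $ads[$adName][$adServed];

// condomize the link only when a crawler's user agent requests the script
// (use checkCrawlerUA() from above; inlined here to keep the example self-contained)
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$relAttribute = preg_match('/googlebot|slurp|msnbot/i', $ua) ? ' rel="nofollow"' : '';

$html = '<a href="' . $ad['href'] . '"' . $relAttribute . '>'
      . '<img src="' . $ad['img'] . '" width="468" height="60" border="0" alt="' . $ad['alt'] . '" />'
      . '</a>';

// emit the document.write() statement; escape single quotes in the HTML
print "document.write('" . str_replace("'", "\\'", $html) . "');";
?>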

Matt Cutts said that JavaScript links don’t prevent Googlebot from crawling, but that those links don’t count for rankings (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can’t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it’s fetchable by crawlers, so I wouldn’t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That’s why I use rel-nofollow –on crawler requests– even in JS ads.

Change your user agent name to Googlebot or so, install Matt’s show nofollow hack or something similar, and you’ll see that the affiliate-URL gets nofollow’ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can’t offend my readers with nofollow’ed links (although I don’t care about crawler spoofing, actually that’s a good procedure to let advertisers check out my linking attitude).

We look at the affiliate URL from the code above later on, first lets discuss other ways to make ads more search engine friendly. Search engines don’t count pages displayed in iFrames as on-page contents, especially not when the iFrame’s content is hosted on another domain. Here is an example straight from the horse’s mouth:
<iframe name="google_ads_frame" src="http://pagead2.googlesyndication.com/pagead/ads? very-long-and-ugly-query-string" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" frameborder="0" height="90" scrolling="no" width="728"></iframe>
In a noframes tag we could put a static ad for surfers using browsers which don’t support frames/iFrames.

If for some reasons you don’t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to anybody, and just browsers requesting a page will render the ads. Example: any sort of text or html block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots, can be run through Michael’s Full Text/HTML Obfuscator Tool (hat tip to Donna).

Always redirect to affiliate URLs

There’s absolutely no point in using ugly affiliate URLs on your pages. Actually, that’s the last thing you want to do for various reasons.

  • For example, affiliate URLs as well as source codes can change, and you don’t want to edit tons of pages if that happens.
  • When an affiliate program doesn’t work for you, goes belly up or bans you, you need to route all clicks to another destination the moment the shit hits the fan. In an ideal world, you’d replace outdated ads completely with one mouse click or so.
  • Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own –often confusing– layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don’t track your outgoing traffic, some sponsors will cheat and you can’t prove your gut feelings.
  • Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don’t smell like affiliate URLs.

When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really manage your affiliate links. There are many more good reasons for utilizing ad-servers, for example smart search engines which might think that your advertising is overwhelming.

Affiliate links provide great footprints. Unique URL parts respectively query string variable names gathered by Google from all affiliate programs out there are one clear signal they use to identify affiliate links. The values identify the single affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index respectively delivering minus points to the query engine. Also, that doesn’t mean that affiliate URLs are the one and only trackable footprint Google relies on. But that’s one trackable footprint you can avoid to some degree.

If the redirect-script’s location is on the same server (in fact it’s not, thanks to symlinks) and it’s not named “adserver” or so, chances are that a heuristic check won’t identify the link’s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren’t crawlable with parameters, or similar “legit” stuff. However, you can’t fool smart algos forever, but if you’ve a good reason to hide ads, every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine’s point of view, and users might agree on this).

Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because that doesn’t benefit you (usually). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant’s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect if they’ve found an affiliate link without a link condom by accident.

The behavior of an adserver URL masking an affiliate link

Lets look at the redirect-script’s URL from my code example above:
/propaganda/router.php?adName=seobook&adServed=banner
On request of router.php the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under title and URL of the page carrying the advert.
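The logging part is nothing fancy. A minimal version could be as simple as this snippet at the top of router.php (the tab separated flat file and its path are just an example, a database table works as well):

// minimal click logging in router.php - flat file example
$adName   = isset($_GET['adName'])   ? $_GET['adName']   : 'unknown';
$adServed = isset($_GET['adServed']) ? $_GET['adServed'] : 'unknown';
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
// the page title can be looked up from the referrer URL later, at reporting time
$logLine = implode("\t", array(date('Y-m-d H:i:s'), $adName, $adServed, $referrer)) . "\n";
@file_put_contents('/path/to/propaganda-clicks.log', $logLine, FILE_APPEND);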

Now that we’ve covered the statistical requirements, router.php calls the checkCrawlerIP() function, setting $isSpider to TRUE only when both the user agent as well as the host name obtained by a reverse DNS lookup of the requestor’s IP address identify a search engine crawler, and a forward lookup of that host name resolves back to the requestor’s IP addy.

If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor’s landing page:
$sponsorUrl = "http://www.seobook.com/262.html";
$requestProtocol = $_SERVER["SERVER_PROTOCOL"];
$protocolArr = explode("/", $requestProtocol);
$protocolName = trim($protocolArr[0]);
$protocolVersion = trim($protocolArr[1]);
if (stristr($protocolName, "HTTP")
 && strtolower($protocolVersion) > "1.0") {
    $httpStatusCode = 307;
}
else {
    $httpStatusCode = 302;
}
$httpStatusLine = "$requestProtocol $httpStatusCode Temporary Redirect";
@header($httpStatusLine, TRUE, $httpStatusCode);
@header("Location: $sponsorUrl");
exit;

A 307 redirect avoids caching issues, because 307 redirects must not be cached by the user agent. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. User agents can cache the locations provided by 302 redirects, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can’t use the 301 response code, because 301 redirects are always cachable. (More information on HTTP redirects.)

If the requestor is a major search engine’s crawler, we perform the most brutal bounce back known to man:
if ($isSpider) {
    @header("HTTP/1.1 403 Sorry Crawlers Not Allowed", TRUE, 403);
    @header("X-Robots-Tag: nofollow,noindex,noarchive");
    exit;
}

The 403 response code translates to “kiss my ass and get the fuck outta here”. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn’t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine “forget this URL, don’t request it again”. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code prevent URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response, because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.

If you don’t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they otherwise would count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don’t need to live in fear of penalties.

[Image: NICE: prospering affiliate link]
Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.

Don’t deliver different (editorial) contents to users and crawlers, but also don’t serve ads to crawlers. They just don’t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards able to understand Ajax, respectively authorized to fill out and submit Web forms.

Your ads look plain ugly with dotted borders in firebrick, hence don’t apply rel=”nofollow” to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.

No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers.

Do not allow search engine crawlers to follow your affiliate links, paid links, nor other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google’s job is not exactly funny.

I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you’re not paranoid, that does not mean Google will not chase you down. You really don’t need to be a so called black hat to displease Google. Not knowing respectively not understanding Google’s 12 commandments doesn’t prevent you from being spanked for sins you’ve never heard of. If you’re keen on Google’s nicely targeted traffic, better play by Google’s rules, leastwise on crawler requests.

Feel free to contribute your tips and tricks in the comments.



