Archived posts from the 'XML-Sitemaps' Category

Get a grip on the Robots Exclusion Protocol (REP)

Thanks to the very nice folks over at SEOmoz I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round up:

Robots Exclusion Protocol 101

My REP 101 links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of each REP component. It explains the difference between crawler directives and indexer directives, and which command hierarchy search engines follow when REP directives placed at different levels conflict.

Why do I think that solid REP knowledge is important right now? Not only because of the confusion that exists thanks to the volume of crappy advice provided at every Webmaster hangout. Of course understanding the REP makes webmastering easier, so I’m glad when my REP related pamphlets are considered somewhat helpful.

I’ve a hidden agenda, though. I predict that the REP is going to change shortly. As usual, its evolvement is driven by a major search engine, since the W3C and such organizations don’t bother with the conglomerate of quasi standards and RFCs known as the Robots Exclusion Protocol. In general that’s not a bad thing. Search engines deal with the REP every day, so they have a legitimate interest.

Unfortunately, not every REP extension that search engines have invented so far is useful for Webmasters; some of them are plain crap. Learning from the fiascos and riots of the past, the engines are well advised to ask Webmasters for feedback before they announce further REP directives.

I’ve a feeling that shortly a well known search engine will launch a survey regarding particular REP related ideas. I want Webmasters to be well aware of the REP’s complexity and functionality when they contribute their take on REP extensions. So please educate yourself. :)

My pamphlet discussing a possible standardization of REP tags as robots.txt directives could be a useful reference, also please watch the great video here. ;)


My plea to Google - Please sanitize your REP revamps

Standardization of REP tags as robots.txt directives

This draft is kind of a request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains its own REP standard). It’s based on, and extends, the robots.txt specifications from 1994 and 1996, as well as additions supported by all major search engines. Furthermore it considers work in progress leaked out from Google.

In the following I’ll try to define a few robots.txt directives that Webmasters really need.


Currently Google experiments with new robots.txt directives, that is REP tags like “noindex” adapted for robots.txt. That’s a welcome and brilliant move.

Unfortunately, they got it totally wrong, again. (Skip the longish explanation of the rel-nofollow fiasco and my rant on Google’s current robots.txt experiments.)

Google’s last try to enhance the REP by adapting a REP tag’s value at another level was a miserable failure. Not because crawler directives on link level are a bad thing, the opposite is true, but because the implementation of rel-nofollow confused the hell out of Webmasters, and still does.

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Don’t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster’s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.

Recap: When Google launched rel-nofollow in 2005, they promoted it as a tool to fight comment spam.

From now on, when Google sees the attribute (rel=”nofollow”) on hyperlinks, those links won’t get any credit when we rank websites in our search results. This isn’t a negative vote for the site where the comment was posted; it’s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.

Technically speaking, this translates to “search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, or other link juice originating from rel-nofollow links”. Rel="nofollow" meant rel="pass-no-reputation".
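For illustration, here is what a condomized link looks like in blog comment markup (placeholder URL):
<a href="http://example.com/" rel="nofollow">visit my casino</a>
The crawler may still fetch http://example.com/, but the link is supposed to pass no reputation to it.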

All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn’t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that hadn’t updated to a software version supporting rel-nofollow.

Experiment failed, so move on to better solutions like Akismet, captchas or ajax’ed comment forms? Nope, it’s not that easy. Google had a hidden agenda. Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular bandwagon. In 2005 Google had already mastered the guestbook spam problem. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more.

Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren’t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. Google actually invented rel-nofollow to get a grip on paid links. Recently they announced that Googlebot no longer follows condomized links (pre-Bigdaddy Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold war on paid links became hot.

Of course the sneaky morphing of rel-nofollow from “pass no reputation” to a full blown “nofollow” is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links.

To be fair, Danny Sullivan said twice that rel-nofollow is Dave Winer’s fault, and that Google, as the victim, is not to blame.

Rel-nofollow is settled now. However, I don’t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:

Google’s “Noindex: in robots.txt” experiment

Google supports a Noindex: directive in robots.txt. It seems Google’s Noindex: blocks crawling like Disallow:, but additionally prevents URLs blocked with Noindex: both from accumulating PageRank and from being indexed based on 3rd party signals like inbound links.

This functionality would be nice to have, but accomplishing it with “Noindex” is badly wrong. The REP’s “Noindex” value without an explicit “Nofollow” means “crawl it, follow its links, but don’t list it on SERPs”. With page-level directives (robots meta tags and X-Robots-Tags) Google handles “Noindex” exactly as defined, that is with an implicit “Follow”. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the “Follow” out of the game, because a search engine can’t follow links from uncrawled documents.
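To illustrate the clash with a minimal sketch (the robots.txt lines below use Google’s experimental syntax, not a settled standard):
<meta name="robots" content="noindex" />
tells engines to crawl the page, follow its links, and just keep it off the SERPs, whereas
User-agent: Googlebot
Noindex: /printable/
in robots.txt, as Google currently interprets it, keeps /printable/ from being crawled at all, so its links can never be followed.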

Webmasters will not understand that “Noindex” then means totally different things in robots.txt and in meta tags. Also, this approach steals granularity that we need, for example for use with technically structured sitemap pages and other hubs.

According to Google their current interpretation of Noindex: in robots.txt is not yet set in stone. That means there’s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters.

Dear Googlers, don’t take the above as Google bashing. I know, and often wrote, that Google is the search engine that puts the most effort into boring tasks like REP evolvement. I just think that a dog company like Google needs to bring real-world Webmasters into the boat when playing with standards like the REP, for the sake of the cats. ;)

Recap: Existing robots.txt directives

The /path example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server’s root. Patterns can be useful to set crawler directives for a bunch of URIs:

  • *: any string in path or query string, including the query string delimiter “?”; multiple wildcards should be allowed.
  • $: end of URI
  • Trailing /: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectories’ files etc., for example
    • Disallow: /path/
      matches /path/index.html but not /path.html
    • /path
      matches both /path/index.html and /path.html, as well as /path_1.html. It’s a pretty common mistake to “forget” the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.

Please note that patterns aren’t supported by all search engines; for example MSN supports only file extensions (yet?).
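A quick sketch putting these patterns to work (pattern support varies, so check each engine’s documentation before you rely on it):
User-agent: *
# block any URL carrying a session ID in its query string
Disallow: /*sessionid=
# block all PDF files, wherever they live
Disallow: /*.pdf$
# block the /print/ directory and everything below it
Disallow: /print/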

User-agent: [crawler name]
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the User-agent: * section that addresses all Web robots. Each User-agent: section must be terminated with at least one empty line.
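For example, in the snippet below Googlebot obeys only its own section, while all other crawlers fall back to the * section:
User-agent: Googlebot
Disallow: /not-for-google/

User-agent: *
Disallow: /not-for-bots/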

Disallow: /path
Prevents crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow’ed URLs can gather PageRank.

Allow: /path
Refines previous Disallow: statements. For example
Disallow: /scripts/
Allow: /scripts/page.php

tells crawlers that they may fetch /scripts/page.php, but not any other URL in /scripts/.

Sitemap: [absolute URL]
Announces XML sitemaps to search engines. Example (placeholder URLs):
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/video-sitemap.xml

points all search engines that support Google’s Sitemaps Protocol to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn’t replace sitemap submissions. Google, Yahoo and MSN provide Webmaster Consoles where you not only can submit your sitemaps, but also follow the indexing process (wishful thinking WRT particular SEs). In some cases it might be a bright idea to avoid the default file name “sitemap.xml” and keep the sitemap URLs out of robots.txt; sitemap autodiscovery is not for everyone.

Recap: Existing REP tags

REP tags are values that you can use in a page’s robots meta tag and X-Robots-Tag. Robots meta tags go to the HTML document’s HEAD section
<meta name="robots" content="noindex, follow, noarchive" />

whereas X-Robots-Tags supply the same information in the HTTP header
X-Robots-Tag: noindex, follow, noarchive

and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.
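For example, non-HTML resources can get their X-Robots-Tag from the server configuration; here is a sketch for Apache, assuming mod_headers is enabled:
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>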

    Widely supported REP tags are:

  • INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not
  • FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not
  • NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.
  • NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.
  • NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
  • NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
  • UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes a URL from Google’s search index a day after the given date/time

Problems with REP tags in robots.txt

REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but it also comes with more complexity and a few pitfalls.

  • Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can’t obey REP tags without crawling the resource that supplies them. That means that not a single REP tag put as a robots.txt statement shall be misunderstood as a crawler directive.

    For example Noindex: /path must not block crawling, not even in combination with Nofollow: /path, because there’s still the implicit “archive” (= absence of Noarchive: /path). Providing a cached copy even of a not indexed page makes sense for toolbar users.

    Whether or not a search engine actually crawls a resource that’s tagged with “noindex, nofollow, noarchive, nosnippet” or so is up to the particular SE, but none of those values implies a Disallow: /path.

  • Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says “follow” for all links on a page, the crawler will not follow a link that is condomized with rel=”nofollow”.

    Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Of course not in every case. Robots.txt is the gatekeeper, and, so to say, the “highest REP instance”. Actually, there’s no absolute answer to this question that satisfies everybody.

    A Webmaster sitting on a huge conglomerate of legacy code may want to switch entirely to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page’s “index,follow” from 1998 most probably doesn’t fly with 2008’s duplicate content filters and highly sophisticated link pattern analytics.

    The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.

  • REP tags used in robots.txt might refine crawler directives. For example a disallow’ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank calculation and indexing for uncrawlable resources (see below under Noodp:/Noydir:, Noindex: and Norank:).

    Google’s current approach of handling this with the Noindex: directive alone is not acceptable; we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.

  • In theory it makes no sense to maintain a directive that describes a default behavior. But why does the REP have “follow” although the absence of “nofollow” perfectly expresses “follow”? Because of the way non-geeks think (try to explain to a non-geek why the value nil/null doesn’t equal empty/zero/blank. Not!).

    Implicit directives that aren’t explicitly named and described in the rules don’t exist for the masses. Even in the 10 commandments someone had to write “thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack…” instead of a no-brainer like “publish unique and compelling content for people and make your stuff crawlable”. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow:, respectively the absence of Disallow:), folks will whine, rant and argue when search engines ignore their stuff.

    Obviously we need at least Index:, Follow: and Archive: to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like
    Disallow: /cgi-bin/
    Disallow: /content/
    Follow: /cgi-bin/redirect.php
    Follow: /content/links/
    Index: /content/articles/

    without Allow: /content/links/, Allow: /content/articles/ and Allow: /cgi-bin/redirect.php.

    Whether or not indexer directives that require crawling can overrule the crawler directive Disallow: is open for discussion. I vote for “not”.

  • Applying REP tags on site-level would be great, but it doesn’t solve other problems like the need for directives on block and element level. Neither Google’s section targeting nor Yahoo’s robots-nocontent class name is an acceptable tool capable of instructing search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).

    Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that’s doable, Webmasters will make use of such tools to improve their rankings, and not only to comply with the ever changing search engine policies that cost the Webmaster community billions of man hours each year.

    I consider these robots.txt statements sexy:
    Nofollow a.advertising, div#adblock, span.cross-links: /path
    Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path

    but that’s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.

Remember the rel-nofollow fiasco, where a REP tag used on HTML element level produced so much confusion and conflict. Let’s learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it’s clear and unambiguous.

Priority settings

The REP’s command hierarchy must be well defined:

  1. robots.txt
  2. Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.
  3. [Future block level directives]
  4. Element level directives like rel-nofollow

That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt, as long as the Webmaster doesn’t revert this default:

Priority-page-level: /path
Default behavior: directives in robots meta tags overrule robots.txt statements. Necessary to reset previous Priority-site-level: statements.

Priority-site-level: /path
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.

Priority-site-level All: /path
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The “All” parameter would even dofollow nofollow’ed links when the robots.txt lacks corresponding Nofollow: statements.
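A sketch of how these proposed priority statements could play together (none of this is supported syntax today):
User-agent: *
# let robots.txt win over the stale meta tags in the legacy area
Priority-site-level: /legacy/
Noindex: /legacy/
# restore the default for the relaunched section
Priority-page-level: /relaunch/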

Noindex: /path

Follow outgoing links, archive the page, but don’t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.

[Currently Google doesn’t crawl Noindex’ed URLs and most probably those can’t accumulate PageRank, hence URLs in /path can’t distribute PageRank. That’s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there’s no explicit Nofollow:, nor a “nofollow” meta tag respectively X-Robots-Tag.]

Norank: /path

Prevents URLs from accumulating PageRank, anchor text, and whatever link juice.

Makes sense to refine Disallow: statements in company with Noindex: and Noodp:/Noydir:, or to prevent TOS/contact/privacy/… pages and the like from sucking PageRank (nofollow’ing TOS links and stuff like that to control PageRank flow is error-prone).
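A sketch using the proposed syntax:
User-agent: *
# boilerplate pages shouldn't soak up PageRank, but may stay crawlable and indexable
Norank: /tos/
Norank: /privacy/
# refine an existing crawler directive for printer friendly pages
Disallow: /printable/
Noindex: /printable/
Norank: /printable/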

Nofollow: /path

The uber-link-condom. Don’t use outgoing links, not even internal links, for discovery crawling. Don’t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).

Noarchive: /path

Don’t make a cached copy of the resource available to searchers.

Nosnippet: /path

List the resource with linked page title on SERPs, but don’t create a text snippet, and don’t reprint the description meta tag.

[Why don’t we have a REP tag saying “use my description meta tag or nothing”?]

Nopreview: /path

Don’t create/link an HTML preview of this resource. That’s interesting for subscription sites and applies mostly to PDFs, Word documents, spreadsheets, presentations, and other non-HTML resources. More information here.

Noodp: /path

Don’t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn’t supply its own title/meta description.

Noydir: /path

I’m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: “Don’t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.”

Unavailable_after [date]: /path

Deindex the resource the day after [date]. The parameter [date] can be given in any date or date/time format; if it lacks a timezone, GMT is assumed.

[Google’s RFC 850 obsession is somewhat weird. There are many ways to put a timestamp other than “25-Aug-2007 15:00:00 EST”.]

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

In the search index remove the unwanted variable/value pair(s) from the URL’s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this “bare URL” redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.

Regardless of whether the variable name or the variable’s value matches the pattern, “Truncate-*” statements remove a complete argument from the query string, that is &variable=value. If after the (last) truncate operation the query string is empty, the query string delimiter “?” (question mark) must be removed too.
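An illustration of the intended behavior (hypothetical URLs and parameter names):
Truncate-variable sessionid: /
would index http://example.com/page.php?sessionid=abc123&color=red as http://example.com/page.php?color=red, and http://example.com/page.php?sessionid=abc123 simply as http://example.com/page.php.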

Order-arguments [charset]: /path

Sort the query strings of all dynamic URLs by variable name, then within the ordered variables by their values. Pick the first URL from each set of identical results as canonical URL. Transfer PageRank etcetera from all dupes to the canonical URL.
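For illustration (hypothetical URLs):
Order-arguments utf-8: /
would collapse /page.php?size=9&color=red and /page.php?color=red&size=9 into one canonical URL, /page.php?color=red&size=9, and transfer the link juice of the dupes to it.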

Lots of sites out there were developed by coders who are utterly challenged by all things SEO. Most Web developers don’t even know what URL canonicalization means. Those sites suffer from tons of URLs that all serve identical content, just because the query string arguments are put in random order, usually inventing a new sequence for each script, function, or include file. Of course most search engines run highly sophisticated URL canonicalization routines to prevent their indexes from too much duplicate content, but those algos can fail because every Web site is different.

I can totally resist suggesting a Canonical-uri /: /Default.asp statement that gathers all IIS default-document-URI maladies. Also, case issues shouldn’t get fixed with Case-insensitive-uris: / but by the clueless developers in Redmond.

Will all this come true?

Well, Google has silently started to support REP tags in robots.txt; it totally makes sense both for search engines and for Webmasters, and Joe Webmaster’s life would be way more comfortable with REP tags in robots.txt.

A better question would be “will search engines implement REP tags for robots.txt in a way that Webmasters can live with?”. Although Google launched the Sitemaps protocol without significant help from the Webmaster community, I strongly feel that they desperately need our support with this move.

Currently it looks like they will fuck up the REP, or rather the robots.txt standard, hence go grab your AdWords rep and choke her/him until s/he promises to involve Larry, Sergey, Matt, Adam, John, and the whole Webmaster Support Team for the sake of common sense and the worldwide Webmaster community. Thank you!


Google to change the Robots Exclusion Protocol again

Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), have evolved since 1994. Nowadays we have to deal with a conglomerate of non-binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey “the standard”, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives, as a rule differing from the other players.

Google is the search engine putting the most effort into Robots Exclusion Protocol (REP) evolvement. Their XML Sitemaps, handling submissions instead of crawl restrictions, widened the REP’s scope; the X-Robots-Tag brought us robots meta tags for non-HTML resources like PDF documents, images or video clips; and with Unavailable_after Google made a few clueless news sites happy. With the rel-nofollow microformat on the other hand, respectively its sneaky morphing from a spam fighting tool to its current shape, Google made nobody happy. Yahoo contributed the well-meant but half-assed “robots-nocontent” class name, and of course “noydir” (it’s unlikely that any other engine will support those).

Now Google is working on new robots.txt syntax, and I am, politely put, not amused. Here is why I fear that Google is going to totally mess up the REP:

Google supports a “Noindex:” directive in robots.txt, which is treated as “Disallow:”1). Of course that’s an experiment, but if this behavior doesn’t change we’ll get a beast that is –with regard to the confusion it will produce– way more evil than the rel-nofollow fiasco.

  • A noindex-alias for disallow makes no sense, even when such syntax errors are out there.
  • Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It’s bad enough that most Webmasters still believe that “Googlebot ranks their stuff”. (Actually, in some cases it can make sense. For example “nofollow” in robots meta tags (or at least for Google in REL attributes too) is both a crawler instruction as well as an indexer directive.)
  • Noindex and disallow are completely different commands. The REP’s noindex directive means “crawl it, follow its links, but don’t list it on the SERPs”. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.

Standards should be clear and unambiguous. Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they have the power to fuck up the REP, but they also have “do no evil”.

Considering that Google is run by a bunch of smart engineers, I hope that they’ll do the right thing eventually. The right thing in this case is giving more power to REP evolvements, before questionable and selfish anti-search initiatives like ACAP ruin both the robots.txt consensus as well as the robots meta tag standard.

My idea of more power to REP evolvements is:

  • Sensible implementation of crawler/indexer directives adapted from REP tags in robots.txt. Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully nopreview) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header’s X-Robots-Tag.
  • Implementation of block-level directives in robots.txt. Allowing Webmasters to apply crawler instructions like “noindex” or “nofollow” to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eliminate many points of failure.
  • Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.

The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the bigger engines too) talks with Webmasters and then launches the necessary REP extensions. The other engines will follow sooner or later. The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers.

1) Noindex: is not a plain Disallow:, there’s an interesting difference. In Google’s experiment both directives block crawling, but Disallow: allows URL-indexing based on 3rd party information, and Disallow:‘ed URLs can accumulate PageRank from internal as well as external links. Noindex:‘ed URLs on the other hand will not appear on SERPs as URL-only listing or with an ODP title and snippet, and I’m quite sure that they will not gather PageRank nor other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google’s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that’s a neat idea, but its implementation is flawed because it doesn’t consider the implicit Follow: (that’s syntax Google doesn’t support in robots.txt). A better way to mark site areas which shall not gather PageRank without raping the REP would be a Norank: directive or so. Noindex: without a Nofollow: must not block crawling. Googlebot must fetch those URLs to follow their links.


SEO-sanitizing a WordPress theme in 5 minutes

When you start a blog with WordPress, you get overall good crawlability, like with most blogging platforms. To get it ranked at search engines your first priority should be to introduce it to your communities, acquiring some initial link love. However, those natural links come with a disadvantage: canonicalization issues.

“Canonicalization”, what a geeky word. You won’t find it in your dictionary, you must ask Google for a definition. Do the search, the number one result leads to Matt’s blog. Please study both posts before you read on.

Most bloggers linking to you will copy your URLs from the browser’s address bar, or use the neat Firefox “Copy link location” thingy, which leads to canonical inbound links so to say. Others will type in incomplete URLs, or “correct” pasted URLs by removing trailing slashes, “www” prefixes or whatever. Unfortunately, usually both your Web server and WordPress are smart enough to find the right page, or so your browser says at least. What happens totally unseen in the background is that some of these page requests produce a 302-Found elsewhere response, and that search engine crawlers get fed with various URLs all pointing to the same piece of content. That’s a bad thing with regard to search engine rankings (and enough stuff for a series of longish posts, so just trust me).

Let’s begin the WordPress SEO-sanitizing with a fix of the most popular canonicalization issues. Your first step is to tell WordPress that you prefer sane and meaningful URLs without gimmicks. Go to the permalink options, check custom, type in /%postname%/ and save. Later on give each post a nice keyword rich title like “My get rich in a nanosecond scam” and a corresponding slug like “get-rich-in-a-nanosecond”. Next create a plain text file with this code

# Disallow directory browsing:
Options -Indexes
<IfModule mod_rewrite.c>
RewriteEngine On
# Fix www vs. non-www issues:
RewriteCond %{HTTP_HOST} !^your-blog\.com [NC]
RewriteRule (.*) http://your-blog.com/$1 [R=301,L]
# WordPress permalinks:
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

and upload it in ASCII mode to your server’s root as “.htaccess” (if you don’t host your blog in the root or prefer the “www” prefix, change the code accordingly). Change your-blog to your domain name, or to www\.your-domain-name if you go for the “www” prefix.

This setup will not only produce user friendly post URLs like /get-rich-in-a-nanosecond/, it will also route all server errors to your theme’s error page. If you don’t blog in the root, learn here how you should handle HTTP errors outside the /blog/ directory (in any case you should use ErrorDocument directives to capture stuff WordPress can’t/shouldn’t handle, e.g. 401, 403, 5xx errors). Load 404.php in an ASCII editor to check whether it will actually send a 404 response. If the very first lines of code don’t look like

@header("HTTP/1.1 404 Not found", TRUE, 404);

then insert the code above and make absolutely sure that you’ve not a single whitespace (space, tab, new line) or visible character before the <?php (grab the code). It doesn’t hurt to make the 404 page friendlier by the way, and don’t forget to check the HTTP response code. Consider calling a 404 grabber before you send the 404 header; this is a neat method to do page-by-page redirects, capturing outdated URLs before the visitor gets the error page.
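Such a 404 grabber could look like this (a rough sketch with made-up URLs; maintain the redirect map yourself):
<?php
// hypothetical page-by-page redirect map: outdated URI => current URI
$redirects = array(
    "/old-scam-post/" => "/get-rich-in-a-nanosecond/",
    "/about.html" => "/about/"
);
$requestUri = $_SERVER["REQUEST_URI"];
if (isset($redirects[$requestUri])) {
    // known outdated URL: 301 to the new location instead of serving the error page
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: http://your-blog.com" . $redirects[$requestUri]);
    exit;
}
// otherwise fall through to the regular 404 response
@header("HTTP/1.1 404 Not found", TRUE, 404);
?>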

Next you need to hack header.php to fix canonicalization issues which the rewrite rule in the .htaccess file above doesn’t cover. By default WordPress delivers the same page for both the canonical URL and the crappy variant without the trailing slash. Unfortunately many people who are unaware of Web standards, as well as scripts written by clueless assclowns, remove trailing slashes to save bandwidth (a lame excuse for this bullshit by the way, although even teeny search engines suffer from brain dead code monkeys implementing crap like that).

At the very top of header.php add this PHP code:

$requestUri = $_SERVER["REQUEST_URI"];
$uriArr = explode("#", $requestUri);
$requestUriBase = $uriArr[0];
$requestUriFragment = isset($uriArr[1]) ? $uriArr[1] : ""; // fragments rarely reach the server, guard anyway
$uriArr = explode("?", $requestUriBase);
$requestUriBase = $uriArr[0];
$queryString = $_SERVER["QUERY_STRING"];
if ( substr($requestUriBase, strlen($requestUriBase) -1, 1) != "/" ) {
    $canonicalUrl = "http://your-blog.com"; // put your blog's canonical server name here
    $canonicalUrl .= $requestUriBase ."/";
    if ($queryString) {
        $canonicalUrl .= "?" .$queryString;
    }
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit; // stop WordPress from rendering the non-canonical variant
}
(Again, not a single whitespace (space, tab, new line) or visible character before the <?php! Grab this code.)

Of course you need to put your own blog URL in the canonicalUrl variable, but I don’t mind if you forget it. There’s no such thing as bad traffic. Beyond the canonicalization done above, at this point you can perform all sorts of URL checks and manipulations.

Now you understand why you’ve added the trailing slash in the permalink settings. Not only does the URL look better as a directory link, the trailing slash also allows you to canonicalize your URLs with ease. This works with all kinds of WordPress URLs, even archives (although they shall not be crawlable), category archives, pages, of course the main page, and whatnot. It can break links when your template has hard coded references to “index.php”, which is quite common with the search form and needs a fix anyway, because it leads to at least two URLs serving identical content.

It’s possible to achieve that with a hack in the blog root’s index.php, or with a PHP script called in .htaccess that handles canonicalization and then includes index.php. The index.php might be touched when you update WordPress, so that’s a file you shouldn’t hack. Use the other variant if you see serious performance issues caused by running through the whole WordPress logic before a possible 301-redirect is finally done in the template’s header.php.

While you’re at the header.php file, you should fix the crappy post titles WordPress generates by default. Prefixing title tags with your blog’s name is downright obscene, irritates readers, and kills your search engine rankings. Hence replace the PHP statement in the TITLE tag with
$pageTitle = wp_title("", false);
if (empty($pageTitle)) {
    $pageTitle = get_bloginfo("name");
}
$pageTitle = trim($pageTitle);
print $pageTitle;
(Grab code.)

Next delete the references to the archives in the HEAD section:
<?php // wp_get_archives('type=monthly&format=link'); ?>
The "//" tells PHP that the line contains legacy code, SEO-wise at least. If your template comes with “index,follow” robots meta tags and other useless meta crap, delete these unnecessary meta tags too.

Well, there are a few more hacks which make sense, for example level-1 page links in the footer and so on, but lets stick with the mere SEO basics. Now we proceed with plugins you really need.

  • Install automated meta tags, activate and forget it.
  • Next grab Arne’s sitemaps plugin, activate it and uncheck the archives option in the settings. Don’t generate or submit a sitemap.xml before you’re done with the next steps!
  • Because you’re a nice gal/guy, you want to pass link juice to your commenters. Hence you install Nofollow Case by Case or another dofollow plugin saving you from nofollow insanity.
  • Stephen Spencer’s SEO title tag plugin is worth a try too. I didn’t get it to work on this blog (that’s why I hacked the title tag’s PHP code in the first place), but that was probably caused by alzheimer light (=lame excuse for either laziness or goofiness) because it works fine on other blogs I’m involved with. Also, meanwhile I’ve way more code in the title tag, for example to assign sane titles to URLs with query strings, so I can’t use it here.
  • To eliminate the built-in death by pagination flaw –also outlined here–, you install PagerFix from Richard’s buddy Jaimie Sirovich. Activate it, then hack your templates (category-[category ID].php, category.php, archive.php and index.php) at the very bottom:
    // posts_nav_link(' — ', __('« Previous Page'), __('Next Page »'));
    (Grab code.) The pager_fix() function replaces the single previous/next links with links pointing to all relevant pages, so that every post is maximal two clicks away from the main page respectively its categorized archive page. Clever.

Did I really promise that applying basic SEO to a WordPress blog is done in five minutes? Well, that’s a lie, but you knew that beforehand. I want you to change your sidebar too. First set the archives widget to display as a drop down list or remove it completely. Second, if you’ve more than a handful of categories, remove the categories widget and provide another topically organized navigation like category index pages. Linking to every category from every page dilutes relevancy with regard to search engine indexing, and is –with long lists of categories in tiny sidebar widgets– no longer helpful to visitors.

When you’re at your sidebar.php, go check whether the canonicalization recommended above broke the site search facility or not. If you find a line of HTML code like
<form id="searchform" method="get" action="./index.php">

then search is defunct. You should replace this line by
<form id="searchform" method="post” action=”<?php
$searchUrl = get_bloginfo(’url’);
if (substr($searchUrl, -1, 1) != “/”) {
$searchUrl .= “/”;
print $searchUrl; ?>”>
(grab code)
and here is why: The URL canonicalization routine adds a trailing slash to ./index.php, which results in a 404 error. Next, if the method is “get” you really want to replace that with “post”, because firstly, with regard to Google’s guidelines, crawlable search results are a bad idea, and secondly, GET forms are a nice hook for negative SEO (that means that folks not exactly on your buddy list can get you ranked for all sorts of naughty out-of-context search terms).

Finally fire up your plain text editor again and create a robots.txt file:

User-agent: *
# …
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2010/
# …
Sitemap: http://your-blog.com/sitemap.xml
(If you go for the “www” thingy then you must write “http://www.your-blog.com/sitemap.xml” in the sitemaps-autodiscovery statement! The robots.txt goes to the root directory; change the paths to /blog/2007/ etcetera if you don’t blog in the root.)

You may ask why I tell you to remove all references to the archives. The answer is that firstly nobody needs them, and secondly they irritate search engines with senseless and superfluous content duplication. As long as you provide logical, topically organized and short paths to your posts, none of your visitors will browse the archives. Would you use the white pages to look up a phone number if the entries weren’t ordered alphabetically but by date of birth instead? Nope; only blogging software produces crap like that as its sole, or at least primary, navigation. There are only very few good reasons to browse a blog’s monthly archives, thus a selection list is the perfect navigational widget in this case.

Once you’ve written a welcome post, submit your sitemap to Google and Yahoo!, and you’re done with your basic WordPress SEOing. Bear in mind that you won’t receive shitloads of search engine traffic before you’ve acquired good inbound links. Once you have, however, you’ll do much better with the search engines when your crawlability is close to perfect.

Updated 09/03/2007 to add Richard’s pager_fix tip from the comments.

Updated 09/05/2007 Lucia made a very good point. When you copy the code from this page, where WordPress “prettifies” even <code>PHP code</code>, you end up with crap. I had to learn that not every reader knows that code must be changed when copied from a page where WordPress replaces wonderful plain single as well as double quotes within code/pre tags with fancy symbols stolen from M$-Word. (Actually, I knew it, but sometimes I’m bone lazy.) So here is the clean PHP code from above (don’t mess with the quotes, especially don’t replace double quotes with single quotes!).


Hassles of submitting a blogspot XML-sitemap

Usually my posts make it into Google’s Web index within 2-3 hours, but not yesterday. Since Ms. Googlebot became lazy fetching my pamphlets, I thought she needed a hint. With one of the last updates Blogger’s feed URLs have changed, but lazy as I am I still had the ancient ATOM feed in my sitemaps account. So I grabbed the new URL from the LINK element in HEAD and submitted it as a sitemap. Bugger me. Not enough tea this morning. I didn’t look at the URL, just copied and pasted it, then submitted the feed to no avail. Oops. Here is why it didn’t work: Blogspot doesn’t come with built-in XML sitemaps, but one can use the feeds. That’s definitely not a perfect solution, because the feeds list only a few recent posts, but better than nothing.

Here are the standard feed URLs of a blogger blog at sebastianx.blogspot.com (replace “sebastianx” with your subdomain):
http://sebastianx.blogspot.com/feeds/posts/default (ATOM, posts)
http://sebastianx.blogspot.com/feeds/posts/default?alt=rss (RSS, posts)
http://sebastianx.blogspot.com/feeds/comments/default (ATOM, comments)

None of these can be used as a sitemap, because the post-URLs are not located under the sitemap path.

Fortunately, the old feeds still work, although they are served as “text/html”, which can confuse things, so I have to stick with http://sebastianx.blogspot.com/atom.xml as “sitemap”.


Is XML Sitemap Autodiscovery for Everyone?

Referencing XML sitemaps in robots.txt was recently implemented by Google upon requests from webmasters going back to June 2005, shortly after the initial launch of Sitemaps. Yahoo, Microsoft, and Ask support it too, although nobody knows when MSN is going to implement XML sitemaps at all.

Some folks argue that robots.txt, introduced by the Robots Exclusion Protocol in 1994, should not get abused by inclusion mechanisms. Indeed this may create confusion, but it was done before, for example by search engines supporting Allow: statements introduced in 1996. Also, the de facto Robots Exclusion Standard covers robots meta tags, where inclusion is the default, too. I think dogmatism is not helpful when the actual needs require evolvement.

So yes, the opportunity to address sitemaps in robots.txt is a good thing, but certainly not enough. It simplifies the process, that is, auto detection of sitemaps eliminates a few points of failure. Webmasters don’t need to monitor which engines implemented the sitemaps protocol recently, and submit accordingly. They can just add a single line to their robots.txt file and the engines will do their job. Fire and forget is a good concept. However, the good news comes with pitfalls.

But is this good thing actually good for everyone? Not really. Many publishers have no control over their server’s robots.txt file, for example publishers utilizing signup-and-instantly-start-blogging services or free hosts. As long as these platforms generate RSS feeds or other URL lists suitable as sitemaps, those publishers must submit to all search engines manually. Enhancing the sitemaps autodetection by looking at page meta data would be great: <meta name="sitemap" content="http://example.com/sitemap.xml" /> or <link rel="sitemap" type="application/rss+xml" href="http://example.com/feed.rss" /> would suffice.

So far the explicit diaspora. Others are barred from sitemap autodiscovery by lack of experience, technical skills, or manageable environments, like at way too restrictive hosting services. Example: the prerequisites for sitemap autodetection include the ability to fix canonicalization issues. An XML sitemap containing www.domain.tld URLs referenced as Sitemap: http://www.domain.tld/sitemap.xml in http://domain.tld/robots.txt is plain invalid. Crawlers following links without the “www” subdomain will request the robots.txt file without the “www” prefix. If a webmaster running this flawed but very common setup relies on sitemap autodetection, s/he will miss out on feedback respectively error alerts. On some misconfigured servers this may even lead to deindexing of all pages with relative internal links.
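A minimal sketch of the fix, assuming Apache with mod_rewrite and the “www” variant as the canonical server name:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^domain\.tld$ [NC]
RewriteRule (.*) http://www.domain.tld/$1 [R=301,L]
With that 301 in place, crawlers end up at the host referenced in the Sitemap: line no matter which variant they request.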

Hence please listen to Vanessa Fox stating that webmasters shall register their autodiscovered sitemaps at Webmaster Central and Site Explorer to get alerted on errors which an XML sitemap validator cannot spot, and to monitor the crawling process!

I doubt many SEO professionals and highly skilled Webmasters managing complex sites will make use of that new feature. They prefer to have things under control, and automated 3rd party polls are hard to manipulate. Probably they want to maintain different sitemaps per engine to steer their crawling accordingly. Although this can be accomplished by user agent based delivery of robots.txt, that additional complexity doesn’t make the submission process easier to handle. Only uber-geeks automate everything ;)

For example it makes no sense to present a gazillion image or video clip URLs to a search engine indexing textual content only. Google handles different content types in a way that’s extremely simple for the site owner: one can put HTML pages, images, movies, PDFs, feeds, office documents and whatever else all in one sitemap, and Google’s sophisticated crawling process delivers each URL to the indexer it belongs to. We don’t know (yet) how other engines will handle that.

Also, XML sitemaps are a neat instrument to improve crawling and indexing of particular content. One search engine may nicely index insufficiently linked stuff, whilst another engine fails to discover pages buried more than two link levels deep, badly needing the hints via sitemap. There are more good reasons to give each engine its own sitemap.

Last but not least there might be good reasons not to announce sitemap contents to the competition.


In need of a "Web-Robot Directives Standard"

The Robots Exclusion Protocol from 1994 gets used and abused, best described by Lisa Barone citing Dan Crow from Google: “everyone uses it but everyone uses a different version of it”. De facto we have a Robots Exclusion Standard covering crawler directives in robots.txt and robots meta tags as well, said Dan Crow. Besides non-standardized directives like “Allow:”, Google’s Sitemaps Protocol adds more inclusion to the mix, now even closely bundled with robots.txt. And there are more ways to put crawler directives: unstructured (in the sense of independent from markup elements) like Google’s section targeting, on link level applying the commonly disliked rel-nofollow microformat or XFN, and related thoughts on block level directives.

All in all that’s a pretty confusing conglomerate of inclusion and exclusion, utilizing many formats respectively markup elements, and lots of places to put crawler directives. Not really the sort of norm the webmaster community can successfully work with. No wonder that over 75,000 robots.txt files have pictures in them, that less than 35 percent of servers have a robots.txt file, that the average robots.txt file is 23 characters (“User-agent: * Disallow:”), and that gazillions of Web pages carry useless and unsupported meta tags like “revisit-after” … for more funny stats and valuable information see Lisa’s robots.txt summit coverage (SES NY 2007), also covered by Tamar (read both!).

How to structure a “Web-Robot Directives Standard”?

To handle redundancies as well as cascading directives properly, we need a clear and understandable chain of command. The following is just a first idea off the top of my head, and will likely get updated soon:

  • Robots.txt

    1. Disallows directories, files/file types, and URI fragments like query string variables/values by user agent.
    2. Allows sub-directories, file names and URI fragments to refine Disallow statements.
    3. Gives general directives like crawl-frequency or volume per day and maybe even folders, and restricts crawling in particular time frames.
    4. References general XML sitemaps accessible to all user agents, and specific XML sitemaps addressing particular user agents as well.
    5. Sets site-level directives like “noodp” or “noydir”.
    6. Predefines page-level instructions like “nofollow”, “nosnippet” or “noarchive” by directory, document type or URL fragments.
    7. Predefines block-level respectively element-level conditions like “noindex” or “nofollow” on class names or DOM-IDs by markup element. For example "DIV.hMenu,TD#bNav 'noindex,nofollow'" could instruct crawlers to ignore the horizontal menu as well as navigation at the very bottom on all pages.
    8. Predefines attribute-level conditions like “nofollow” on A elements. For example "A.advertising REL 'nofollow'" could tell crawlers to ignore links in ads, or "P#tos > A 'nofollow'" could instruct spiders to ignore links in TOS excerpts found on every page in a P element with the DOM-ID “tos”.
  • XML Sitemaps

    1. Since robots.txt deals with inclusion now, why not add an optional URL specific “action” element allowing directives like “nocache” or “nofollow”? Also a “delete” directive to get outdated pages removed from search indexes would make sound sense.
    2. To make XML sitemap data reusable, and to allow centralized maintenance of page meta data, a couple of new optional URL elements like “title”, “description”, “document type”, “language”, “charset”, “parent” and so on would be a neat addition. This way it would be possible to visualize XML sitemaps as native (and even hierarchical) site maps.

    Robots.txt exclusions overrule URLs listed for inclusion in XML sitemaps.

  • Meta Tags

    Page meta data overrule directives and information provided in robots.txt and XML sitemaps. Empty contents in meta tags suppress directives and values given in upper levels. Non-existent meta tags implicitly apply data and instructions from upper levels. The same goes for everything below.

  • Body Sections

    Unstructured parenthesizing of parts of code certainly is undoable with XMLish documents, but may be a pragmatic procedure to deal with legacy code. Paydirt in HTML comments may be allowed to mark payload for contextual advertising purposes, but it’s hard to standardize. Let’s leave that for proprietary usage.

  • Body Elements

    Implementing a new attribute for messages to machines should be avoided for several good reasons. Classes are additive, so multiple values can be specified for most elements. That would allow putting standardized directives as class names, for example class="menu robots-noindex googlebot-nofollow slurp-index-follow" where the first class addresses CSS. Such inline robot directives come with the same disadvantages as inline style assignments and open a can of worms, so to say. Using classes and DOM-IDs just as a reference to user agent specific instructions given in robots.txt is surely the preferable procedure.

  • Element Attributes

    More or less this level is a playground for microformats utilizing the A element’s REV and REL attributes.

Besides the common values “nofollow”, “noindex”, “noarchive”/”nocache” etc. and their omissible positive defaults “follow” and “index” etc., we’d need a couple more, for example “unapproved”, “untrusted”, “ignore” or “skip” and so on. There’s a lot of work to do.

In terms of complexity, a mechanism as outlined above should be as easy to use as CSS in combination with client sided scripting for visualization purposes.

However, whatever better ideas are out there, we need a widely accepted “Web-Robot Directives Standard” as soon as possible.


XML sitemap auto-discovery

Vanessa Fox makes the news: In addition to sitemap submissions you can add this line to your robots.txt file (placeholder URL):

Sitemap: http://example.com/sitemap.xml
Google, Yahoo, MSN and Ask (new on board) then should fetch and parse the XML sitemap automatically. Next week or so the cool robots.txt validator will get an update too.

Question: Is XML Sitemap Autodiscovery for Everyone?

More info here.

Q: Does it work by user-agent?
A: Yes, add all sitemaps to robots.txt, then disallow by engine.

Q: Must I fix canonical issues before I use sitemap autodiscovery?
A: Yes, 301-redirect everything to your canonical server name, and choose a preferred domain at Webmaster Central.

Q: Can I submit all supported sitemap formats via robots.txt?
A: Yes, everything goes. XML, RSS, ATOM, plain text, Gzipped …


Google’s Sitemaps Team Interviewed

Recently I had the great opportunity to interview the Google Sitemaps Team: Shiva Shivakumar, who started the program in June 2005; Vanessa Fox, the extremely helpful spokeswoman who blogs for the Sitemaps team and assists Webmasters in the newsgroup; her coworkers Michael and Andrey from Google’s Kirkland office; Grace and Patrik from the branch in Zurich; and Shal from the Googleplex in Mountain View. Matt Cutts chimed in with some good advice too.

I want to thank those friendly Googlers for taking the time to contribute loads of great technical advice and extremely valuable information to the Google Sitemaps Knowledge Base.

Besides Sitemaps related information, the interview provides the ultimate answer to the endless “404 vs. 410” debate, explains the URL removal tool Matt Cutts was talking about yesterday at WebmasterRadio, and provides hints to optimize dynamic Web sites … hey, just read it to find all the gems:

Google Sitemaps Team Interview

Consider bookmarking the Google Sitemaps Info Page and subscribing to the Sitemaps Feed to get alerted on future stuff like that.


Google Sitemaps

I plan to launch the Google Sitemaps Knowledge Base (BETA) soon. So if anybody has a neat Sitemaps related question which is of common interest, or an educational article I can publish exclusively, please submit it in the comments or drop me a message. Thanks in advance.

By the way, I’ve created a new RSS feed consolidating all Google Sitemaps stuff from the tutorial, FAQ, knowledge base (BETA), XML validation and whatever.

