Archived posts from the 'robots.txt' Category

Yahoo! search going to torture Webmasters

According to Danny, Yahoo! supports a multi-class nonsense called the robots-nocontent tag. CRAP ALERT!

Can you senseless and cruel folks at Yahoo! search imagine how many of my clients who’d like to use that feature have copied and pasted their pages? Do you have a clue how many sites out there don’t make use of SSI, PHP or ASP includes, how many sites have never heard of dynamic content delivery, and how many sites can’t use proper content delivery techniques because they have to deal with legacy systems and ancient business processes? Did you ask how common templated Web design is, and I mean the weird static variant, where a new page gets built from a randomly selected source page saved as new-page.html?

It’s great that you came out with a bastardized copy of Google’s somewhat hapless (in the sense of cluttering structured code) section targeting, because we dreadfully need that functionality across all engines. And I admit that your approach is a little better than AdSense section targeting because you don’t mark payload by paydirt in comments. But why the heck did you design it that crappy? The unthoughtful draft of a microformat from which you’ve “stolen” that unfortunate idea didn’t become a standard for very good reasons. Because it’s crap. Assigning multiple class names to markup elements for the sole purpose of setting crawler directives is as crappy as inline style assignments.

Well, due to my zero-bullshit tolerance I’m somewhat upset, so I repeat: Yahoo’s robots-nocontent class name is crap by design. Don’t use it, boycott it, because if you make use of it you’ll change gazillions of files for each and every proprietary syntax supported by a single search engine in the future. When the united search geeks can agree on flawed standards like rel-nofollow, they should be able to talk about a sensible evolution of robots.txt.

There’s a way easier solution, which doesn’t require editing tons of source files: standardizing CSS-like syntax to assign crawler directives to existing classes and DOM-IDs. For example, extend robots.txt syntax like this:

A.advertising { rel: nofollow; } /* devalue aff links */

DIV.hMenu, TD#bNav { content:noindex; rel:nofollow; } /* make site wide links unsearchable */

Unsupported robots.txt syntax doesn’t harm, proprietary attempts do harm!
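To show that such a syntax would be cheap to consume, here is a minimal sketch of a parser for the two example lines above. The grammar (selectors, curly braces, `name: value;` pairs, CSS-style comments) is only my reading of those examples, and the names `parse_directives`, `RULE_RE` and `COMMENT_RE` are made up for illustration. No search engine supports anything like this.

```python
import re

# Hypothetical grammar, modeled on the two example lines above:
#   selector[, selector ...] { name: value; ... } /* optional comment */
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)


def parse_directives(text):
    """Return a dict mapping each selector to its {directive: value} pairs."""
    rules = {}
    cleaned = COMMENT_RE.sub("", text)  # drop /* ... */ comments first
    for selectors, body in RULE_RE.findall(cleaned):
        directives = {}
        for declaration in body.split(";"):
            if ":" not in declaration:
                continue  # skip empty or malformed declarations
            name, value = declaration.split(":", 1)
            directives[name.strip().lower()] = value.strip().lower()
        # A rule may list several comma-separated selectors
        for selector in selectors.split(","):
            selector = selector.strip()
            if selector:
                rules.setdefault(selector, {}).update(directives)
    return rules


sample = """
A.advertising { rel: nofollow; } /* devalue aff links */
DIV.hMenu, TD#bNav { content:noindex; rel:nofollow; }
"""
rules = parse_directives(sample)
```

A crawler that doesn't know the syntax would simply skip these lines, which is the whole point.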

Dear search engines, get together and define something useful, before each of you comes out with different half-baked workarounds like section targeting or robots-nocontent class values. Thanks!


Is XML Sitemap Autodiscovery for Everyone?

Referencing XML sitemaps in robots.txt was recently implemented by Google upon requests of webmasters going back to June, 2005, shortly after the initial launch of sitemaps. Yahoo, Microsoft, and Ask support it, although nobody knows when MSN is going to implement XML sitemaps at all.

Some folks argue that robots.txt, introduced by the Robots Exclusion Protocol from 1994, should not get abused by inclusion mechanisms. Indeed this may create confusion, but it was done before, for example by search engines supporting Allow: statements introduced in 1996. Also, the de facto Robots Exclusion Standard covers robots meta tags –where inclusion is the default– too. I think dogmatism is not helpful when the actual needs require evolution.

So yes, the opportunity to address sitemaps in robots.txt is a good thing, but certainly not enough. It simplifies the process: auto detection of sitemaps eliminates a few points of failure. Webmasters don’t need to monitor which engines implemented the sitemaps protocol recently, and submit accordingly. They can just add a single line to their robots.txt file and the engines will do their job. Fire and forget is a good concept. However, the good news comes with pitfalls.

But is this good thing actually good for everyone? Not really. Many publishers have no control over their server’s robots.txt file, for example publishers utilizing signup-and-instantly-start-blogging services or free hosts. As long as these platforms generate RSS feeds or other URL lists suitable as sitemaps, the publishers must submit to all search engines manually. Enhancing the sitemaps auto detection by looking at page meta data would be great: <meta name="sitemap" content="" /> or <link rel="sitemap" type="application/rss+xml" href="" /> would suffice.
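Picking up such hints from a page would be trivial for a crawler. The following is a rough sketch using Python's standard `html.parser`. The `meta name="sitemap"` and `link rel="sitemap"` tags are only my suggestion from above, not anything an engine supports, and the class and function names as well as the URLs in the usage note are invented for illustration.

```python
from html.parser import HTMLParser


class SitemapHintParser(HTMLParser):
    """Collects sitemap URLs from the (hypothetical) meta and link tags."""

    def __init__(self):
        super().__init__()
        self.sitemaps = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "sitemap":
            url = attributes.get("content")
        elif tag == "link" and (attributes.get("rel") or "").lower() == "sitemap":
            url = attributes.get("href")
        else:
            return
        if url:  # ignore tags with empty or missing URLs
            self.sitemaps.append(url)


def find_sitemap_hints(html):
    """Return all sitemap URLs announced in the given HTML source."""
    parser = SitemapHintParser()
    parser.feed(html)
    return parser.sitemaps
```

Fed with a page head containing `<meta name="sitemap" content="/feed.xml" />`, `find_sitemap_hints` returns a list with that one URL.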

So much for the explicit diaspora. Others are barred from sitemap autodiscovery by lack of experience, technical skills, or manageable environments, for example at way too restrictive hosting services. Example: the prerequisites for sitemap autodetection include the ability to fix canonical issues. An XML sitemap containing www.domain.tld URLs, referenced as Sitemap: http://www.domain.tld/sitemap.xml in http://domain.tld/robots.txt, is plain invalid. Crawlers following links without the “www” subdomain will request the robots.txt file without the “www” prefix. If a webmaster running this flawed but very common setup relies on sitemap autodetection, s/he will miss out on feedback and error alerts. On some misconfigured servers this may even lead to deindexing of all pages with relative internal links.
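The host name mismatch described above is easy to spot with a few lines of code. Here is a minimal sketch, assuming you already have the robots.txt body as a string. The function name `sitemap_host_mismatches` is made up, and the check only compares host names, it doesn't validate the sitemap itself.

```python
from urllib.parse import urlsplit


def sitemap_host_mismatches(robots_url, robots_body):
    """List Sitemap: URLs whose host differs from the robots.txt host."""
    robots_host = (urlsplit(robots_url).hostname or "").lower()
    mismatches = []
    for line in robots_body.splitlines():
        # Split "Sitemap: http://..." at the first colon only
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "sitemap":
            continue
        sitemap_url = value.strip()
        sitemap_host = (urlsplit(sitemap_url).hostname or "").lower()
        if sitemap_host != robots_host:
            mismatches.append(sitemap_url)
    return mismatches


body = "User-agent: *\nDisallow:\nSitemap: http://www.domain.tld/sitemap.xml\n"
flawed = sitemap_host_mismatches("http://domain.tld/robots.txt", body)
clean = sitemap_host_mismatches("http://www.domain.tld/robots.txt", body)
```

With the flawed setup from the example, `flawed` holds the www sitemap URL, while `clean` stays empty.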

Hence please listen to Vanessa Fox stating that webmasters shall register their autodiscovered sitemaps at Webmaster Central and Site Explorer to get alerted on errors which an XML sitemap validator cannot spot, and to monitor the crawling process!

I doubt many SEO professionals and highly skilled Webmasters managing complex sites will make use of that new feature. They prefer to have things under control, and automated 3rd party polls are hard to manipulate. Probably they want to maintain different sitemaps per engine to steer their crawling accordingly. Although this can be accomplished by user agent based delivery of robots.txt, that additional complexity doesn’t make the submission process easier to handle. Only uber-geeks automate everything ;)

For example it makes no sense to present a gazillion image or video clip URLs to a search engine indexing textual contents only. Google handles different content types extremely simply for the site owner. One can put HTML pages, images, movies, PDFs, feeds, office documents and whatever else all in one sitemap, and Google’s sophisticated crawling process delivers each URL to the indexer it belongs to. We don’t know (yet) how other engines will handle that.

Also, XML sitemaps are a neat instrument to improve crawling and indexing of particular contents. One search engine may nicely index insufficiently linked stuff, whilst another engine fails to discover pages buried more than two link levels deep, badly needing the hints via sitemap. There are more good reasons to give each engine its own sitemap.

Last but not least there might be good reasons not to announce sitemap contents to the competition.


In need of a "Web-Robot Directives Standard"

The Robots Exclusion Protocol from 1994 gets used and abused, best described by Lisa Barone citing Dan Crow from Google: “everyone uses it but everyone uses a different version of it”. De facto we’ve a Robots Exclusion Standard covering crawler directives in robots.txt and robots meta tags as well, said Dan Crow. Besides non-standardized directives like “Allow:”, Google’s Sitemaps Protocol adds more inclusion to the mix, now even closely bundled with robots.txt. There are more ways to put crawler directives. Unstructured (in the sense of independence from markup elements) like with Google’s section targeting, on link level applying the commonly disliked rel-nofollow microformat or XFN, and related thoughts on block level directives.

All in all that’s a pretty confusing conglomerate of inclusion and exclusion, utilizing many formats and markup elements, and lots of places to put crawler directives. Not really the sort of norm the webmaster community can successfully work with. No wonder that over 75,000 robots.txt files have pictures in them, that less than 35 percent of servers have a robots.txt file, that the average robots.txt file is 23 characters (“User-agent: * Disallow:”), and that gazillions of Web pages carry useless and unsupported meta tags like “revisit-after”. For more funny stats and valuable information see Lisa’s robots.txt summit coverage (SES NY 2007), also covered by Tamar (read both!).

How to structure a “Web-Robot Directives Standard”?

To handle redundancies as well as cascading directives properly, we need a clear and understandable chain of command. The following is just a first idea off the top of my head, and likely gets updated soon:

  • Robots.txt

    1. Disallows directories, files/file types, and URI fragments like query string variables/values by user agent.
    2. Allows sub-directories, file names and URI fragments to refine Disallow statements.
    3. Gives general directives like crawl-frequency or volume per day and maybe even folders, and restricts crawling in particular time frames.
    4. References general XML sitemaps accessible to all user agents, and specific XML sitemaps addressing particular user agents as well.
    5. Sets site-level directives like “noodp” or “noydir”.
    6. Predefines page-level instructions like “nofollow”, “nosnippet” or “noarchive” by directory, document type or URL fragments.
    7. Predefines block-level respectively element-level conditions like “noindex” or “nofollow” on class names or DOM-IDs by markup element. For example “DIV.hMenu,TD#bNav ‘noindex,nofollow’” could instruct crawlers to ignore the horizontal menu as well as navigation at the very bottom on all pages.
    8. Predefines attribute-level conditions like “nofollow” on A elements. For example “A.advertising REL ‘nofollow’” could tell crawlers to ignore links in ads, or “P#tos > A ‘nofollow’” could instruct spiders to ignore links in TOS excerpts found on every page in a P element with the DOM-ID “tos”.
  • XML Sitemaps

    1. Since robots.txt deals with inclusion now, why not add an optional URL specific “action” element allowing directives like “nocache” or “nofollow”? Also a “delete” directive to get outdated pages removed from search indexes would make sound sense.
    2. To make XML sitemap data reusable, and to allow centralized maintenance of page meta data, a couple of new optional URL elements like “title”, “description”, “document type”, “language”, “charset”, “parent” and so on would be a neat addition. This way it would be possible to visualize XML sitemaps as native (and even hierarchical) site maps.

    Robots.txt exclusions overrule URLs listed for inclusion in XML sitemaps.

  • Meta Tags

    Page meta data overrule directives and information provided in robots.txt and XML sitemaps. Empty contents in meta tags suppress directives and values given in upper levels. Non-existent meta tags implicitly apply data and instructions from upper levels. The same goes for everything below.

  • Body Sections

    Unstructured parenthesizing of parts of code certainly is undoable with XMLish documents, but may be a pragmatic procedure to deal with legacy code. Paydirt in HTML comments may be allowed to mark payload for contextual advertising purposes, but it’s hard to standardize. Let’s leave that for proprietary usage.

  • Body Elements

    Implementing a new attribute for messages to machines should be avoided for several good reasons. Classes are additive, so multiple values can be specified for most elements. That would allow putting standardized directives as class names, for example class="menu robots-noindex googlebot-nofollow slurp-index-follow" where the first class addresses CSS. Such inline robot directives come with the same disadvantages as inline style assignments and open a can of worms, so to say. Using classes and DOM-IDs just as a reference to user agent specific instructions given in robots.txt is surely the preferable procedure.

  • Element Attributes

    More or less this level is a playground for microformats utilizing the A element’s REV and REL attributes.
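The chain of command outlined in the list above boils down to a simple cascade. As a rough sketch, assuming every level delivers its directives as plain name/value pairs: a lower level overrules the levels above it, an empty value suppresses what the upper levels said, and a missing name inherits. The names `LEVELS` and `resolve_directives`, as well as the directive names in the example, are made up for illustration. The special rule that robots.txt exclusions overrule sitemap inclusion is not covered here.

```python
# Order of precedence from the list above, upper levels first (hypothetical labels)
LEVELS = ["robots.txt", "xml sitemap", "meta tags",
          "body sections", "body elements", "element attributes"]


def resolve_directives(per_level):
    """Merge directive dicts, given in LEVELS order, into the effective set.

    A name missing on a level inherits from the upper levels, an empty
    string suppresses the inherited value, anything else overrules it.
    """
    effective = {}
    for directives in per_level:
        for name, value in directives.items():
            if value == "":
                effective.pop(name, None)  # empty content suppresses
            else:
                effective[name] = value    # lower level overrules
    return effective


effective = resolve_directives([
    {"robots": "noindex,nofollow", "archive": "noarchive"},  # robots.txt
    {},                                                      # XML sitemap: nothing to add
    {"robots": "index,follow"},                              # meta tag overrules robots.txt
    {"archive": ""},                                         # empty value suppresses
])
```

In this example the effective set ends up with "index,follow" for robots and no archive directive at all.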

Besides the common values “nofollow”, “noindex”, “noarchive”/”nocache” etc. and their omissible positive defaults “follow” and “index” etc., we’d need a couple more, for example “unapproved”, “untrusted”, “ignore” or “skip” and so on. There’s a lot of work to do.

In terms of complexity, a mechanism as outlined above should be as easy to use as CSS in combination with client sided scripting for visualization purposes.

However, whatever better ideas are out there, we need a widely accepted “Web-Robot Directives Standard” as soon as possible.


XML sitemap auto-discovery

Vanessa Fox makes the news: In addition to sitemap submissions you can add this line to your robots.txt file:

Sitemap: http://www.domain.tld/sitemap.xml

Google, Yahoo, MSN and Ask (new on board) then should fetch and parse the XML sitemap automatically. Next week or so the cool robots.txt validator will get an update too.

Question: Is XML Sitemap Autodiscovery for Everyone?

More info here.

Q: Does it work by user-agent?
A: Yes, add all sitemaps to robots.txt, then disallow by engine.

Q: Must I fix canonical issues before I use sitemap autodiscovery?
A: Yes, 301-redirect everything to your canonical server name, and choose a preferred domain at Webmaster Central.

Q: Can I submit all supported sitemap formats via robots.txt?
A: Yes, everything goes. XML, RSS, ATOM, plain text, Gzipped …


Why proper error handling is important

Misconfigured servers can prevent search engines from crawling and indexing. I admit that’s news of yesterday. However, standard setups and code copied from low quality resources are underestimated –but very popular– points of failure. According to Google a missing robots.txt file in combination with amateurish error handling can result in invisibility on Google’s SERPs. That’s a very common setup by the way.

Googler Jonathon Simon said:

This way [correct setup] when the Google crawler or other search engine checks for a robots.txt file, they get a 200 response if the file is found and a 404 response if it is not found. If they get a 200 response for both cases then it is ambiguous if your site has blocked search engines or not, reducing the likelihood your site will be fully crawled and indexed.

That’s a very carefully written warning, so I try to rephrase the message between the lines:

If you have no robots.txt and your server responds “Ok” (or 302 on a request of robots.txt followed by a 200 response on request of the error page) when Googlebot tries to fetch it, Googlebot might not be willing to crawl your stuff further, hence your pages will not make it in Google’s search index.
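If you want to check whether your server suffers from this, request a file that cannot exist and look at the raw status code. Here is a minimal sketch using Python's `http.client`, which does not follow redirects, so a 302 to an error page shows up as 302. The function names `fetch_status` and `has_soft_404` are my own, and the probe only covers plain HTTP on the default port.

```python
import http.client
import uuid


def fetch_status(host, path):
    """Return the raw status code for a GET request, without following redirects."""
    connection = http.client.HTTPConnection(host, timeout=10)
    try:
        connection.request("GET", path)
        return connection.getresponse().status
    finally:
        connection.close()


def has_soft_404(host, fetch=fetch_status):
    """True if a path that cannot exist is not answered with 404 or 410."""
    probe = "/" + uuid.uuid4().hex + ".txt"  # random name, practically never a real file
    return fetch(host, probe) not in (404, 410)
```

Calling `has_soft_404("www.domain.tld")` with your own host name tells you whether a missing robots.txt would look ambiguous to a crawler. The `fetch` parameter is only there so the check can be tried without network access.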

If you don’t suffer from IIS (Windows hosting is a horrible nightmare coming with more pitfalls than countable objects in the universe: go find a reliable host) here is a bullet-proof setup.

If you don’t have a robots.txt file yet, create one and upload it today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file-types or areas of your site, refer to the robots.txt manual.

Next look at the .htaccess file in your server’s Web root directory. If your FTP client doesn’t show it, add “-a” to “external mask” in the settings and reconnect. If you find complete URLs in lines starting with “ErrorDocument”, your error handling is screwed up. What happens is that your server does a soft redirect to the given URL, which probably responds with “200-Ok”, and the actual error code gets lost in cyberspace. Sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM-XT hosting, all other error directives pointing to absolute URLs result in crap. Here is a well formed .htaccess sample:

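# Relative paths keep the original error status code intact
ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
# Disallow directory browsing
Options -Indexes
# Keep visitors from reading .htaccess and similar files
<Files ".ht*">
deny from all
</Files>
# Redirect every other host name to the canonical server name
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]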

With “ErrorDocument” directives you can capture other clumsiness as well, for example 500 errors with /server-too-buzzy.html or so. Or make the error handling comfortable using /error.php?errno=[insert err#]. In any case avoid relative URLs (src attribute in IMG elements, CSS/feed links, href attributes of A elements …) on all landing pages. You can test actual HTTP response codes with online header checkers.

The other statements above do different neat things. Options -Indexes disallows directory browsing, the next block makes sure that nobody can read your server directives, and the last three lines redirect invalid server names to your canonical server address.
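To find screwed up error handling without eyeballing every line, you can scan the .htaccess text for “ErrorDocument” directives pointing to complete URLs. The following is a small sketch, assuming the file content is available as a string. The function name `absolute_error_documents` is made up, and the regular expression only covers the plain “ErrorDocument code target” form.

```python
import re

# Matches "ErrorDocument <3-digit code> <target>" at the start of a line
ERROR_DOCUMENT_RE = re.compile(r"^\s*ErrorDocument\s+(\d{3})\s+(\S+)", re.IGNORECASE)


def absolute_error_documents(htaccess_text):
    """Return (status, target) pairs whose target is a complete URL."""
    findings = []
    for line in htaccess_text.splitlines():
        match = ERROR_DOCUMENT_RE.match(line)
        if match and re.match(r"https?://", match.group(2), re.IGNORECASE):
            findings.append((int(match.group(1)), match.group(2)))
    return findings
```

An empty result means all error documents use local paths, so the actual error code survives.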

.htaccess is a plain ASCII file, it can get screwed when you upload it in binary mode or when you change it with a word processor. Best edit it with an ASCII/ANSI editor (vi, notepad) as htaccess.txt on your local machine (most FTP clients choose ASCII mode for text files) and rename it to “.htaccess” on the server. Keep in mind that file names are case sensitive.


Say No to NoFollow Follow-up

Say NO to NOFOLLOW - copyright jlh-design.com

I don’t want to make this the nofollow-blog, but since more and more good folks don’t love the nofollow-beast any more, here is a follow-up on the recent nofollow discussion. Follow the no-to-nofollow trend here:

Loren Baker posts 13 very good reasons why rel=nofollow sucks. He got dugg and buried, but drew tons of responses in the comments, where most people state that rel=nofollow was a failure with regard to the current amount of comment spam, because the spammers spam for traffic, not link love. Well, that’s true, but rel=nofollow at least nullifies the impact spamming of unmoderated blogs had on search results, says Google. Good point, but is it fair to penalize honest comment authors by nofollow’ing their relevant links by default? Not really. The search engines should work harder on solving this problem algorithmically, and CMS vendors should go back to the white board to develop a reasonable solution. Matt Mullenweg from WordPress admits that “in hindsight, I don’t think nofollow had much of an effect [in fighting comment spam]”, and I hope this insight triggers a well thought out workflow replacing the unethical nofollow-by-default (see follow you, follow me).

At Google’s Webmaster Help Center regular posters nag Googlers with questions like Is rel=nofollow becoming the norm? Google’s search evangelist Adam Lasnik stepped in and states “As you might have noticed, many of the world’s most successful sites link liberally to other sites, and this sort of thing is often appreciated by and rewarded by visitors. And if you’re editorially linking to sites you can personally vouch for, I can’t see a reason to no-follow those.” and “On the whole [nofollow thingie], while Matt’s been pretty forthcoming and descriptive, I do think we Googlers on the whole can do a better job in explaining and justifying nofollow“. Thanks Adam, while explaining Google’s take on rel=nofollow to the great unwashed, why not start a major clean-up to extend this microformat and to make it useful, useable and less confusing for the masses?

While waiting for actions promised by the nofollow inventor, here is a good summary of nofollow clarifications by Googlers. I’ve a ton of respect for Matt, I know he listens and picks reasonable arguments even from negative posts, so stay tuned (I do hope my tiny revamp-nofollow campaign is not seen as negative press by the way).

A very good starting point to examine the destructive impact rel=nofollow had, has, and will have if not revamped, is Carsten Cumbrowski’s essay explaining why rel=nofollow leverages mistrust among people. I do not provide quotes because I want you all to read and reread this great article.

Robert Scoble rethinking his nofollow support says “I was wrong about “NoFollow” … I’m very concerned, for instance, about Wikipedia’s use of nofollow“. Scroll down, don’t miss out on the comments.

Michael Gray’s strong statement Google’s policy on No follow and reviews is hypocritical and wrong is worth a read, he’s backing his point of view providing a complete nofollow-history along with many quotes and nofollow-tidbits.
