Archived posts from the 'Robots Meta Tags' Category

Handling Google’s neat X-Robots-Tag - Sending REP header tags with PHP

It’s a bad habit to tell the bad news first, and I’m guilty of that. Yesterday I linked to Dan Crow, telling Google that the unavailable_after tag is useless IMHO. So today’s post is about a great thing: REP header tags aka X-Robots-Tags, unfortunately buried as the second item in Google’s announcement.

The REP is not only a theatre, it stands for Robots Exclusion Protocol (robots.txt and robots meta tag). Everything you can shove into a robots meta tag on an HTML page can now be delivered in the HTTP header for any file type:

  • INDEX|NOINDEX - Tells whether the page may be indexed or not
  • FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided on the page or not
  • ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
  • NOODP - Tells search engines not to use page titles and descriptions from the ODP on their SERPs
  • NOYDIR - Tells Yahoo! Search not to use page titles and descriptions from the Yahoo! directory on its SERPs
  • NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
  • NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
  • UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes a URL from Google’s search index a day after the given date/time
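
For instance, a page-level robots meta tag like

<meta name="robots" content="noindex, noarchive" />

translates into this HTTP header line, which works for any file type, not just HTML:

X-Robots-Tag: noindex, noarchive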

So how can you serve X-Robots-Tags in the HTTP header of PDF files, for example? Here is one possible procedure to explain the basics; just adapt it to your needs:

Rewrite all requests for PDF documents to a PHP script that knows which files must be served with REP header tags. You could do an external redirect too, but that may confuse things. Put this code in your root’s .htaccess:

RewriteEngine On
RewriteBase /pdf
RewriteRule ^(.*)\.pdf$ serve_pdf.php

In /pdf you store some PDF documents and serve_pdf.php:


<?php
// serve_pdf.php - serves PDF documents with REP header tags (X-Robots-Tags)
$requestUri = $_SERVER['REQUEST_URI'];

if (stristr($requestUri, 'my.pdf')) {
    // Allow indexing, but prevent a cached copy and text snippets on the SERPs
    header('X-Robots-Tag: index, noarchive, nosnippet', TRUE);
    header('Content-type: application/pdf', TRUE);
    readfile('my.pdf');
    exit;
}
?>


This setup routes all requests for *.pdf files to /pdf/serve_pdf.php, which outputs something like this header when a user agent asks for /pdf/my.pdf:

Date: Tue, 31 Jul 2007 21:41:38 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
X-Powered-By: PHP/4.4.4
X-Robots-Tag: index, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: application/pdf
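
A quick way to check the header is a HEAD request, for instance with curl (example.com stands in for your own domain):

curl -I http://example.com/pdf/my.pdf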

You can do that with all kinds of file types, for example along the lines of the sketch below. Have fun and say thanks to Google :)
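
As a generalized illustration, here is a minimal sketch of a script that handles several file types; the extension map, the $noindexFiles list and the file names are made-up examples, and the RewriteRule above would need to be extended to cover the additional extensions:

<?php
// Hypothetical sketch: map file extensions to content types and serve
// selected files with an X-Robots-Tag header. Simplified on purpose,
// e.g. it ignores query strings and assumes files live next to this script.
$contentTypes = array(
    'pdf' => 'application/pdf',
    'doc' => 'application/msword',
    'jpg' => 'image/jpeg',
);
// Made-up list of files that shouldn't be archived or snippeted
$noindexFiles = array('internal-report.doc', 'old-brochure.pdf');

$file = basename($_SERVER['REQUEST_URI']);
$dot  = strrchr($file, '.');
$ext  = $dot ? strtolower(substr($dot, 1)) : '';

if (isset($contentTypes[$ext]) && is_file($file)) {
    if (in_array($file, $noindexFiles)) {
        header('X-Robots-Tag: noindex, noarchive, nosnippet', TRUE);
    }
    header('Content-type: ' . $contentTypes[$ext], TRUE);
    readfile($file);
    exit;
}

// Unknown or missing file: send a 404 instead of a blank page
header('HTTP/1.1 404 Not Found');
?>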




Unavailable_After is totally and utterly useless

I’ve a lot of respect for Dan Crow, but I’m struggling with my understanding, or possible support, of the unavailable_after tag. I don’t want to put my reputation for bashing such initiatives from search engines at risk, so sit back and grab your popcorn, here comes the roasting:

As a Webmaster, I did not find a single scenario where I could or even would use it. That’s because I’m a greedy traffic whore. A bazillion other Webmasters are greedy too. So how the heck is Google going to sell the newish tag to the greedy masses?

Ok, from a search engine’s perspective unavailable_after makes sound sense. Outdated pages tie up resources, annoy searchers, and in the lineup of useless crap the next bad thing after an outdated page is intentional Webspam.

So convincing the great unwashed to put that thingy on their pages inviting friends and family to granny’s birthday party on 25-Aug-2007 15:00:00 EST would improve search quality. Not that family blog owners care about new meta tags, RFC 850-ish date formats, or search engine algos rarely understanding that the announced party is history on Aug/26/2007. Besides, the party may leave painful aftermaths worth a desperate call for aspirin in the comments the day after, which would be news appearing after the page has expired. Kinda dilemma, isn’t it?

Seriously, unless CMS vendors support the new tag, tiny sites and clique blogs aren’t Google’s target audience. This initiative addresses large sites which are responsible for a huge amount of outdated contents in Google’s search index.

So what is the large-site Webmaster’s advantage in using the unavailable_after tag? A loss of search engine traffic. A loss of the link juice gained by the expired page. And so on. Losses of any kind are not that helpful when it comes to an overdue raise, or in salary negotiations. Hence the Webmaster asks for the sack when s/he implements Google’s traffic terminator.

Who cares about Google’s search quality problems when solving them leads to traffic losses? Nobody. Caring Webmasters do the right thing anyway. And they don’t need no more useless meta tags like unavailable_after. “We don’t need no stinking metas” from “Another Brick in the Wall, Part Web 2.0” expresses my thoughts perfectly.

So what separates the caring Webmaster from the ‘ruthless traffic junkie’ whom Google wants to implement the unavailable_after tag? The traffic junkie lets his stuff expire without telling Google about its state, is happy that frustrated searchers click the URL from the SERPs even years after the event, and enjoys the earnings from tons of ads placed above the content minutes after the party was over. Dear Google, you can’t convince this guy.

[It seems this is a post about repetitive “so whats”. And I came to the point before the 4th paragraph … wow, that’s new … and I’ve put a message in the title which is not even meant as link bait. Keep on reading.]

So what does the caring Webmaster do without the newish unavailable_after tag? Business as usual. Examples:

Say I run a news site where the free content goes to the subscription area after a while. I’d closely watch which search terms generate traffic, write a search engine optimized summary containing those keywords, put that on the sales pitch, and move the original article to the archives accessible to subscribers only. It’s not my fault that the engines think they point to the original article after the move. When they recrawl and reindex the page my traffic will increase, because my summary fits the searchers’ queries even better.

Say I run an auction site. Unfortunately particular auctions expire, but I’m sure that the offered products will return to my site. Hence I don’t close the page, but I search my database for similar offerings and promote them under an H3 heading like <h3>[product] (stuffed keywords) is hot</h3> and a paragraph <p>Buy [product] here:</p>, followed by a list of identical products for sale or similar auctions.

Say I run a poll expiring in two weeks. With Google’s newish near-real-time indexing that’s enough time to collect keywords from my stats, so the textual summary under the poll’s results will attract the engines as well as visitors when the poll is closed. Also, many visitors will follow the links to related or new polls.

From Google’s POV there’s nothing wrong with my examples, because the visitor gets what s/he was searching for, and I didn’t cheat. Now tell me, why should I give up these valuable sources of nicely targeted search engine traffic just to make Google happy? I’d rather make my employer happy. Dear Google, you didn’t convince me.

Update: Tanner Christensen posted a remarkable comment at Sphinn:

I’m sure there is some really great potential for the tag. It’s just none of us have a need for it right now.

Take, for example, when you buy your car without a cup holder. You didn’t think you would use it. But then, one day, you find yourself driving home with three cups of fruit punch and no cup holders. Doh!

I say we wait it out for a while before we really jump on any conclusions about the tag.

John Andrews was the first to report an evil use of unavailable_after.

Also, Dan Crow from Google announced a pretty neat thing in the same post: With the X-Robots-Tag you can now apply crawler directives valid in robots meta tags to non-HTML documents like PDF files or images.




Google nofollow’s itself

Awesome. Nofollow-insanity at its best. Check the source of Google’s Webmaster Blog. In its HEAD you’ll find an insane meta tag:
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />

Well, that’s one of many examples. Read the support forums. Another case of Google nofollow’ing herself: Google fun

Matt thought that all teams understood the syntax and semantics of rel-nofollow. It seems to me that’s not the case. I really can’t blame Googlers for applying rel-nofollow or even nofollow/noindex meta tags to everything they can get their hands on. It is not understandable. It’s not usable. It’s misleading. It’s confusing. It should get buried ASAP.

Hat tip to John (JLH’s post).

Update 1: A friendly Googler just told me that a Blogger glitch (affecting only Google blogs) inserted the crawler-unfriendly meta element; it should be solved soon. I thought this bug was fixed months ago ... if page.isPrivate == true by mistake then insert "<meta content='NOINDEX,NOFOLLOW' name='ROBOTS' />" … (made up)

Update 2: The ‘noindex,nofollow’ robots meta tag is gone now, and the Webmaster Central Blog got a neat new logo:
Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index (I’d add ALT and TITLE text: alt="Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index" title="Official news on crawling and indexing sites for the Google index")




In need of a "Web-Robot Directives Standard"

The Robots Exclusion Protocol from 1994 gets used and abused, best described by Lisa Barone citing Dan Crow from Google: “everyone uses it but everyone uses a different version of it”. De facto we have a Robots Exclusion Standard covering crawler directives in robots.txt as well as in robots meta tags, said Dan Crow. Besides non-standardized directives like “Allow:”, Google’s Sitemaps Protocol adds more inclusion to the mix, now even closely bundled with robots.txt. And there are more places to put crawler directives: unstructured (in the sense of independence from markup elements) as with Google’s section targeting, on the link level with the commonly disliked rel-nofollow microformat or XFN, and related thoughts on block-level directives.

All in all that’s a pretty confusing conglomerate of inclusion and exclusion, utilizing many formats and markup elements, and lots of places to put crawler directives. Not really the sort of norm the webmaster community can successfully work with. No wonder that over 75,000 robots.txt files have pictures in them, that less than 35 percent of servers have a robots.txt file, that the average robots.txt file is 23 characters (“User-agent: * Disallow:”), and that gazillions of Web pages carry useless and unsupported meta tags like “revisit-after” … for more funny stats and valuable information see Lisa’s robots.txt summit coverage (SES NY 2007), also covered by Tamar (read both!).

How to structure a “Web-Robot Directives Standard”?

To handle redundancies as well as cascading directives properly, we need a clear and understandable chain of command. The following is just a first idea off the top of my head, and will likely get updated soon:

  • Robots.txt

    1. Disallows directories, files/file types, and URI fragments like query string variables/values by user agent.
    2. Allows sub-directories, file names and URI fragments to refine Disallow statements.
    3. Gives general directives like crawl frequency or volume per day, maybe even per folder, and restricts crawling in particular time frames.
    4. References general XML sitemaps accessible to all user agents, and specific XML sitemaps addressing particular user agents as well.
    5. Sets site-level directives like "noodp" or "noydir".
    6. Predefines page-level instructions like "nofollow", "nosnippet" or "noarchive" by directory, document type or URL fragments.
    7. Predefines block-level respectively element-level conditions like "noindex" or "nofollow" on class names or DOM-IDs by markup element. For example "DIV.hMenu,TD#bNav 'noindex,nofollow'" could instruct crawlers to ignore the horizontal menu as well as the navigation at the very bottom on all pages.
    8. Predefines attribute-level conditions like "nofollow" on A elements. For example "A.advertising REL 'nofollow'" could tell crawlers to ignore links in ads, or "P#tos > A 'nofollow'" could instruct spiders to ignore links in TOS excerpts found on every page in a P element with the DOM-ID "tos". (A hypothetical robots.txt sketch follows this outline.)

  • XML Sitemaps

    1. Since robots.txt deals with inclusion now, why not add an optional URL-specific "action" element allowing directives like "nocache" or "nofollow"? Also, a "delete" directive to get outdated pages removed from search indexes would make sound sense.
    2. To make XML sitemap data reusable, and to allow centralized maintenance of page meta data, a couple of new optional URL elements like "title", "description", "document type", "language", "charset", "parent" and so on would be a neat addition. This way it would be possible to visualize XML sitemaps as native (and even hierarchical) site maps.

    Robots.txt exclusions overrule URLs listed for inclusion in XML sitemaps.

  • Meta Tags

    Page meta data overrule directives and information provided in robots.txt and XML sitemaps. Empty contents in meta tags suppress directives and values given in upper levels. Non-existent meta tags implicitly apply data and instructions from upper levels. The same goes for everything below.

  • Body Sections

    Unstructured parenthesizing of parts of code certainly is undoable with XMLish documents, but may be a pragmatic procedure to deal with legacy code. Paydirt in HTML comments may be allowed to mark payload for contextual advertising purposes, but it's hard to standardize. Let's leave that to proprietary usage.

  • Body Elements

    Implementing a new attribute for messages to machines should be avoided for several good reasons. Classes are additive, so multiple values can be specified for most elements. That would allow putting standardized directives as class names, for example class="menu robots-noindex googlebot-nofollow slurp-index-follow" where the first class addresses CSS. Such inline robot directives come with the same disadvantages as inline style assignments and open a can of worms, so to speak. Using classes and DOM-IDs just as references to user-agent-specific instructions given in robots.txt is surely the preferable procedure.

  • Element Attributes

    More or less, this level is a playground for microformats utilizing the A element's REV and REL attributes.
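
Just to make the robots.txt level of this outline more tangible, here is a purely hypothetical sketch; the Site-directive, Page-directive, Element-directive and Attribute-directive lines are invented for illustration only and not supported by any search engine (example.com is a placeholder):

User-agent: *
Disallow: /private/
Allow: /private/whitepapers/
Sitemap: http://example.com/sitemap.xml

# Hypothetical site-level defaults
Site-directive: noodp, noydir

# Hypothetical page-level defaults by directory
Page-directive: /archive/ "noarchive, nosnippet"

# Hypothetical element-level and attribute-level conditions
Element-directive: DIV.hMenu, TD#bNav "noindex, nofollow"
Attribute-directive: A.advertising REL "nofollow"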

Besides the common values “nofollow”, “noindex”, “noarchive”/”nocache” etc. and their omissible positive defaults “follow” and “index” etc., we’d need a couple more, for example “unapproved”, “untrusted”, “ignore” or “skip” and so on. There’s a lot of work to do.

In terms of complexity, a mechanism as outlined above should be as easy to use as CSS in combination with client-side scripting for visualization purposes.

However, whatever better ideas are out there, we need a widely accepted “Web-Robot Directives Standard” as soon as possible.




