Handling Google’s neat X-Robots-Tag - Sending REP header tags with PHP

It’s a bad habit to tell the bad news first, and I’m guilty of that. Yesterday I linked to Dan Crow’s post and told Google that the unavailable_after tag is useless IMHO. So today’s post is about the great thing: REP header tags, aka X-Robots-Tags, which unfortunately got mentioned only as the second news item, somewhat concealed in Google’s announcement.

The REP is not only a theatre; it also stands for Robots Exclusion Protocol (robots.txt and the robots meta tag). Everything you can shove into a robots meta tag on an HTML page can now be delivered in the HTTP header of any file type:

  • INDEX|NOINDEX - tells whether the page may be indexed or not
  • FOLLOW|NOFOLLOW - tells whether crawlers may follow the links provided on the page or not
  • ALL|NONE - ALL = INDEX, FOLLOW (the default); NONE = NOINDEX, NOFOLLOW
  • NOODP - tells search engines not to use page titles and descriptions from the ODP (Open Directory Project) on their SERPs
  • NOYDIR - tells Yahoo! Search not to use page titles and descriptions from the Yahoo! directory on its SERPs
  • NOARCHIVE - Google specific, prevents archiving (the cached page copy)
  • NOSNIPPET - prevents Google from displaying text snippets for your page on the SERPs
  • UNAVAILABLE_AFTER: RFC 850 formatted timestamp - removes the URL from Google’s search index a day after the given date/time
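To see why this matters: on an HTML page these directives go into a robots meta tag, but a PDF has no head section to put one in. The X-Robots-Tag carries the very same values in the HTTP response header instead. A sketch of the two equivalent forms (the values are just an example):

```
<!-- robots meta tag in an HTML page's head section -->
<meta name="robots" content="noindex, noarchive">

# equivalent HTTP response header, usable for any file type
X-Robots-Tag: noindex, noarchive
```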

So how can you serve X-Robots-Tags in the HTTP headers of PDF files, for example? Here is one possible procedure to explain the basics; just adapt it to your needs:

Rewrite all requests for PDF documents to a PHP script which knows which files must be served with REP header tags. You could do an external redirect too, but this may confuse things. Put this code in your root’s .htaccess:

RewriteEngine On
RewriteBase /pdf
RewriteRule ^(.*)\.pdf$ serve_pdf.php [L]

In /pdf you store some PDF documents and serve_pdf.php:


<?php
// serve_pdf.php - serves PDF files with REP header tags
$requestUri = $_SERVER['REQUEST_URI'];

if (stristr($requestUri, 'my.pdf')) {
    header('X-Robots-Tag: index, noarchive, nosnippet', TRUE);
    header('Content-Type: application/pdf', TRUE);
    readfile('my.pdf');
    exit;
}

// Unknown file: send a 404 rather than a blank page
header('HTTP/1.1 404 Not Found', TRUE, 404);


This setup routes all requests for *.pdf files to /pdf/serve_pdf.php, which outputs something like this header when a user agent asks for /pdf/my.pdf:

Date: Tue, 31 Jul 2007 21:41:38 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
X-Powered-By: PHP/4.4.4
X-Robots-Tag: index, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: application/pdf
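The UNAVAILABLE_AFTER directive from the list above wants an RFC 850 formatted timestamp, which PHP can produce with its built-in DATE_RFC850 format constant. A small sketch that could be dropped into serve_pdf.php (the helper name is mine, not a standard API):

```php
<?php
// Build the header value for Google's unavailable_after directive.
// RFC 850 dates look like "Sunday, 06-Nov-94 08:49:37 GMT";
// gmdate() with PHP's DATE_RFC850 constant produces exactly that.
function unavailable_after_header($timestamp) {
    return 'X-Robots-Tag: unavailable_after: ' . gmdate(DATE_RFC850, $timestamp);
}

// Example: tell Google to drop the URL from its index after New Year's Eve 2007.
$expires = gmmktime(23, 59, 59, 12, 31, 2007);
header(unavailable_after_header($expires), TRUE);
```

In the CLI the header() call is a no-op; on a web server it adds the tag to the response alongside the Content-Type header shown above.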

You can do that with all kinds of file types. Have fun and say thanks to Google :)
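If you don’t need per-file logic in PHP, Apache’s mod_headers can attach the tag directly; a minimal .htaccess sketch, assuming mod_headers is loaded (this is the same Header set approach that comes up in the comments below):

```
# Requires mod_headers; applies to every PDF below this directory
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>
```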




12 Comments to "Handling Google's neat X-Robots-Tag - Sending REP header tags with PHP"

  1. dockarl on 1 August, 2007  #link

    Sebastian - Great post.

    I didn’t have any idea about that - plan to write a post referencing you.

    Cheers,

    Matt

  2. Sebastian on 2 August, 2007  #link

    Thanks Matt :)
    Here is another good link from Hamlet Batista:
    serving X-Robots-Tags with SetEnvIf and Header add in .htaccess - neat :)

  3. Hamlet Batista on 2 August, 2007  #link

    Sebastian - That is very clever. Good job!

    I thought about doing something similar, but decided the .htaccess solution might be easier for non-programmers.

    I thought about using ScriptAliasMatch to map files to a cgi script that would add the header and using PATH_INFO to find the file on disk.

  4. […] malware, etcetera. However, under some circumstances it would make sound sense to have a NOPREVIEW X-Robots-Tag, but unfortunately Google forgot to introduce it […]

  5. […] Dan Crow from Google announced a pretty neat thing in the same post: With the X-Robots-Tag you can now apply crawler directives valid in robots meta tags to non-HTML documents like PDF files […]

  6. Noobliminal on 13 September, 2007  #link

    I have figured an ‘alternative’ use for this new http header tag. Might be worth a check ;)

  7. Sebastian on 13 September, 2007  #link

    Noobliminal, your method works. To protect yourself from such cheats by sneaky link mongers there’s only one route: don’t do link exchanges. ;)

  8. Noobliminal on 13 September, 2007  #link

    I’m not .. or .. am I? %\
    Regards.

  9. Alphane Moon on 19 December, 2007  #link

    Hi Sebastian,
    thank you for this article on the X-Robots-Tag, it is very useful and was precisely what I was looking for.

  10. […] be useful as wellGetting URLs outta Google – the good, the popular, and the definitive way Handling Google’s neat X-Robots-Tag – Sending REP header tags with PHPNasty Bots & UsersA lot of security relies on identifying nasty bots, detecting rogue activity […]

  11. Houston SEO on 16 December, 2009  #link

    Thanks for this interesting article, Just discovered this.. can be quite vicious

    [Are you aware that links dropped by ‘Houston SEO’ get removed across the boards, in case the comments submitted by you guys aren’t caught by spam filters beforehand?]

  12. Jeff on 5 February, 2010  #link

    This works really simply, and it keeps other headers (like Content-Length) that Apache applies to PDFs:

    Header set X-Robots-Tag "index, noarchive"

    You can put it in the server config, a virtual host section, or an .htaccess file, for fine control of which PDFs it applies to (and/or adjust the regex).

    Jeff
